Static timing analysis is a critical step in the design of any digital integrated circuit. Technology and design trends have led to a significant increase in environmental and process variations, which need to be incorporated in static timing analysis. This paper presents a new, efficient and accurate block-based static timing analysis technique considering uncertainty. This new method is more efficient as it models arrival times as cumulative distribution functions (CDFs) and delays as probability density functions (PDFs).
Introduction
Static timing analysis (STA) is critical to the measurement and optimization of the circuit performance before its manufacture. Full chip static timing analysis is usually performed using efficient block-based techniques. A block-based approach allows incremental, embedded static timing analysis and therefore enables timing-driven flows in logic synthesis and physical design. Hence, block-based static timing analysis has emerged as one of the key technologies in current design methodologies.
The timing or performance of the chip is heavily dependent on the manufacturing process variations (e.g. Vt, length, etc.) and design environment variations (e.g. VDD and temperature variations, noise impact on timing, etc.). As the feature sizes decrease, the ability to control the manufacturing spread or accuracy of a given feature size is also decreasing. Along with increased process variations, the uncertainty caused by design is also increasing. The increase of uncertainty in design is caused by the increase of power supply and temperature variations and by interconnect loading uncertainty such as coupling noise impact on timing. Another source of uncertainty is the inherent error in the gate delay models, also called the model-to-hardware correlation error. It is critical that these increased timing uncertainties be handled in the design process in an efficient and accurate manner. Given the pervasive nature of static timing, it is essential that a variation-aware static timing approach be suitable for full chip designs.
Chandramouli Kashyap
IBM Microelectronics
Austin, TX 78758
vchandra@us.ibm.com

Design variations or uncertainty in static timing analysis are typically handled in two broad ways. The first set of techniques handles variations by worst-casing the circuit response. In such a scenario, static timing is performed at various design corners (e.g. fast, slow and nominal design corners). For example, the fast corner is computed by placing all the gates (or transistors) at the fast corner and performing a regular deterministic timing analysis. The timing results of the fast, slow and nominal corners can also be combined to minimize the typically large error of worst-case analysis. This approach is computationally attractive but can be inaccurate due to its worst-case nature. The worst-case approach has traditionally been used for industrial designs but it is becoming inapplicable as the timing variations continue to increase. Furthermore, to account for intra-chip or local variations, these techniques scale the data and clock path delays differently using empirical factors.

In [11], the delays of the gates and arrival times are modeled as independent discrete random variables. Reconvergent fanouts are not considered. False path analysis using this basic framework is considered in [12]. A new approach described in [16] proposes a technique which computes both upper and lower bounds to the exact solution in the presence of reconvergent fanouts. Further, the authors show that statistical STA performed without accounting for reconvergent fanouts is an upper bound on the actual delay. However, their method of enumerating selected nodes to obtain improved bounds may be cumbersome for large circuits, and has exponential runtime in the worst case. A further drawback of these block-based approaches is that they model both gate delays and arrival times as discrete probability density functions or PDFs. This involves propagating impulse trains across the circuit, so that taking the statistical maximum of two arrival times, a fundamental operation in STA, becomes inefficient.
As we show in this paper, modeling the arrival times as cumulative distribution functions or CDFs is more efficient.
In contrast to the above block-based methods, a path-based approach has been proposed in [13]. Each

Parts of the circuit (or gates) can be modeled as deterministic. The deterministic part of the circuit is modeled as step CDFs (or impulse PDFs). Piecewise linear CDFs imply piecewise constant PDFs for the gate delays.
Dependency in statistical timing analysis can come from two types of sources. The first source is reconvergent fanout due to the circuit topology. The second source of dependency is the manufacturing process parameters. The gates of the circuit which depend on the same or similar process parameters cause correlation in delay and arrival times. This paper addresses the dependency caused by reconvergent fanout, which is a necessary first step in a statistical STA framework. Reconvergent fanouts are efficiently handled by a novel common mode removal approach using the idea of a statistical "subtraction" as opposed to the expensive path tracing commonly used in the literature.
For simplicity, the discussion assumes a late-mode STA; however, the proposed method can be easily applied to early-mode STA as well. The remaining part of the paper is organized as follows. Section 2 describes the basics of the new statistical timing analysis approach. Techniques to account for reconvergent fanout are presented in Section 3. Section 4 presents results for various ISCAS benchmark circuits, followed by conclusions and future work in Section 5.
Statistical Timing Analysis
The problem in deterministic static timing analysis is to compute arrival times at the output nodes. From these, the slack and hence the critical path of the circuit are determined. Arrival times at the inputs and delays of the gates are specified as deterministic numbers. In the case of statistical timing analysis, the arrival times and delays of the gates are specified as distributions. In general, the distribution of delays of the gates can take any form (e.g. normal, uniform, etc.). The problem in statistical timing analysis is to compute the distribution of arrival times at the intermediate nodes and the output nodes. Given the required arrival time and the distribution of output arrival times, critical paths and slack distributions can be computed for a given probability or confidence level.
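As an illustration of reading a slack off an arrival time distribution at a chosen confidence level, the following is a minimal Python sketch. It assumes the output arrival time CDF is stored as piecewise linear breakpoints with strictly increasing probabilities; the function names and the example numbers are hypothetical and do not come from the paper.

```python
from bisect import bisect_left

def arrival_at_confidence(times, probs, p):
    """Invert a piecewise linear CDF given by breakpoints (times[k], probs[k]).

    Assumes probs is strictly increasing from 0.0 to 1.0.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("confidence level must lie in [0, 1]")
    if p <= probs[0]:
        return times[0]
    if p >= probs[-1]:
        return times[-1]
    k = bisect_left(probs, p)          # first breakpoint with probs[k] >= p
    t0, t1 = times[k - 1], times[k]
    p0, p1 = probs[k - 1], probs[k]
    # Linear interpolation inside the segment that contains p.
    return t0 + (t1 - t0) * (p - p0) / (p1 - p0)

def slack_at_confidence(required_time, times, probs, p):
    """Late-mode slack: required time minus the arrival time met with probability p."""
    return required_time - arrival_at_confidence(times, probs, p)

# Hypothetical 3-point CDF of an output arrival time (arbitrary time units).
times = [100.0, 120.0, 150.0]
probs = [0.0, 0.5, 1.0]
slack_90 = slack_at_confidence(140.0, times, probs, 0.9)
```

A negative value of `slack_90` would mean that the required time of 140 is not met at the 90% confidence level for this hypothetical distribution.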
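Timing analysis is performed by levelizing the circuit. The arrival time at the input is propagated through the gates at each level until it reaches the output. Propagating the arrival times through a gate is a key function in static timing. In deterministic static timing analysis, for a gate with inputs i and j, the arrival time $A_o$ at output node o is given by:

$$A_o = \max(A_i + D_{io},\; A_j + D_{jo}) \tag{1}$$

where $D_{io}$ and $D_{jo}$ are the delays from inputs i and j to the output o.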
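For reference, the deterministic propagation described above can be sketched in Python as a levelized (topological) traversal that applies the max-plus update at each node. This is a minimal sketch under assumptions: the edge-list representation, the function name and the example circuit are hypothetical, and nodes without a specified arrival time default to 0.

```python
from collections import defaultdict, deque

def arrival_times(edges, input_arrivals):
    """Late-mode arrival times on a timing graph.

    edges: list of (src, dst, delay) tuples (assumed acyclic).
    input_arrivals: dict mapping primary inputs to their arrival times.
    """
    fanout = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set(input_arrivals)
    for src, dst, delay in edges:
        fanout[src].append((dst, delay))
        indegree[dst] += 1
        nodes.update((src, dst))

    arrival = dict(input_arrivals)
    ready = deque(n for n in nodes if indegree[n] == 0)
    while ready:
        node = ready.popleft()
        at = arrival.setdefault(node, 0.0)   # assumption: unspecified sources arrive at 0
        for dst, delay in fanout[node]:
            candidate = at + delay           # the "add" of the max-plus update
            if dst not in arrival or candidate > arrival[dst]:
                arrival[dst] = candidate     # the "max" of the max-plus update
            indegree[dst] -= 1
            if indegree[dst] == 0:           # all fan-ins processed
                ready.append(dst)
    return arrival

# Hypothetical example: gate i driven by a and b, gate o driven by i and b.
edges = [("a", "i", 2.0), ("b", "i", 1.0), ("i", "o", 3.0), ("b", "o", 5.0)]
result = arrival_times(edges, {"a": 0.0, "b": 2.0})
```

Each edge is visited exactly once, which matches the linear cost usually quoted for block-based deterministic timing.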
Computation of max and addition is straightforward in regular timing analysis. We now define these operations in statistical timing analysis. In the proposed approach, arrival times are modeled as cumulative distribution functions (CDFs) and the delays are modeled as probability density functions (PDFs). At any node i in the timing graph, $C_i(t)$ denotes its CDF. For the delay between node i and node j, $P_{ij}(t)$ is defined as the PDF and $C_{ij}(t)$ is defined as the CDF. From the definition of CDF and PDF, $P_{ij}(t)$ is the derivative of $C_{ij}(t)$. Using this notation, the max and addition operations are defined next. For now, these expressions assume that the variables (arrival times, delays) are independent.
Techniques to handle interdependence due to reconvergent fanouts will be presented later in Section 3.
Addition Operation:
Let $D_{ij}$ be the delay between node i and node j.

Statistical timing analysis can be added to a regular static timing engine by replacing the fundamental max and add operation in Equation (1) by the one in Equation (8). Circuit parsing and setup, timing graph construction, graph traversal and incremental capabilities of regular timing can be used as is in statistical timing analysis. Techniques described in this section can be used for any model or form (e.g. normal, uniform, measured, etc.) of the variations (or the CDF). Since we use piecewise linear models for the CDF, in the next section we show how the addition and multiplication are performed in this framework.
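To make the two operations concrete, here is a minimal Python sketch using the standard identities for independent variables: the CDF of a sum is the convolution of a CDF with a PDF, and the CDF of a max is the product of CDFs. It works on a uniform time grid with a simple Riemann-sum convolution rather than the paper's piecewise linear model; the function names, the grid convention and the sample values are assumptions made for illustration.

```python
def add_delay(arrival_cdf, delay_pdf, dt):
    """CDF of (arrival + delay) for independent variables on a uniform grid.

    arrival_cdf[j] is the CDF at t_a0 + j*dt, delay_pdf[k] the PDF at t_d0 + k*dt.
    The result is sampled at t_a0 + t_d0 + m*dt.
    Assumes the arrival CDF is 0 before its grid and 1 after it.
    """
    out = []
    for m in range(len(arrival_cdf) + len(delay_pdf) - 1):
        total = 0.0
        for k, p in enumerate(delay_pdf):
            j = m - k
            if j < 0:
                break                      # CDF is zero before the grid starts
            c = arrival_cdf[j] if j < len(arrival_cdf) else 1.0
            total += c * p * dt            # Riemann sum of C(t - s) P(s) ds
        out.append(min(total, 1.0))
    return out

def stat_max(cdf_a, cdf_b):
    """CDF of max(A, B) for independent A and B sampled on the same grid."""
    if len(cdf_a) != len(cdf_b):
        raise ValueError("CDFs must be sampled on the same grid")
    return [a * b for a, b in zip(cdf_a, cdf_b)]

def gate_output_cdf(cdf_i, pdf_io, cdf_j, pdf_jo, dt):
    """Two-input gate: add each input's delay, then take the statistical max."""
    return stat_max(add_delay(cdf_i, pdf_io, dt), add_delay(cdf_j, pdf_jo, dt))

# Hypothetical samples: dt = 1, delay PDF spread uniformly over two grid cells.
shifted = add_delay([0.0, 0.5, 1.0], [0.5, 0.5], 1.0)
out_cdf = gate_output_cdf([0.0, 0.5, 1.0], [0.5, 0.5],
                          [0.0, 0.5, 1.0], [0.5, 0.5], 1.0)
```

The product form is only valid when the two inputs are independent, which is why reconvergent fanout needs separate treatment.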
Max and Addition with PWL modeling
As mentioned previously, convolution is performed to add the input arrival time, which is modeled as a piecewise linear CDF, to the gate delay, which is modeled as a piecewise constant PDF. For each pair of a CDF piece and a PDF piece, the result of convolution is a quadratic. For an n-piece CDF and an n-piece PDF, a total of $n^2$ convolutions are performed. The $n^2$ quadratics are then summed together to obtain the resultant CDF. Finally, for forward propagation the quadratic CDF is converted back to a piecewise linear CDF by sampling at the preset probability values. No nonlinear iterations are required for the sampling since closed-form formulas for the crossing times can be obtained.

The max operation multiplies the two CDFs: $n^2$ quadratic pieces are produced due to multiplication. These are then summed up to get the resultant CDF. As before, the CDF is sampled at the preset probability values to get the piecewise linear CDF for forward propagation.
Time Complexity of the Proposed Method
It is well known that block-based deterministic timing analysis can be performed in $O(E + V)$ time, where E and V are the number of edges and vertices in the timing graph respectively. In the statistical case, convolution is performed for each edge in the timing graph and the multiplication is performed at each vertex for all the incident edges in the timing graph. Since each of these operations takes $O(n^2)$ time for an n-piece CDF/PDF model, the overall complexity of our approach is $O(n^2 E + V)$.
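The resampling of a piecewise quadratic CDF at preset probability values can be sketched as follows. This Python sketch assumes the CDF is stored as a list of pieces `(t0, t1, a, b, c)` with value `a*t**2 + b*t + c` on `[t0, t1]`; that representation, the function names and the example (the sum of two uniform variables on [0, 1]) are illustrative assumptions. Each crossing time comes from the quadratic formula, so no iteration is needed.

```python
import math

def crossing_time(pieces, p):
    """Time at which a monotone piecewise quadratic CDF reaches probability p."""
    for t0, t1, a, b, c in pieces:
        lo = a * t0 * t0 + b * t0 + c
        hi = a * t1 * t1 + b * t1 + c
        if lo <= p <= hi:
            if abs(a) < 1e-15:                      # piece is linear (or flat)
                return t0 if b == 0 else (p - c) / b
            disc = b * b - 4.0 * a * (c - p)
            root = math.sqrt(max(disc, 0.0))
            for t in ((-b + root) / (2.0 * a), (-b - root) / (2.0 * a)):
                if t0 - 1e-9 <= t <= t1 + 1e-9:     # keep the root inside the piece
                    return min(max(t, t0), t1)
    raise ValueError("probability not reached by the CDF")

def resample(pieces, preset_probs):
    """Piecewise linear CDF breakpoints (time, probability) at preset probabilities."""
    return [(crossing_time(pieces, p), p) for p in preset_probs]

# CDF of the sum of two independent U(0, 1) variables:
# t^2/2 on [0, 1] and -t^2/2 + 2t - 1 on [1, 2].
pieces = [(0.0, 1.0, 0.5, 0.0, 0.0), (1.0, 2.0, -0.5, 2.0, -1.0)]
pwl = resample(pieces, [0.0, 0.125, 0.5, 0.875, 1.0])
```

The choice of preset probabilities controls how many segments the forward-propagated piecewise linear CDF has.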
Handling Reconvergent Fanouts
The complexity of statistical timing analysis usually increases due to reconvergent fanouts. In this section, a new technique is presented to capture reconvergent fanouts in the proposed statistical timing framework. We illustrate the basic principle behind our approach through the circuit in Figure 8. In this example, two paths originating from node r reconverge as inputs to the same gate at nodes i and j. This causes both arrival times $A_i$ and $A_j$ to depend on arrival time $A_r$. This dependency potentially complicates the computation of arrival time $A_o$ since Equation (5) can no longer be used.
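It should be noted that the interdependence of arrival times $A_i$ and $A_j$ has a very specific linear form. That is

$$A_i = A_r + D_1, \qquad A_j = A_r + D_2$$

where $D_1$ is the delay of path 1 from r to i and $D_2$ is the delay of path 2 from r to j. The variable of interest, the arrival time at node o, $A_o$, is given by

$$A_o = \max(A_i + D_{io},\; A_j + D_{jo}) \tag{11}$$

where $D_{io}$ and $D_{jo}$ are the gate delays from inputs i and j to the output o. The computation in Equation (11) can be exactly rewritten as

$$A_o = A_r + \max(D_1 + D_{io},\; D_2 + D_{jo}) \tag{12}$$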
The expression is simplified by taking out the common mode $A_r$. Since $D_1$, $D_2$, $D_{io}$ and $D_{jo}$ are independent, the simple expressions derived in Section 2 can be used. In other words, the CDF at node o can simply be rewritten as

$$C_o = C_r \otimes \left[ (C_1 \otimes P_{io})\,(C_2 \otimes P_{jo}) \right]' \tag{13}$$

where $C_r$ is the CDF of the arrival time at node r, $C_1$ is the CDF of delay $D_1$, $C_2$ is the CDF of delay $D_2$, $P_{io}$ and $P_{jo}$ are the PDFs of the gate delays $D_{io}$ and $D_{jo}$, and $\otimes$ is the convolution operator defined in Equation (3). Note that the expression within the square brackets is a CDF. Therefore, the dash at the end denotes the derivative (which yields the PDF) of the expression within the square brackets. This is required so that convolution with $C_r$ produces the right CDF.
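The effect of taking out the common mode can be checked with a small Monte Carlo experiment. The Python sketch below is not the CDF-based computation itself; it only verifies, sample by sample, that the arrival time at o equals the common arrival time at r plus a max of independent delay sums, and contrasts this with treating the two inputs as independent. All distributions, names and parameters are hypothetical.

```python
import random

def simulate(n_samples=20000, seed=0):
    """Compare dependent and (incorrectly) independent treatment of two
    reconvergent paths that share the arrival time at node r."""
    rng = random.Random(seed)
    dependent, independent = [], []
    max_gap = 0.0
    for _ in range(n_samples):
        a_r = rng.uniform(0.0, 10.0)                        # shared arrival time at r
        d1, d2 = rng.uniform(1.0, 2.0), rng.uniform(1.0, 2.0)      # path delays r->i, r->j
        d_io, d_jo = rng.uniform(1.0, 2.0), rng.uniform(1.0, 2.0)  # gate delays to o

        direct = max(a_r + d1 + d_io, a_r + d2 + d_jo)      # max over both inputs
        factored = a_r + max(d1 + d_io, d2 + d_jo)          # common mode taken out
        max_gap = max(max_gap, abs(direct - factored))
        dependent.append(factored)

        # Ignoring reconvergence: each input sees its own copy of the arrival at r.
        a_r_copy = rng.uniform(0.0, 10.0)
        independent.append(max(a_r + d1 + d_io, a_r_copy + d2 + d_jo))

    mean = lambda xs: sum(xs) / len(xs)
    return max_gap, mean(dependent), mean(independent)

gap, mean_dep, mean_indep = simulate()
```

In this setup `gap` stays at rounding-error level, while the mean under the independence assumption comes out larger than the dependent mean, consistent with ignoring reconvergence being pessimistic for late-mode analysis.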
To compute $A_o$, the computation of $D_1$ and $D_2$ is required. One way is to perform path tracing to get these values. However, it is simpler and more desirable to compute $D_1$ and $D_2$ through statistical subtraction, i.e.

$$D_1 = A_i - A_r, \qquad D_2 = A_j - A_r$$
The statistical subtraction is equivalent to the inverse of convolution, and one way to do this is by moment matching. Consider the case

$$A_z = A_x - A_y \tag{16}$$

where $A_x$ and $A_y$ are known and $A_z$ is the unknown. This can be rewritten as

$$A_x = A_z + A_y$$

Since z and y are independent, their means and variances add in a convolution. That is,

$$\mu_x = \mu_z + \mu_y, \qquad \sigma_x^2 = \sigma_z^2 + \sigma_y^2$$

Since $P_x$, $C_x$, $P_y$ and $C_y$ are known, their means and variances can be computed from their PDFs (or their piecewise linear CDFs). Hence, the required mean and variance of $C_z$ can be computed by algebraic subtraction, or

$$\mu_z = \mu_x - \mu_y$$

$$\sigma_z^2 = \sigma_x^2 - \sigma_y^2 \tag{22}$$
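A minimal Python sketch of this moment-based subtraction is given below. It assumes each CDF is piecewise linear (so the PDF is constant on each segment) and, as an illustrative choice that the text does not prescribe, fits the resulting mean and variance to a uniform distribution. The function names and example values are hypothetical.

```python
import math

def pwl_moments(times, probs):
    """Mean and variance of a distribution with a piecewise linear CDF."""
    mean = 0.0
    second = 0.0
    for k in range(len(times) - 1):
        t0, t1 = times[k], times[k + 1]
        w = probs[k + 1] - probs[k]                # probability mass of the segment
        mean += w * (t0 + t1) / 2.0                # mass is uniform on [t0, t1]
        second += w * (t0 * t0 + t0 * t1 + t1 * t1) / 3.0
    return mean, second - mean * mean

def subtract_by_moments(cdf_x, cdf_y):
    """Mean and variance of z, where x = z + y with z and y independent."""
    mu_x, var_x = pwl_moments(*cdf_x)
    mu_y, var_y = pwl_moments(*cdf_y)
    var_z = var_x - var_y
    if var_z < 0.0:
        raise ValueError("variance of y exceeds variance of x")
    return mu_x - mu_y, var_z

def fit_uniform_cdf(mu, var):
    """Two-point piecewise linear CDF of a uniform distribution with given moments."""
    half_width = math.sqrt(3.0 * var)              # variance of U(a, b) is (b - a)^2 / 12
    return [mu - half_width, mu + half_width], [0.0, 1.0]

# Hypothetical inputs: A_x uniform on [3, 9], A_y uniform on [1, 3].
mu_z, var_z = subtract_by_moments(([3.0, 9.0], [0.0, 1.0]), ([1.0, 3.0], [0.0, 1.0]))
z_times, z_probs = fit_uniform_cdf(mu_z, var_z)
```

Matching only two moments fixes the location and spread of the fitted CDF but not its shape, so the choice of distribution family remains a modeling decision.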
Once the mean $\mu_z$ and variance $\sigma_z^2$ are computed, the CDF $C_z$ can be determined by fitting the mean and variance to a probability distribution. Two moments (mean and variance) are matched to determine the distribution. This method can also be extended by matching higher order moments and performing Padé approximation to determine the CDF.

Figure 9: A general case of reconvergent fanout.

In general, an input of a gate may depend on more than one previous node. For example, in Figure 9, the inputs of gate 5 depend on A, B, C and D. Some of these vertices may also share subpaths when reaching the inputs of gate 5. Therefore, when computing the arrival time at the output of gate 5, this dependency must be accounted for. To accomplish this, we maintain a Dependency List (DL) with each vertex in the timing graph which lists the vertices on which the arrival time of the current vertex depends.
The vertices are sorted by level in descending order, i.e. the most recent vertex (the one with the highest level) appears first and so on. The DL is propagated as we compute the statistical arrival times using the DLPropagate algorithm shown in Figure 10.
In the algorithm, $DL_i$ denotes the DL of the i-th input and $DL_o$ denotes the DL of the output node. If an input does not contribute to the output of the gate (for example, it may arrive well before any other input arrives), its DL is not propagated to the output. In addition, we use two other pruning heuristics to limit the size of the DL. We allow the user to specify the size of the list. In addition, we only carry forward the n most recent (i.e. highest-level) vertices, where n is again set by the user. Thus, we know all the prior vertices that impact the arrival time of a given vertex without any path tracing.
Algorithm DLPropagate
    DL_o = NULL
    if gate output has fanout > 1
        add output node to DL_o
    for each input i of gate contributing to output
        for each node v in DL_i and not in DL_o
            add v in DL_o using insertion sort in descending order by level
    return DL_o

Now, suppose we wish to compute the arrival time at the output of a multi-input gate, with each input having possibly a non-empty dependency list. An approximate algorithm is given in Figure 11. The key idea is to reduce the dependency of each input to a single vertex so that Equation (13) can be applied. Once a dependent max is computed at the output of a gate, the dependency lists of the inputs are not propagated forward.
For example, referring to Figure 9, the output of gate 5 depends on A, B, C and D. However, depMax will identify that inputs 1 and 2 depend on B (since B is at a higher level than A) and that inputs 3 and 4 depend on C. Thus the arrival times due to inputs 1 and 2 (as well as 3 and 4) will be computed using Equation (13), and the two resultant arrival times will be treated as independent and combined using Equation (5) to get the output arrival time.
The dependency lists of the inputs to gate 5 will not be carried forward.
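The dependency list propagation described above can be sketched in Python as follows. This is a minimal sketch under assumptions: dependency lists are plain Python lists kept in descending level order, `level` is a hypothetical dictionary mapping each vertex to its level, and `max_size` stands in for the user-specified list size. Deciding which inputs contribute to the output is left to the caller.

```python
def dl_propagate(output_node, output_fanout, input_dls, level, max_size=None):
    """Build the dependency list (DL) of a gate output.

    input_dls: DLs of the inputs that contribute to the output, each sorted
               by descending level.
    level:     dict mapping every vertex to its level in the timing graph.
    max_size:  optional cap on the list length (keeps the highest levels).
    """
    dl_out = []
    if output_fanout > 1:
        # Only a vertex with multiple fanouts can start a reconvergent path.
        dl_out.append(output_node)
    for dl_in in input_dls:
        for v in dl_in:
            if v in dl_out:
                continue
            # Insertion sort step: keep dl_out in descending order by level.
            pos = 0
            while pos < len(dl_out) and level[dl_out[pos]] >= level[v]:
                pos += 1
            dl_out.insert(pos, v)
    if max_size is not None:
        dl_out = dl_out[:max_size]
    return dl_out

# Hypothetical example: gate g (level 3) whose two inputs depend on B, C and A.
level = {"A": 1, "B": 2, "C": 2, "g": 3}
full = dl_propagate("g", 2, [["B", "A"], ["C", "A"]], level)
pruned = dl_propagate("g", 2, [["B", "A"], ["C", "A"]], level, max_size=2)
single = dl_propagate("g", 1, [["B", "A"], ["C", "A"]], level)
```

Because the lists stay sorted, the highest-level common vertex of two inputs can be found by a single scan over their lists rather than by tracing paths.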
Results
The proposed block-based static timing analysis approach in the presence of uncertainty has been implemented and its results are presented for various ISCAS benchmark circuits. The ISCAS circuits have been mapped using a commercial logic synthesis system to a recent library consisting of gates with a maximum of four fanins. The cell library consists of the following gates: inverter, 2-input nand, 2-input nor, 2-

The accuracy of the proposed method is further illustrated in Table 2. This table shows the results obtained by Monte Carlo and the proposed method for the 1% point in the CDF. The error compared to exact Monte Carlo is small and varies from 0.09% to 2.42%.
The proposed method can be run with different modeling complexity of the piecewise linear (PWL) CDF. All the CDFs in each circuit are modeled as a 3-point PWL model, a 5-point PWL model and a 7-point PWL model. Accuracy comparisons of these three modeling levels are shown in Table 3. As expected, the accuracy of the proposed technique decreases with a reduction in the number of segments in the PWL approximation. However, even the 3-point PWL model gives good accuracy. The computational cost of different CDF PWL models is illustrated in Table 4. Performance is shown as the ratio to the 3-point PWL statistical timing run. For example, in circuit C880, the 5-point PWL model takes 2.5 times longer to run than the 3-point PWL model and the 7-point PWL model takes 4.5 times longer to run than the 3-point PWL model.
Accuracy of statistical timing with and without reconvergent fanout handling is shown in Figure 12 and Figure 13. The figures show the CDF of the arrival time of the most critical output as computed by Monte Carlo, the proposed method without reconvergence handling and the proposed method with reconvergence handling. The proposed method with reconvergence handling compares very well to Monte Carlo, validating our dominant common mode algorithm of Section 3. Performance impact of reconvergence handling in the proposed method is shown in 
