I. INTRODUCTION
As CMOS process technologies scale down towards nanometer regimes, the accuracy and efficiency of static timing analysis (STA) have become increasingly important to successful timing closure in an integrated circuit design flow. Most STA tools break the analysis into two parts: 1) gate delay calculation and 2) interconnect or wire delay calculation. It is widely accepted that computing gate delays using a lumped total capacitance and computing the wire delay using the Elmore model are grossly inadequate for the wire-dominated designs of today. To address this drawback, various model order reduction techniques such as AWE [2], PRIMA [9], etc., have been proposed to accurately model the interconnect delay. On the other hand, the load, as seen by the driving gate, is modeled by a reduced-order model such as the π-model [5]. Therefore, an "effective capacitance" technique was proposed [6], which provides a way to map the π-load to an equivalent capacitance (in the sense of gate propagation delay).
While these approaches exhibit good accuracy and are used for sign-off, they can be too computationally expensive to be used in the context of physical design optimization. Recognizing this shortcoming, there has been much research on deriving closed-form formulas or delay metrics for wire delay estimation [3], [4], [8], [12]. However, these delay metrics introduce substantial error into the STA results and are not reliable enough to be used for optimization. On the other hand, not much attention has been paid to speeding up the gate delay calculation, which, as we show next, accounts for a significant portion of the overall STA runtime.
We measured the time spent in various parts of a commercial sign-off STA tool on many designs, including two 90-nm technology microprocessor designs, which we call Design#1 and Design#2 (Table I). In particular, we recorded the time the tool spends on the "gate timing analysis," on the "interconnect timing analysis," and on the whole STA run. We find that, on average, about 60% of the CPU time of STA is spent on the gate timing analysis.
TABLE I. 90-nm MICROPROCESSOR DESIGN SPECIFICATIONS
TABLE II. DESIGN #2 FOM RESULTS
To assess accuracy, a figure of merit (FOM) metric is used to measure how poor the distribution of negative slacks at the end points of the design is. The "FOM integral" is the sum of the negative slacks over all end points in the design. This metric is chosen because it captures how many paths are timing critical and need to be fixed. In comparison, the worst slack gives a single number, the worst negative slack of the design that needs to be fixed. The "FOM number" is the number of negative-slack end points.
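The FOM definitions above can be illustrated with a short sketch. The function name `fom_metrics` and the sample slack values are hypothetical and chosen only for illustration; they are not taken from the designs in this paper.

```python
from typing import Iterable, Tuple


def fom_metrics(endpoint_slacks: Iterable[float]) -> Tuple[float, int, float]:
    """Return (FOM integral, FOM number, worst slack) for a list of endpoint slacks.

    FOM integral: sum of the negative endpoint slacks.
    FOM number:   count of endpoints with negative slack.
    Worst slack:  the minimum slack over all endpoints.
    """
    slacks = list(endpoint_slacks)
    negative = [s for s in slacks if s < 0.0]
    fom_integral = sum(negative)
    fom_number = len(negative)
    worst_slack = min(slacks) if slacks else 0.0
    return fom_integral, fom_number, worst_slack


# Hypothetical endpoint slacks in picoseconds.
slacks_ps = [12.0, -3.5, -0.5, 4.0, -10.0]
print(fom_metrics(slacks_ps))
```

Two designs with the same worst slack can have very different FOM integrals, which is why the integral is the more informative measure of how much timing repair remains.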
We applied different combinations of interconnect timing analysis algorithms (AWE or Elmore) and gate timing analysis algorithms (effective capacitance and lumped capacitance) on many designs, including both Design#1 and Design#2. As an example, we provide the FOM results for Design#2 in Table II. Although the Elmore metric is efficient, it can change the FOM results by orders of magnitude. In addition, using $C_{total}$ can change the FOM results by orders of magnitude with respect to the golden FOM results (i.e., using AWE for interconnect timing analysis and $C_e$ for gate timing analysis). Thus, it is important to have new interconnect and gate timing analysis algorithms that are capable of accurately and efficiently calculating interconnect and gate delay and slew along a path.
Our first contribution in this paper is a filtering technique for speeding up the interconnect timing analysis step in an STA tool while maintaining a reasonable level of accuracy. As we will see later in this work, Elmore delay-based algorithms can be accurate for some nets or interconnects, for which higher order moment-based algorithms are not needed for delay and slew calculation. For some other cases, we may need to use two moments for the interconnect delay and slew calculation, for which we propose a new efficient metric. Finally, for the remaining cases, we may need to use an AWE-based algorithm for interconnect delay and slew calculation.
Our second contribution in this paper is a filtering technique for speeding up the gate delay calculation step in an STA tool while maintaining a reasonable level of accuracy. The filtering technique resorts to a necessary condition check to determine if $C_{total}$ can be used for the gate delay and/or output slew calculations without introducing a significant inaccuracy.
The remainder of this paper is organized as follows. In Section II, we present the threshold-based filtering algorithm for fast interconnect timing analysis. In Section III, the fast gate timing analysis is presented. Section IV presents the concluding remarks.
II. FAST INTERCONNECT TIMING ANALYSIS
In this section, we focus on the interconnect timing analysis. Elmore [1] used the first moment of the impulse response transfer function and approximated the median (the desired delay) by the mean of the impulse response. It is well established that the Elmore delay metric can be off by orders of magnitude in some cases. To overcome this accuracy problem, different delay metrics that use higher moments have been proposed [3], [4], [8], [12]. These delay metrics use a fixed number of moments to find the delay and slew. However, for some nets the Elmore-based delay is accurate enough, whereas for other nets metrics based on two or more moments are required. A metric with a fixed number of moments therefore cannot give results that are both accurate and efficient across all nets.
This section presents a threshold-based filtering algorithm (TFA) for propagation delay and output slew calculation of high-speed VLSI interconnects. The TFA partitions the circuit nets into three groups based on their top-level characteristics. The first group, called low complexity nets, lends itself to accurate delay calculation with the Elmore delay. The second and third groups, called medium and high complexity nets, demand more sophisticated and time-consuming delay calculations based on the first two moments or on higher moments of the impulse response transfer function, respectively. The idea of dividing the circuit nets into different classes to minimize the computational workload of a delay calculation engine, while providing reasonable accuracy for the computed delays, is quite intuitive and straightforward. The key challenge, however, is to examine and classify the nets accurately. This is precisely what we accomplish in this section with our threshold-based filtering algorithm, as will be shown later.
The remainder of this section is organized as follows. In Section II-A, by using the circuit theory, a new analytical closed-form equation for calculating the delay and output slew of an interconnect line under step and ramp inputs is presented. Section II-B uses these analytical equations as a signature function to sort the nets into simple and complex ones. Experimental results are reported in Section II-C.
A. Analysis of the Threshold-Based Filtering Algorithm
The ratio of the voltage of the output node $V_o(s)$ to the input voltage $V_i(s)$ for a linear time-invariant (LTI) system is called the voltage transfer function $H(s)$. For an RC tree, this ratio can be written as
$$H(s) = \frac{V_o(s)}{V_i(s)} = 1 + m_1 s + m_2 s^2 + m_3 s^3 + \cdots \qquad (1)$$
where $m_i$ is called the $i$th moment of the voltage transfer function. If a unit ramp input with an $\alpha$-to-$\beta$% rise time of $T_{in}(\alpha\text{-}\beta)$ is applied to an RC segment, then the $\alpha$-to-$\beta$% output transition time can be written as [4], [7], [12]
$$T_{out}(\alpha\text{-}\beta) = \sqrt{T_{in}(\alpha\text{-}\beta)^2 + \left[RC \ln\frac{100-\alpha}{100-\beta}\right]^2}. \qquad (2)$$
Based on (2), if the ratio of the input slew to the corresponding RC value is the same for two different RC circuits, then the ratio of their output transition times to their RC values will also be the same. Considering the RC value to be an indicator of the Elmore delay of a more general RC tree, this fact implies that Input_slew/Elmore is a key characteristic for the delay calculation and, interestingly, one of the most important factors in determining the accuracy of an Elmore delay calculator. Therefore, for an RC tree, the output slew can be calculated as
$$T_{out}(\alpha\text{-}\beta) = \sqrt{T_{in}(\alpha\text{-}\beta)^2 + \mathrm{Elmore}^2 \ln^2\frac{100-\alpha}{100-\beta}} \qquad (3)$$
where $T_{out}$ and $T_{in}$ denote the transition times at the output and input nodes of the RC tree and Elmore denotes the Elmore delay.
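As a rough numerical sketch of the Elmore-based slew estimate, the following assumes that (3) has the root-sum-square form in which the step-response transition time of a single-pole system, Elmore times the logarithm of (100 - low)/(100 - high), is combined with the input slew. The function names and the sample values are hypothetical.

```python
import math


def step_transition_factor(low_pct: float, high_pct: float) -> float:
    """Time between the low% and high% crossings of 1 - exp(-t/tau), in units of tau."""
    return math.log((100.0 - low_pct) / (100.0 - high_pct))


def elmore_output_slew(input_slew: float, elmore: float,
                       low_pct: float = 10.0, high_pct: float = 90.0) -> float:
    """Root-sum-square of the input slew and the single-pole step-response slew."""
    step_slew = elmore * step_transition_factor(low_pct, high_pct)
    return math.hypot(input_slew, step_slew)


# Hypothetical values in picoseconds: the normalized output slew depends only
# on the ratio Input_slew/Elmore, as argued in the text.
a = elmore_output_slew(input_slew=40.0, elmore=10.0) / 10.0
b = elmore_output_slew(input_slew=80.0, elmore=20.0) / 20.0
print(a, b)
```

Both printed values coincide because the two cases share the same Input_slew/Elmore ratio of 4.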
For an RC tree, considering only the first-order moment in delay calculation implies that the second-order moment is the square of the first moment, which is not always true due to the shielding effect of the wires. In general, the ratio $m_2/m_1^2$ varies from a number smaller than 1 to almost 50. Therefore, we need to consider the effect of higher moments. By considering the first two moments of the impulse response transfer function, we can approximate $H(s)$ by
$$\tilde{H}(s) = \frac{1}{1 - m_1 s + (m_1^2 - m_2)\, s^2}. \qquad (4)$$
As a result, we approximate the output transition time as in (5),
where the coefficient is a function of $m_2/m_1^2$. In addition, by approximating the step response of a second-order system, we calculate the value of this coefficient in (5) as a linear function of $m_2/m_1^2$ in (6).
This linear approximation is accurate enough for the analysis and helps us to understand the sensitivity of the delay and slew calculation to the shielding effect. However, one can use higher order terms and get a more accurate value.
The values of the coefficients used in (5) and (6) are calculated and shown in Fig. 1. From Fig. 1 and (5) and (6), since one of these coefficients is multiplied by $m_2/m_1^2$, the 10%-to-90% transition time is clearly sensitive to changes in $m_2/m_1^2$. Fig. 1 also shows that the point around 70% is not as sensitive to the value of $m_2/m_1^2$ (and thereby to the shielding effect) as the 50% point or any other point is. Fig. 2 shows this scenario for different values of $m_2/m_1^2$. More precisely, if $m_2/m_1^2$ changes by 20%, the 10%-to-90% transition time changes by as much as 43%, whereas the 70% point of the output transition changes only slightly. Fig. 1 can also help us understand how much error we incur in the delay/slew analysis if we do not consider higher moments ($m_2, m_3, \ldots$) when calculating the propagation delay and slew.
Based on (4), considering only the first two moments of the impulse response transfer function is equivalent to assuming that the third moment is equal to $2m_1m_2 - m_1^3$. Interestingly, the output transition times are not as sensitive to $m_3/(2m_1m_2 - m_1^3)$ as they are to $m_2/m_1^2$. However, to have an accurate interconnect timing analyzer, when $m_3/(2m_1m_2 - m_1^3)$ becomes larger than a critical value, the AWE method needs to be used to find the delay and slew.
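The implied third moment can be checked by expanding a two-pole, zero-free approximation in powers of $s$. The sketch below assumes the moment convention $H(s) = 1 + m_1 s + m_2 s^2 + \cdots$; the denominator coefficients $b_1$ and $b_2$ are notation introduced here only for the derivation.

```latex
\begin{aligned}
\tilde{H}(s) &= \frac{1}{1 + b_1 s + b_2 s^2}
  = 1 - b_1 s + (b_1^2 - b_2)\, s^2 + (2 b_1 b_2 - b_1^3)\, s^3 + \cdots \\
m_1 &= -b_1, \qquad m_2 = b_1^2 - b_2
  \quad\Longrightarrow\quad b_1 = -m_1, \qquad b_2 = m_1^2 - m_2 \\
\tilde{m}_3 &= 2 b_1 b_2 - b_1^3
  = -2 m_1 (m_1^2 - m_2) + m_1^3
  = 2 m_1 m_2 - m_1^3
\end{aligned}
```

The same value of $\tilde{m}_3$ results if the moments are instead defined with alternating signs, so the ratio $m_3/(2m_1m_2 - m_1^3)$ does not depend on that choice of convention.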
The advantage of this methodology is that the latter scenario occurs rarely in today's high frequency digital circuits. Indeed, $m_3/(2m_1m_2 - m_1^3)$ is linearly dependent on $m_2/m_1^2$. Thus, whenever the value of $m_2/m_1^2$ exceeds a critical limit, the effect of the third moment should also be taken into account by using the AWE method. This critical limit can change according to the degree of precision needed during the path timing analysis.
Fig. 2. Step response of a second-order system for three values of $m_2/m_1^2$.
B. Filtering Algorithm
As observed earlier, the Input_slew/Elmore is an extremely important factor in determining the propagation delay and slew. When the value of Input_slew/Elmore becomes greater than a critical limit, then there is one dominant pole in the voltage transfer function, and therefore, the first moment would be sufficiently accurate for calculating the output delay and transition time. It can be observed that the Elmore delay and Elmore slew errors are functions of the Input_slew/Elmore. If the Input_slew/Elmore is greater than the critical threshold, the Elmore delay error is quite negligible. However, when Input_slew/Elmore is less than this threshold, the Elmore delay may result in a considerable error. The proposed filtering algorithm makes use of this behavior to classify the stage delays based on the critical value of Input_slew/Elmore. The parameters used in the filtering algorithm are defined as follows.
Elmore threshold value: When the first moment of the voltage transfer function is less than this threshold, the estimation errors of the slew and delay (which are calculated based on the Elmore metric) can be tolerated because the critical path delays are not sensitive to these estimation errors.
Dominant-pole cutoff ratio: When the value of the input slew over the Elmore delay is greater than this ratio, the Elmore-based timing analysis is accurate enough.
Second-moment filtering threshold value: If the value of $m_2/m_1^2$ is less than this threshold, (5) becomes the basis of the timing analysis. For an interconnect line with $m_2/m_1^2$ greater than this threshold, the AWE method should be used to calculate the higher moments. As this threshold goes towards 1, the delay and slew calculations become more accurate but the runtime increases.
Therefore, given the input slew Tr, the TFA for calculating the stage delay is as follows. 
Threshold-Based Filtering Algorithm
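The classification described by the three threshold parameters can be sketched as follows. This is a minimal illustration under assumptions, not the paper's listing: the order of the checks, the function and parameter names, and the use of the magnitude of the first moment as the Elmore delay are all assumptions made here.

```python
from enum import Enum


class NetClass(Enum):
    ELMORE = "low complexity: Elmore-based delay and slew"
    TWO_MOMENT = "medium complexity: two-moment metric"
    AWE = "high complexity: AWE"


def classify_net(input_slew: float, m1: float, m2: float,
                 elmore_threshold: float, cutoff_ratio: float,
                 m2_threshold: float) -> NetClass:
    """Pick the cheapest delay engine that the thresholds allow for one net."""
    elmore = abs(m1)  # assumption: Elmore delay is the magnitude of the first moment
    # Very small Elmore delay: estimation errors do not matter for the path.
    if elmore == 0.0 or elmore < elmore_threshold:
        return NetClass.ELMORE
    # Slow input relative to the net: a single dominant pole is a good model.
    if input_slew / elmore > cutoff_ratio:
        return NetClass.ELMORE
    # Moderate shielding: the first two moments are sufficient.
    if m2 / (m1 * m1) < m2_threshold:
        return NetClass.TWO_MOMENT
    # Otherwise fall back to the full AWE-based calculation.
    return NetClass.AWE


# Hypothetical nets (ps and ps^2), with thresholds of 4 ps, 7, and 1.44.
for slew, m1, m2 in [(50.0, -2.0, 5.0), (100.0, -10.0, 120.0),
                     (20.0, -10.0, 120.0), (20.0, -10.0, 300.0)]:
    print(classify_net(slew, m1, m2, 4.0, 7.0, 1.44).name)
```

Ordering the checks from cheapest to most expensive means a net only pays for the second-moment ratio when both Elmore-based shortcuts have been ruled out.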

C. Experimental Results
To verify the accuracy of the proposed filtering technique, the algorithm was applied to many high-performance designs, including Design#1 and Design#2. The design specifications are shown in Table I. All the experimental runs of the proposed algorithm were done on a 2.0-GHz x86-based PC with 2 GB of RAM. The sign-off FOM results (using AWE for interconnect timing analysis and $C_e$ for gate timing analysis) are shown in the first row of Table III for Design#1 and Table IV for Design#2. We changed the interconnect timing analysis algorithm from AWE to Elmore and to D2M and reported the results in the same tables. As shown there, the FOM results change by orders of magnitude when we apply Elmore and D2M, although the runtime decreases significantly. We also applied the TFA using an Elmore threshold of 4 ps, a dominant-pole cutoff ratio of 7, and a second-moment filtering threshold of 1.44. The proposed filtering algorithm improves the interconnect timing analysis runtime by 65%. In addition, the TFA results in a very small error in the FOM results compared to the AWE-based delay calculator. For Design#1, the max/average/min errors are 6%/1%/-2%, while for Design#2 the max/average/min errors are 8%/1%/-3%. Decreasing the Elmore threshold and the second-moment threshold and increasing the dominant-pole cutoff ratio tends to increase the accuracy at the expense of higher runtime. In fact, as the Elmore threshold goes to 0, the dominant-pole cutoff ratio goes to infinity, and the second-moment threshold goes to 0, the filtering algorithm simply resorts to the AWE-based timing analysis. Similarly, as the dominant-pole cutoff ratio goes to 0, the proposed filtering algorithm reduces to the Elmore-based delay and slew calculation.
Evidently, there is a tradeoff between efficiency and accuracy when choosing the threshold parameter values. As an example, the Elmore threshold filters out those cases where the Elmore delay calculator returns a small delay value. The definition of "small" is, however, design and technology dependent: 10 ps may be a small value in 180-nm technology, whereas in 90-nm technology it may no longer be considered small. One can choose different values at different stages of the design flow, starting with a large value and choosing smaller ones when proceeding from earlier design stages toward the sign-off stage.
III. FAST GATE TIMING ANALYSIS
In this section, we present a filtering technique for speeding up the gate delay calculation step in an STA tool while maintaining a reasonable level of accuracy. The filtering technique resorts to a necessary condition check to determine if $C_{total}$ can be used for the gate delay and/or output slew calculations without introducing a significant inaccuracy. The motivation for the filtering approach is given in Fig. 3, which shows that the distribution of the ratio of the "actual effective capacitance" to the "total capacitance" in a design is highly skewed towards one. As shown in Fig. 3, for Design#2, the mean of the distribution of the $C_e/C_{total}$ ratio is equal to 0.97. We have observed similar behavior in many other large industrial designs. This section is organized as follows. Section III-A reviews the background and previous work in the area of gate timing analysis. Section III-B describes the filtering technique for speeding up the gate delay and slew analysis. Experimental results are reported in Section III-C.
A. Background
In VDSM technologies, we cannot neglect the effect of the interconnect resistances of the output loads. Using the sum of all load capacitances as the capacitive load is simple, but can be quite pessimistic [11]. A more accurate approximation for an $n$th-order load seen by the gate (i.e., a load with $n$ distributed capacitances to ground) is a second-order RC-π model [5]. Therefore, the "effective capacitance" approach has been proposed [6], [10], [11], whereby the RC-π load is approximated by an equivalent capacitance, $C_e$.
All of the effective capacitance approaches resort to the iterative calculation of $C_e$ for the given circuit scenario, which can be costly in the context of physical design optimization tools. In this section, we present a filtering approach that resorts to a necessary condition check to determine if the $C_{total}$ algorithm is sufficient for evaluating the delay and/or output slew of logic gates, and thereby avoids effective capacitance calculations.
As shown in Fig. 3, for most cases of the gate timing analysis, $C_e$ is very close to $C_{total}$. If we are able to identify these cases, it will be possible to use the $C_{total}$ algorithm for their gate delay and/or output slew calculation and to employ the $C_e$ algorithm for the remaining cases. To determine which type of calculation a circuit configuration requires, we resort to an efficient and accurate condition check.
1) Problem Statement: Given is a CMOS driver whose input rise time is $T_{in}$ and which drives an output RC-π load. The problem is to find a robust and efficient necessary condition check to distinguish between cases that can be accurately handled by using the $C_{total}$ algorithm and those cases that need the iterative $C_e$ algorithm for gate propagation delay and/or output slew calculation during the physical design optimization process.
B. Proposed Filtering Technique
In our quest for a robust and efficient necessary condition, we start with the definition of the effective capacitance. The effective capacitance $C_e$ is a pure capacitance that replaces an RC-π load and has the property that it stores the same amount of charge as the RC-π load until a certain point of the output voltage transition (e.g., the 50% point of the output transition). We assume that the output voltage waveform of the CMOS driver behaves as a combination of ramp and exponential waveforms and that, therefore, the actual $C_e$ is obtained as a simple average of the $C_e$ obtained for a ramp output waveform and the $C_e$ obtained for an exponential output waveform.
In the following, we calculate C e for ramp and exponential waveforms of the gate output voltage. Modeling gate output waveform as 
We have also derived that if the output voltage of a gate is approximated 
Now, based on the assumption made, the iterative equation for the actual $C_e$ calculation for any $\alpha$% point of the output transition time can be represented as
$$C_e(\alpha) = C_n + \left[\lambda\, k_{Exp}(\alpha) + (1 - \lambda)\, k_{Ramp}(\alpha)\right] C_f \qquad (9)$$
where $0 \le \lambda \le 1$ is the linear combination factor for the exponential and ramp waveforms. We observed that using $\lambda = 0.5$ gives the minimum error between the iterative $C_e$ equation in (9) and the actual sign-off $C_e$ value. We will refer to a single iteration of (9) as the condition check formula. Fig. 4 compares $C_e^{Exp}$, $C_e^{Ramp}$, and $C_e$ for delay calculation, each obtained with a single iteration of (9) and normalized by $C_{total}$ on the y-axis, against the actual sign-off $C_e$ for delay calculation normalized by $C_{total}$ on the x-axis. To perform a single iteration of (9), we use the output slew of the gate when the gate sees the total capacitance as its load. Subsequently, we calculate $k_{Ramp}$, $k_{Exp}$, and $(k_{Ramp} + k_{Exp})/2$. As shown in this figure, the single-iteration $C_e$ using (9) is reasonably close to the actual sign-off $C_e$ value.
Before discussing the filtering algorithm, we define its threshold parameter. The filtering threshold is the parameter that separates the cases that utilize the efficient delay calculation from the cases that employ the iterative $C_e$ for delay calculation. To determine which type of STA scenario we encounter in practice, we resort to (9). First, we calculate the output slew of the gate for the total capacitive load. Next, we find $C_e$ by using a single iteration of (9) and the output slew from the previous step.
If $C_e/C_{total}$ is greater than the prespecified threshold value, then we call the gate library and find the gate propagation delay for the obtained $C_e$. Otherwise, we have to resort to a more accurate way of calculating $C_e$ (using the Thevenin equivalent circuit for the driver) to obtain the gate propagation delay and output slew values. We report the results of the filtering technique for different threshold values for Design#1 and Design#2 in Section III-C.
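The condition check can be sketched as below. Several things here are assumptions made for illustration: the factors `k_exp` and `k_ramp` are passed in as numbers because their expressions are not reproduced in this sketch, the total capacitance is taken as the sum of the near and far capacitances of the π-load, and `iterative_ce` is a hypothetical callback standing in for the accurate, iterative effective capacitance engine.

```python
from typing import Callable, Tuple


def single_iteration_ce(c_near: float, c_far: float,
                        k_exp: float, k_ramp: float, mix: float = 0.5) -> float:
    """One evaluation of the condition check formula (9) with combination factor `mix`."""
    return c_near + (mix * k_exp + (1.0 - mix) * k_ramp) * c_far


def choose_gate_load(c_near: float, c_far: float, k_exp: float, k_ramp: float,
                     threshold: float,
                     iterative_ce: Callable[[], float]) -> Tuple[float, str]:
    """Return the load capacitance to use for the gate lookup and how it was obtained."""
    c_total = c_near + c_far  # assumption: total capacitance of the pi-load
    ce = single_iteration_ce(c_near, c_far, k_exp, k_ramp)
    if ce / c_total > threshold:
        # Little resistive shielding: the cheap estimate is good enough.
        return ce, "single-iteration"
    # Significant shielding: fall back to the accurate iterative calculation.
    return iterative_ce(), "iterative"


# Hypothetical loads in fF, filtered with a threshold of 0.95.
print(choose_gate_load(2.0, 8.0, 0.9, 1.0, 0.95, lambda: 9.5))
print(choose_gate_load(2.0, 8.0, 0.6, 0.8, 0.95, lambda: 7.0))
```

Passing the iterative engine as a callback keeps the expensive computation from running at all for the cases that pass the filter, which is where the runtime saving comes from.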
C. Experimental Results
To evaluate the accuracy and performance of the proposed technique, the algorithm was applied to many high-performance industrial designs, including Design#1 and Design#2. Some of the characteristics of these two designs are shown in Table I. The FOM metric is used to assess accuracy. We performed several experiments on Design#1 (cf. Table V) and Design#2 (cf. Table VI). For the gate timing characteristics, we used the sign-off level gate library, which contains detailed and accurate k-factor equations describing the timing behavior of the logic gates. These equations are functions of the input transition time, the output load, $V_{dd}$, temperature, process parameters, etc. Since we observed that a combination factor of 0.5 introduces the minimum error with respect to the sign-off $C_e$ calculation, we set the combination factor to 0.5 in (9) throughout this section. Experiment 1 is the golden experiment in terms of accuracy since it uses sign-off STA for the timing analysis of the design. Experiments 2-7 apply the proposed filtering approach with different filtering threshold values. As experiment 4 indicates, a threshold of 0.95 gives a reasonable accuracy of within 1% error while substantially improving the runtime. The experimental results indicate that the filtering algorithm improves the runtime of the sign-off $C_e$ calculation by about 50% while introducing an error of only 1% in the FOM results. Experiment 9 uses the $C_{total}$ algorithm. As shown in Table V, the FOM results for experiment 9 suffer from very large errors. The single-iteration effective capacitance is used in experiment 8. As shown, the error in these results is much smaller than that of the $C_{total}$ algorithm, while the runtime is comparable to that of the $C_{total}$ algorithm.
As mentioned before, the threshold values in the filtering algorithm depend on the designer, the technology, and the step in the design flow. A designer can choose these threshold parameter values based on the desired tradeoff between accuracy and runtime, starting with a small value and choosing larger ones when proceeding from earlier design stages toward the sign-off stage. One can run a few test cases for each class of designs and in each technology node to obtain the threshold parameter values for the filter. Deriving these parameter values is therefore rather straightforward, but they must be tailored to a particular design and technology.
IV. CONCLUSION
In this paper, we first presented a threshold-based filtering algorithm for estimating the interconnect delay and slew in high performance interconnects. The proposed algorithm filters out a set of nets for timing evaluation using the Elmore-based delay and slew calculation engine. Furthermore, a closed-form expression for calculating the delay and slew was provided for those interconnect lines with $m_2/m_1^2$ less than a certain critical threshold. Experimental results on large industrial designs show that the filtering technique results in a negligible error of about 1% while exhibiting about 65% improvement in the interconnect timing analysis runtime. Next, a threshold-based filtering technique was proposed to speed up the gate delay and slew calculation in VDSM technologies. We observed that the distribution of the ratio of the actual $C_e$ to $C_{total}$ in industrial designs is highly skewed toward one, which led us to a novel filtering algorithm. This algorithm uses $C_{total}$ for most circuit scenarios and a $C_e$ algorithm for the remaining rare scenarios. Experimental results on large industrial designs show that the filtering technique results in a negligible error of about 1% while exhibiting about 50% improvement in the gate timing analysis runtime.
