Abstract-In this paper, we propose a statistical gate sizing approach to maximize the timing yield of a given circuit under area constraints. Our approach involves statistical gate delay modeling, statistical static timing analysis, and gate sizing. Experiments performed in an industrial framework on combinational International Symposium on Circuits and Systems (ISCAS'85) and Microelectronics Center of North Carolina (MCNC) benchmarks show absolute timing yield gains of 30% on average over deterministic timing optimization, for an area penalty of at most 10%. It is further shown that, for iso-area solutions, circuits optimized using our metric have larger timing yields than the same circuits optimized using a worst case metric. Finally, we present insight into the statistical properties of gate delays for a commercial 0.13-μm technology library, which intuitively provides one reason why statistical timing driven optimization does better than deterministic timing driven optimization.
I. INTRODUCTION
An increasing significance of variability in modern deep submicrometer integrated circuits necessitates statistical approaches to timing analysis and optimization. Researchers have proposed multiple approaches to statistical static timing analysis [2]-[6] in the past few years. A majority of these approaches consider circuit component delays as Gaussian random variables, since this facilitates fast analytical evaluation. Timing analysis involves add and max operations. A max operation on Gaussian random variables is nontrivial. Chang et al. [3] and Visweswariah et al. [5] propose to approximate the maximum of multiple Gaussians with a Gaussian, using Clark's approach [7] to obtain the max of two Gaussians. Pairwise max operations, each of which involves an approximation, are thus employed in the computation of the maximum of multiple Gaussians. However, none of the above approaches describes the impact of the ordering of pairwise max operations on the resulting inaccuracy in the final solution.
Multiple approaches to statistical timing optimization have emerged recently. Agarwal et al. propose a sensitivity-based gate sizing algorithm, and faster approaches that perform sensitivity calculation based on slack computation [8], to minimize the 99-percentile point of a circuit's delay distribution. Intra-die variability is considered, and gate delay variations are assumed to be 10% of their nominals. A robust gate sizing methodology based on geometric programming is proposed by Singh et al. [9]. They incorporate an uncertainty ellipsoid to model variations and attempt to optimize circuit area under worst case timing constraints. Guthaus et al. [10] propose a gate sizing algorithm to optimize circuit area while satisfying a given timing yield target. They employ a sensitivity metric to select gates for resizing. Our experiments indicate that the node and edge criticalities evaluated in closed form in their approach are only within 20% of those obtained from Monte Carlo simulations. This is due to the assumption of independence between the criticalities of any two paths while evaluating a node or an edge criticality. As a result, these criticalities may be inadequate for guiding timing optimization.
In this paper, we present an approach to area constrained statistical timing yield optimization that involves statistical modeling, statistical timing analysis, and gate sizing. We do not focus on improving a given percentile point of a circuit's delay distribution, but attempt to maximize the probability that given timing constraints are met under variations. Statistical gate delay modeling is performed for a commercial 0.13-μm technology library from a foundry. We employ Visweswariah's approach [5] for statistical static timing analysis, and present a formal proof that validates the variance matching methodology used there in the computation of the maximum of two Gaussians. We also consider a smart ordering for pairwise max operations on Gaussians during the computation of the maximum of multiple Gaussians, and observe that this ordering improves the accuracy of the final solution. Gate sizing is performed using a statistical global sizing algorithm. We prove that maximizing the timing yield of a circuit is equivalent to maximizing a simple expression involving the mean and the standard deviation of the circuit's slack distribution. Experiments performed in an industrial framework show absolute timing yield gains of 30% on average in comparison to a commercial synthesis tool, for an area overhead of at most 10%. We observe that, for iso-area solutions, our metric obtains larger timing yields than optimization for the worst case slack. Finally, we present insight into the statistical properties of gate delays from a commercial technology library, which intuitively provides one reason why statistical timing driven optimization does better than deterministic timing driven optimization.
The rest of this paper is organized as follows. Sections II and III present our approaches to statistical modeling and statistical static timing analysis, respectively. In Section IV, we propose our statistical gate sizing algorithm for timing yield optimization, and present experimental results in Section V. We provide insight into statistical properties of gate delays in Section VI, and draw conclusions in Section VII.
II. STATISTICAL MODELING
Statistical delay modeling involves expressing circuit component delays as functions of the parameters of variation, which we model as Gaussian random variables. Based on the work in [3] and [5], we assume that gate delays are approximated by a linear function of the parameters. We also assume that these parameters are independent, since a dependent set of Gaussian parameters can be transformed into an equivalent set of independent Gaussian parameters using principal component analysis [3]. Circuit component delays are, therefore, expressed as
$$d = a_0 + \sum_{i=1}^{n} a_i \Delta X_i + a_{n+1} \Delta R_a \tag{1}$$
In the previous expression, $a_0$ denotes the mean or nominal value of the delay, the $\Delta X_i$'s represent the variations of the $n$ global parameters $X_i$ from their nominal values, and the $a_i$'s denote the delay sensitivities to their corresponding sources of variation. $\Delta R_a$ represents the variation from the nominal of an independent random variable that is associated with each component, and $a_{n+1}$ denotes the delay sensitivity to $\Delta R_a$. To compute the delay sensitivities for any gate in the circuit, we obtain precharacterized gate delay values as functions of their loading capacitance and input slews (based on deterministic timing analysis at the nominal corner) at multiple corners in the parameter space. The parameters are normalized by subtracting their nominal values and then dividing by their standard deviations. A least squares fit is finally employed to obtain the desired delay sensitivities that express the gate delay as a linear function of normal random variables, as expressed in (1). This procedure is repeated for each gate in the circuit. Fig. 1 shows precharacterized delay values for an inverter in a circuit at multiple corners in a 2-D parameter space. A least squares fit of the obtained points results in a plane, whose slopes in the two coordinate directions give the sensitivities of the inverter delay to the two parameters. The inverter delay is, thus, obtained as a weighted linear sum of Gaussian random variables.
III. STATISTICAL STATIC TIMING ANALYSIS
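Statistical static timing analysis requires propagation of delay distributions through the circuit. This involves add and max operations on the delay random variables. Since we express circuit component delays as linear combinations of Gaussian random variables, the add operation is straightforward and yields another Gaussian. In this work, we employ Visweswariah's approach [5] to compute the maximum of two Gaussian delay random variables $A$ and $B$, which are expressed as weighted linear sums of normal random variables as in (1). We denote the (mean, variance) of $A$ and $B$ as $(a_0, \sigma_A^2)$ and $(b_0, \sigma_B^2)$, respectively, where the $a_i$'s and $b_i$'s represent delay sensitivities. We use $\rho$ to denote the correlation coefficient between $A$ and $B$, and define the following:
$$\sigma_A^2 = \sum_{i=1}^{n+1} a_i^2, \qquad \sigma_B^2 = \sum_{i=1}^{n+1} b_i^2 \tag{2}$$
$$\rho = \frac{1}{\sigma_A \sigma_B} \sum_{i=1}^{n} a_i b_i \tag{3}$$
$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \tag{4}$$
$$\Phi(y) = \int_{-\infty}^{y} \phi(x)\, dx \tag{5}$$
$$\theta = \sqrt{\sigma_A^2 + \sigma_B^2 - 2\rho\,\sigma_A \sigma_B} \tag{6}$$
$$\alpha = \frac{a_0 - b_0}{\theta} \tag{7}$$
The mean $\mu_C$ and variance $\sigma_C^2$ of $C = \max(A, B)$ are computed as follows (Clark's approach [7]):
$$\mu_C = a_0\,\Phi(\alpha) + b_0\,\Phi(-\alpha) + \theta\,\phi(\alpha) \tag{8}$$
$$\sigma_C^2 = (\sigma_A^2 + a_0^2)\,\Phi(\alpha) + (\sigma_B^2 + b_0^2)\,\Phi(-\alpha) + (a_0 + b_0)\,\theta\,\phi(\alpha) - \mu_C^2 \tag{9}$$
Approximation of $C$ with a Gaussian having a canonical form is performed as follows (Visweswariah's approach [5]):
$$C \approx c_0 + \sum_{i=1}^{n} c_i \Delta X_i + c_{n+1} \Delta R_c, \quad c_0 = \mu_C, \quad c_i = T_A a_i + (1 - T_A)\, b_i, \quad c_{n+1} = \sqrt{\sigma_C^2 - \sum_{i=1}^{n} c_i^2} \tag{10}$$
$T_A = \Phi(\alpha)$, in the previous expression, denotes the tightness probability of $A$ over $B$, that is, the probability that $A$ dominates $B$. Our first contribution to this approach is that we formally validate the variance matching approach in (10). We prove in the appendix that $\sigma_C^2 - \sum_{i=1}^{n} c_i^2$ is always nonnegative. This implies that the variance matching approach never involves the computation of the square root of a negative quantity. Required time estimation in statistical timing analysis is performed by a backward propagation of delay distributions and involves the subtract and min operations. These operations are similar to the add and max operations.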
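The add and max operations on canonical forms can be sketched in a few lines. The following is a minimal illustration, assuming unit-variance normalized parameters and using hypothetical names (`Canonical`, `add`, `clark_max`) that do not come from the paper or from [5]; it follows Clark's moment formulas [7] and the tightness-probability weighting and variance matching described above.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Canonical:
    """Delay in canonical form: mean + sum(glob[i] * dX_i) + indep * dR.

    All dX_i and dR are assumed to be independent unit normals.
    """
    mean: float
    glob: List[float]   # sensitivities to the shared (global) parameters
    indep: float        # sensitivity to the component's own random variable

    def variance(self) -> float:
        return sum(a * a for a in self.glob) + self.indep ** 2


def pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def add(a: Canonical, b: Canonical) -> Canonical:
    # Global sensitivities add; the independent parts combine in root-sum-square.
    glob = [x + y for x, y in zip(a.glob, b.glob)]
    return Canonical(a.mean + b.mean, glob, math.hypot(a.indep, b.indep))


def clark_max(a: Canonical, b: Canonical) -> Canonical:
    var_a, var_b = a.variance(), b.variance()
    # Only the shared parameters contribute to the covariance.
    cov = sum(x * y for x, y in zip(a.glob, b.glob))
    theta = math.sqrt(max(var_a + var_b - 2.0 * cov, 0.0))
    if theta < 1e-12:
        # A and B differ by a constant: the one with the larger mean dominates.
        return a if a.mean >= b.mean else b
    alpha = (a.mean - b.mean) / theta
    t = cdf(alpha)                                  # tightness probability of A
    mean = a.mean * t + b.mean * (1.0 - t) + theta * pdf(alpha)
    second = ((var_a + a.mean ** 2) * t
              + (var_b + b.mean ** 2) * (1.0 - t)
              + (a.mean + b.mean) * theta * pdf(alpha))
    var = second - mean ** 2
    glob = [t * x + (1.0 - t) * y for x, y in zip(a.glob, b.glob)]
    # Variance matching; the clamp only guards against floating-point round-off.
    indep = math.sqrt(max(var - sum(c * c for c in glob), 0.0))
    return Canonical(mean, glob, indep)
```

For two independent standard normals, this sketch returns a mean of $1/\sqrt{\pi}$ and a variance of $1 - 1/\pi$, which are the known exact moments of their maximum.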
When a gate has more than two fan-ins (fan-outs), the max (min) operation for the arrival (required) time distribution calculation is done one pair at a time, each step of which involves approximations. We observe that an arbitrary order of these pairwise operations may accumulate errors and can significantly affect the accuracy of the final solution. We employ a greedy approach for smart pairwise max (min) operations based on the approximation error computations [11] . Slack estimation during timing analysis involves subtract operations which can be performed on the canonical forms of the timing distributions. A min operation on the slack distributions at the primary outputs gives the circuit slack.
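One way to see why the min operation mirrors the max operation is the standard identity below. This is a sketch and not necessarily the derivation used in [5]. Here $A$ and $B$ are two jointly Gaussian timing quantities with means $a_0$, $b_0$, standard deviations $\sigma_A$, $\sigma_B$, and correlation coefficient $\rho$; these symbol names are assumed for the illustration.

```latex
\min(A, B) = -\max(-A, -B)
\\[4pt]
E[\min(A, B)] = a_0 + b_0 - E[\max(A, B)]
             = a_0\,\Phi(-\alpha) + b_0\,\Phi(\alpha) - \theta\,\phi(\alpha),
\\[4pt]
\theta = \sqrt{\sigma_A^2 + \sigma_B^2 - 2\rho\,\sigma_A\sigma_B},
\qquad
\alpha = \frac{a_0 - b_0}{\theta}
```

Negating both operands leaves $\theta$ unchanged and flips the sign of $\alpha$, so the same moment formulas serve both operations.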
IV. STATISTICAL GATE SIZING
We formally define the timing yield of a circuit to be the probability that the circuit slack is nonnegative. This probability can be computed by integrating the slack probability density function (pdf) from 0 to $\infty$. Given the circuit slack $S$ (after statistical timing analysis) as a Gaussian random variable with mean $\mu$ and standard deviation $\sigma$, the timing yield $Y$ of the circuit is given by
$$Y = P(S \ge 0) = \int_{0}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(s-\mu)^2/(2\sigma^2)}\, ds \tag{11}$$
In this work, we attempt to maximize the timing yield of a circuit using gate sizing, under given area constraints. We next prove that maximizing the timing yield is equivalent to maximizing the ratio of the mean to the standard deviation of the circuit slack distribution.
Theorem 1: Maximizing the timing yield $Y$ in (11) is equivalent to maximizing $\mu/\sigma$.
Proof: We define $t = (s - \mu)/\sigma$. Under this variable transformation, (11) becomes $Y = \frac{1}{\sqrt{2\pi}} \int_{-\mu/\sigma}^{\infty} e^{-t^2/2}\, dt$, which is strictly increasing with $\mu/\sigma$. This proves our claim.
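As a quick numerical check of the theorem, the yield of a Gaussian slack can be evaluated directly from the ratio of its mean to its standard deviation. The sketch below uses a hypothetical helper `timing_yield` and made-up slack values in arbitrary time units; it is an illustration, not code from the paper.

```python
import math


def timing_yield(mu: float, sigma: float) -> float:
    """P(slack >= 0) for a Gaussian slack with mean mu and std. dev. sigma > 0."""
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))


# Two hypothetical sizing solutions: B has the smaller mean slack but a
# larger mean-to-sigma ratio, and therefore the larger yield.
yield_a = timing_yield(100.0, 80.0)   # mu/sigma = 1.25, yield about 0.894
yield_b = timing_yield(90.0, 60.0)    # mu/sigma = 1.50, yield about 0.933
```

Because the yield depends on the slack distribution only through $\mu/\sigma$, an optimizer can rank candidate solutions by this ratio without evaluating the integral.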
Our statistical gate sizing approach, thus, attempts to maximize the metric $\mu/\sigma$, under area constraints. For the sake of comparison, we also consider maximizing the metric $\mu - 3\sigma$, under identical area constraints; such an objective function attempts to maximize the worst case slack. We design a statistical global gate sizing (SGGS) algorithm for timing yield optimization as an extension to the global gate sizing algorithm [12]. Our choice of the global sizing algorithm is motivated by results obtained by Coudert et al. [12], which show that this algorithm is superior to common greedy or genetic approaches to circuit optimization in terms of performance and power/delay curves. The proposed algorithm considers the circuit as a network of nodes $N$ with a global cost function Cost($N$) that is to be maximized under given area constraints Area($N$). The global cost function used in our approach is the metric $\mu/\sigma$, where $\mu$ and $\sigma$ denote the mean and the standard deviation, respectively, of the circuit slack distribution $S$. Each node in the network is implemented using some gate from the given technology library. Multiple gates, each belonging to the same gate class as the node, can be mapped to a given node. We refer to this process as resizing a node. The variation in the global cost due to resizing a node is termed its gradient for that particular resize operation. We define the local cost of a node as the ratio of the mean to the standard deviation of the slack distribution at its output. Variations in the local cost of a node due to various resizing operations are termed the corresponding local gradients.
We describe the algorithmic flow next. A set update maintains a list of nodes whose gradients are to be computed. This set is initialized with all nodes in the network. Another set moves maintains a list of nodes that can potentially be resized. This set is initially empty. For any node, the gradient computation for each possible resize involves a run of statistical timing analysis on the entire circuit, which makes gradient evaluation computationally very expensive. In practice, we observe that the impact of a node resize on the local gradients decreases quickly (approximately geometrically [12]) with increasing fan-in and fan-out level. We, therefore, extract a subnetwork for each node in update, consisting of two transitive levels of fan-in and fan-out around that node. The inputs and outputs of the subnetwork are annotated with the corresponding arrival and required time distributions, respectively, from the original network. Statistical timing analysis is now performed on the subnetwork, and the local gradient at the output of this subnetwork is used as the metric for evaluation. If at least one possible resize operation on the node improves this metric, the gate that maximizes the metric is termed the best-gate for the node. The node and its best-gate are then added as a possible resize operation to the set moves. However, the resize is not actually performed at this stage.
Following the above procedure for each node in the set update, a MultiMove routine picks a subset of possible resize operations from the set moves that provides the maximum cumulative gain in the global cost. These resize operations are then performed, and the resized nodes are returned in a new set moved. The MultiMove routine determines the subset for the move based on the descent direction or on a conjugation of directions of the cost gradients [12]. In our experiments, we employ a greedy heuristic that chooses the best two nodes for resizing, in terms of yield improvement, in each MultiMove operation. A new set of nodes whose gradients need to be recomputed is now derived from moved in the function PerturbedNodes. In our approach, we choose a node for gradient recomputation only if it is sufficiently perturbed, that is, if one of its close neighbors (within one or two transitive fan-in or fan-out levels) has been resized. The entire process is repeated until convergence, that is, until further iterations do not improve the global cost (timing yield of the circuit) or until the runtime/area constraints of the design are violated. For comparison, this procedure is repeated starting with the original design, using the worst case slack metric $\mu - 3\sigma$ as both the global cost function and the local cost function. The complexity of this algorithm, estimated using a best-fit polynomial in the number of internal nodes, is reported in [13]. The pseudocode of the SGGS algorithm is presented in Fig. 2.
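The overall loop can be illustrated on a toy problem. The sketch below is a heavily simplified stand-in, not the SGGS algorithm of Fig. 2: it assumes a single path of nodes with independent Gaussian gate delays, evaluates the global cost $\mu/\sigma$ directly instead of extracting subnetworks and local gradients, and omits PerturbedNodes. All names (`size_gates`, `cost`, `area`) and all gate values are hypothetical.

```python
import math
from typing import List, Tuple

Gate = Tuple[float, float, float]   # (mean delay, delay sigma, area), toy values
Library = List[List[Gate]]          # candidate gates for each node on the path


def cost(choice: List[int], lib: Library, t_req: float) -> float:
    """Ratio mu/sigma of the slack of a single path with independent delays."""
    mu = t_req - sum(lib[n][g][0] for n, g in enumerate(choice))
    var = sum(lib[n][g][1] ** 2 for n, g in enumerate(choice))
    return mu / math.sqrt(var)      # assumes at least one gate has sigma > 0


def area(choice: List[int], lib: Library) -> float:
    return sum(lib[n][g][2] for n, g in enumerate(choice))


def size_gates(lib: Library, start: List[int], t_req: float,
               max_area: float, moves_per_iter: int = 2) -> List[int]:
    choice = list(start)
    while True:
        base = cost(choice, lib, t_req)
        moves = []                  # (gain, node, best gate) per improvable node
        for n in range(len(choice)):
            best = None
            for g in range(len(lib[n])):
                if g == choice[n]:
                    continue
                trial = choice[:n] + [g] + choice[n + 1:]
                if area(trial, lib) > max_area:
                    continue
                gain = cost(trial, lib, t_req) - base
                if gain > 1e-12 and (best is None or gain > best[0]):
                    best = (gain, n, g)
            if best is not None:
                moves.append(best)
        # Greedy stand-in for MultiMove: apply the best few moves, re-checking
        # the area limit and the cost after each one.
        moves.sort(reverse=True)
        applied = False
        for _, n, g in moves[:moves_per_iter]:
            trial = choice[:n] + [g] + choice[n + 1:]
            if (area(trial, lib) <= max_area
                    and cost(trial, lib, t_req) > cost(choice, lib, t_req)):
                choice = trial
                applied = True
        if not applied:
            return choice


# Node 0: gate 1 is faster and less variable but larger.
# Node 1: gate 1 has a smaller mean delay but a much larger sigma.
toy_lib = [[(10.0, 3.0, 1.0), (9.0, 1.0, 2.0)],
           [(10.0, 1.0, 1.0), (9.5, 4.0, 1.0)]]
sized = size_gates(toy_lib, [0, 0], t_req=25.0, max_area=3.0)
```

In this toy case the loop upsizes node 0 and leaves node 1 unchanged, whereas a sizing driven by mean delay alone would pick gate 1 at both nodes. Every applied move strictly increases the cost, so the loop terminates.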
V. IMPLEMENTATION AND EXPERIMENTAL RESULTS
The proposed statistical modeling, statistical timing analysis, and gate sizing routines are implemented in an industrial framework, as an addition to a commercial synthesis and optimization tool. Experiments are performed on combinational International Symposium on Circuits and Systems (ISCAS'85) and Microelectronics Center of North Carolina (MCNC) benchmarks mapped to a 0.13-μm commercial technology library from a foundry.
For our experiments, we choose supply voltage ($V_{dd}$) and temperature as the parameters of variation. We acknowledge that these parameters may have a nonlinear impact on delays. However, precharacterized gate delay values were available for the commercial 0.13-μm library that we intended to use in our experiments. It was not immediately possible to recharacterize these gates for other parametric variations, and we did not use artificial values for them, as is done in a majority of the other approaches to statistical optimization mentioned above. In any case, our approach is not limited to the use of any particular parameters of variation.
We consider $V_{dd}$ variations in the range of 1.08 to 1.32 V. The nominal value is set to 1.2 V, and the standard deviation is set as the following:
$$\sigma_{V_{dd}} = \frac{1.32 - 1.08}{6} = 0.04\ \text{V}$$
Similarly, we consider temperature variations from 0 to 125 °C, with a nominal temperature of 25 °C and a standard deviation of 8.33 °C. For any characterization point $(V_j, T_j)$, the delay equation is set up as the following:
$$d_j = a_0 + a_1\, \frac{V_j - 1.2}{0.04} + a_2\, \frac{T_j - 25}{8.33}$$
$a_0$ represents the typical delay obtained from gate characterization at $V_{dd} = 1.2$ V and $T = 25$ °C. This formulation is scalable to any number of parameters. A least squares fit procedure is employed to obtain the coefficients $a_i$. The accuracy of this approach depends on the number of characterization points that are available in the library.
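The fitting step can be sketched with the normal equations of ordinary least squares. The code below is a minimal illustration, assuming the nominal values and standard deviations quoted above, fitting all three coefficients including the constant term, and using hypothetical names (`fit_sensitivities`, `solve`) and synthetic corner delays; it is not the characterization flow of the industrial framework.

```python
from typing import List, Sequence, Tuple

V_NOM, V_SIGMA = 1.2, 0.04     # volts, values quoted in the text
T_NOM, T_SIGMA = 25.0, 8.33    # degrees Celsius, values quoted in the text


def solve(m: Sequence[Sequence[float]], rhs: Sequence[float]) -> List[float]:
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(m, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_sensitivities(corners: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Fit d = a0 + a1*dV + a2*dT over (vdd, temp, delay) corners.

    dV and dT are the normalized parameters. Needs at least three corners
    that are not collinear in the (vdd, temp) plane.
    """
    rows = [[1.0, (v - V_NOM) / V_SIGMA, (t - T_NOM) / T_SIGMA]
            for v, t, _ in corners]
    delays = [d for _, _, d in corners]
    # Normal equations: (X^T X) a = X^T d
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xtd = [sum(r[i] * d for r, d in zip(rows, delays)) for i in range(3)]
    return solve(xtx, xtd)


# Synthetic corners generated from a known plane (made-up sensitivities).
true_coeffs = [100.0, -6.0, 2.5]
corners = [(v, t, true_coeffs[0]
            + true_coeffs[1] * (v - V_NOM) / V_SIGMA
            + true_coeffs[2] * (t - T_NOM) / T_SIGMA)
           for v in (1.08, 1.2, 1.32) for t in (0.0, 25.0, 125.0)]
fitted = fit_sensitivities(corners)
```

Because the parameters are normalized to unit variance, the fitted $a_1$ and $a_2$ can be used directly as the sensitivities in the canonical form (1).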
Statistical timing analysis is next performed to obtain the global circuit slack distribution $S$, with mean $\mu$ and variance $\sigma^2$. The timing yield of the circuit is obtained from (11). Table I shows the statistical timing analysis results, compared against Monte Carlo simulations. From Table I, we observe that the average and maximum errors in the estimation of the mean and standard deviation of the circuit delay distribution are under 1% and 4.1%, respectively. SSTA is found to be faster than Monte Carlo simulations by 42.2× on average.
For timing yield improvement estimation, we perform deterministic timing optimization on a given circuit using a commercial synthesis tool, which attempts to improve the circuit slack under area constraints. Statistical timing analysis is then performed to obtain the slack distribution at the primary output of the circuit, the mean of which we denote as $\mu_D$. We next perform statistical timing optimization using our proposed gate sizing approach to obtain a new circuit slack distribution. To estimate the relative gain in timing yield, we compute the relative timing yield of the circuit after the deterministic and statistical optimization passes as the area under their respective circuit slack PDFs from $\mu_D$ to $\infty$. Fig. 3 shows this relative timing yield improvement graphically as the area of the black region minus the area of the striped region. We next repeat this procedure using the alternate metric $\mu - 3\sigma$ as the cost function during statistical optimization instead of our original metric $\mu/\sigma$. Table II presents the relative timing yield improvements for both optimization objective functions. We observe that our proposed metric achieves timing yield improvements of 0.3 on average, and up to 0.5, with an area overhead of at most 10%. The corresponding average and maximum timing yield improvements using the alternate metric are found to be 0.27 and 0.49, respectively (for identical area overheads). It is, thus, shown that the proposed approach guides optimization better than maximizing the worst case slack does, under iso-area constraints. For the design alu2, the alternate metric worsens the yield.
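Under the Gaussian slack model, the relative timing yield has a closed form. The following is a brief sketch that reads the lower integration limit as the mean slack of the deterministically optimized circuit, written $\mu_D$ here. $(\mu_D, \sigma_D)$ and $(\mu_S, \sigma_S)$ denote the slack mean and standard deviation after deterministic and statistical optimization, respectively. These subscripted names are assumptions made for this illustration.

```latex
Y_{\mathrm{rel}}(\mu, \sigma)
  = \int_{\mu_D}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\,
    e^{-(s-\mu)^2/(2\sigma^2)}\, ds
  = \Phi\!\left(\frac{\mu - \mu_D}{\sigma}\right)
\\[6pt]
\Delta Y
  = Y_{\mathrm{rel}}(\mu_S, \sigma_S) - Y_{\mathrm{rel}}(\mu_D, \sigma_D)
  = \Phi\!\left(\frac{\mu_S - \mu_D}{\sigma_S}\right) - \frac{1}{2}
```

Since the deterministic solution is integrated from its own mean, its relative yield is exactly 1/2, so under this reading the improvement cannot exceed 0.5.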
We next present a special case of timing yield improvement observed for the MCNC benchmark APEX6. The three PDFs in Fig. 4 denote the slack distributions for the unoptimized circuit (Init), the circuit following deterministic static timing optimization (Static), and the circuit following statistical timing optimization (SSTO). The reduced variance of the SSTO slack PDF improves the timing yield (from 0.87 to 0.89) even though it has a smaller mean than the Static slack PDF. This example illustrates how statistical optimization uses the additional information on variation to achieve larger timing yields, even for iso-area solutions. The proposed algorithm takes less than 480 min for the largest benchmarks on a 400-MHz Sun Ultra 4 machine with 4-GB RAM. The primary reasons for the large run times are the multiple calls to statistical timing analysis, which performs smart pairwise max operations [11], and the exhaustive search for the best-gate of each node in the inner loop of the algorithm.
VI. ANALYSIS OF STATISTICAL PROPERTIES OF GATE DELAYS
We perform an analysis of the statistical properties of gate delays on different gate classes from our 0.13-μm commercial technology library. We select some nodes arbitrarily from a test circuit and observe the mean and the standard deviation of the arrival time distribution at each of their outputs, while mapping different gates to them (the different gates belong to the gate class of the node, for example, NAND or NOR). Fig. 5 presents a plot of the arrival time standard deviation (Sigma) against the arrival time mean for a class of inverters. Dots on the plot represent gates, which are sorted by the mean of their output arrival times when mapped to the given node and not in any order of their sizes. Fig. 6 presents similar graphs for two classes of AND gates.
We observe that, though most gates of a class make the plots monotonic, there exist exceptions. In some cases during our experiments, the deterministic timing driven optimizer resizes a node to a gate with a smaller mean arrival time, ignoring the fact that it may have larger variability, whereas the statistical timing driven optimizer selects a gate with a larger mean arrival time but a significantly smaller variance. Such a choice is found to increase the overall timing yield of the circuit. This behavior provides one reason why statistical timing driven optimization gains an edge over deterministic timing driven optimization.
VII. CONCLUSION
In this paper, we propose a statistical gate sizing approach to maximize the timing yield of a given circuit under area constraints. Experiments performed in an industrial framework on combinational ISCAS'85 and MCNC benchmarks show timing yield gains of 0.3 on average over deterministic timing optimization, for an area penalty of at most 10%. It is further shown that, for iso-area solutions, circuits optimized using our metric have larger timing yields than the same circuits optimized using a worst case metric. Finally, we present insight into the statistical properties of gate delays for a commercial 0.13-μm technology library, which intuitively provides one reason why statistical timing driven optimization does better than deterministic timing driven optimization.
Though this work considers delays as a weighted linear sum of Gaussian random variables, the statistical timing yield improvement approach can be extended to handle non-Gaussian parameters and nonlinear delay functions as proposed in [14]. However, obtaining a simple metric for timing yield optimization in that setting would be a challenging problem.
APPENDIX
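Using the notations defined in (6), (7), and (9), we prove that the variance matching method in (10) never involves the computation of the square root of a negative quantity. Formally, we prove that $\sigma_C^2 - \sum_{i=1}^{n} c_i^2 \ge 0$, where $\sigma_C^2$ is the variance of the max in (9) and the $c_i$'s are the sensitivities of its canonical approximation to the global parameters.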
Proof:
To show , it is sufficient to show that If . For positive (since ), it is sufficient to show that is symmetric and is found to be nonnegative for all real values of . For values of approaches 0 with both and tending to 0. Fig. 7 shows the plot of as a function of .
