Abstract
Introduction
Recent advances in VLSI have continued to shrink device geometries at a steady rate in accordance with Moore's Law. However, this advancement has also been accompanied by increasing variations in the performance of fabricated circuits. Numerous factors have contributed to this trend, including clock PLL jitter, noise, PV model inaccuracies, and manufacturing variations. Nevertheless, it is often desirable to manufacture ASICs on advanced technology nodes because of the substantial increase in available device count, the reduction in power consumption, higher yields, and the lower costs afforded by larger 300mm wafers.
Researchers have recently focused on statistical analysis approaches in an attempt to grapple with these sources of performance variations. Statistical timing analysis models delay arcs as random variables and propagates timing constraints using probability distribution functions (pdfs). While substantial effort has gone into the analysis aspect of this problem [1, 2], research into statistical optimization of circuits has been surprisingly limited.
Circuit optimization was done in [3] using LANCELOT [4], but that approach was severely limited in circuit size and used unrealistic delay models. A concept of criticality of gates was used in [5], but it did not address the variance of the timing path delays. A transistor-level approach was presented in [6]. Several yield-specific techniques were presented in [7]. In this paper we present a unique approach that identifies worst negative statistical slack (WNSS) paths, analogous to traditional worst negative slack (WNS) paths. Our method also provides flexibility in the optimization objective function by assigning weights that enable user-driven tradeoffs between mean and variance of circuit performance.
The remainder of this paper is organized as follows:
- We present background on the proposed research
- We formulate the problem of performance variability reduction in the presence of statistical delays
- We derive a method for tracing the worst negative statistical slack (WNSS) path in a circuit
- We derive and demonstrate the efficacy of a new approximation for quick calculation of the mean and variance of the maximum of random variables
- We present a robust gain-based sizing approach that handles a weighted sum of means and variances of delays
- We present and analyze experimental results
Related work

Gate sizing
Gate sizing has been studied extensively in the literature. It is typically performed after technology mapping during logic synthesis and repeated throughout the design process. The aim of gate sizing is to assign sizes to gates in a circuit such that a performance objective function is satisfied.
Although sizing approaches relying on convex assumptions or analytical delay models have been proposed, more recent approaches tend to tackle the problem using greedy heuristics. According to [8], accurate delay models make gate sizing a non-linear, non-convex, constrained, discrete optimization problem. Most greedy gate sizing algorithms share several common elements [8, 9, 10, 11]. The critical path, sometimes referred to as the Worst Negative Slack (WNS) path, is usually targeted for optimization. We note that the WNS path can change as the optimization proceeds, so the path being evaluated for resizing must be updated regularly during sizing iterations. The algorithms can be run in a constrained mode where, for example, delay is optimized first and area is then recovered as far as possible without violating a delay constraint.
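As an illustration of the deterministic WNS path that greedy sizers target, the following minimal sketch computes arrival times over a topologically ordered netlist and backtracks from the latest-arriving output through the latest-arriving input of each gate. The data layout (`gates`, `fanins`, `delay`) and the single `required_time` are assumptions made for this example and are not taken from the algorithms in [8, 9, 10, 11].

```python
def trace_wns_path(gates, fanins, delay, required_time):
    """Return (path, slack) for the worst negative slack path.

    gates: gate names in topological order (assumed layout).
    fanins: dict mapping a gate to the list of gates driving it.
    delay: dict mapping a gate to its (deterministic) delay.
    required_time: required arrival time at every primary output.
    """
    arrival = {}
    worst_input = {}
    for g in gates:
        drivers = fanins.get(g, [])
        if drivers:
            # The latest-arriving driver determines this gate's arrival time.
            w = max(drivers, key=lambda d: arrival[d])
            worst_input[g] = w
            arrival[g] = arrival[w] + delay[g]
        else:
            # Gates without drivers see their inputs arrive at time 0.
            arrival[g] = delay[g]

    # Primary outputs are the gates that drive no other gate.
    driven = {d for ds in fanins.values() for d in ds}
    outputs = [g for g in gates if g not in driven]
    end = max(outputs, key=lambda g: arrival[g])
    slack = required_time - arrival[end]

    # Backtrack from the worst output towards the inputs.
    path = [end]
    while path[-1] in worst_input:
        path.append(worst_input[path[-1]])
    path.reverse()
    return path, slack


if __name__ == "__main__":
    gates = ["a", "b", "c", "d"]
    fanins = {"c": ["a", "b"], "d": ["c"]}
    delay = {"a": 1.0, "b": 2.0, "c": 1.5, "d": 0.5}
    print(trace_wns_path(gates, fanins, delay, required_time=3.0))
```

Because resizing a gate changes both its own delay and the load seen by its drivers, a sizer would need to call such a routine again after each accepted move, which is the regular update mentioned above.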
Statistical static timing analysis
The use of statistical approaches in timing analysis is relatively new. Pioneering works in this field appeared in [12, 13, 14]. However, in the past few years statistical techniques for timing analysis of circuits have received tremendous focus, with representative works including [15, 16, 17]. Static timing analysis relies on two operations for propagating timing through a network, sum and max. Performing these calculations on pdfs is more expensive computationally than their counterparts in the deterministic case. Moreover, the correlation between two pdfs needs to be taken into account for accurate calculations.

Such a circuit will typically exhibit the widest spread in performance due to high usage of smaller devices, which exhibit more manufacturing variability. Depending on the target application of the circuit, such a performance variance around the center can represent undesirable uncertainty that should be minimized. In [18], reduction of uncertainty was shown to be a key strategy for designing leading edge industrial designs. Decreasing variance can increase the overall yield of a design. An example of this is optimization 1 in Fig. 1, which yields more functional units at period T relative to the original design. However, our technique is quite general and is not limited to yield maximization. Decreasing performance variance is also desirable on several other accounts, even if it means relaxing the original timing targets. For example, circuits on the original curve to the left of "X" in Fig. 1 will exhibit undesirable variance in power consumption due to both dynamic and leakage power variations. These variations in turn contribute uncertainties in thermal dissipation and reliability verification. The effects of such performance variations can adversely affect product qualification and time-to-market. In such instances, the 2nd optimization point shown in Fig. 1 becomes desirable due to better tolerance to manufacturing variations.
Our research is aimed at providing designers with a statistically aware gate sizing methodology that allows arbitrary tradeoffs between mean and variance of circuit performance.
Problem formulation and motivation
The starting point for our problem is a technology mapped digital circuit. Without loss of generality, this paper focuses on combinational circuits. We ignore interconnect delay, though it can be readily accommodated. In fact, we postulate that our algorithm can help overcome the inherent interconnect uncertainty during pre-layout convergence by treating interconnect delays as random variables.
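Our method uses discrete probability distribution functions (pdfs) throughout. A discrete pdf for a random variable X is defined as one or more points $(x_i, p_i)$ where $p_i = P(X = x_i)$ and $\sum_i p_i = 1$. The mean and variance of a discrete random variable are given by

$$\mu_X = \sum_i x_i\, p_i, \qquad \sigma_X^2 = \sum_i (x_i - \mu_X)^2\, p_i$$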
We assume that every gate delay in the circuit is represented by a normally distributed random variable, which is consistent with the literature. Arrival times are propagated throughout the circuit as pdfs. We define the unconstrained timing variance minimization problem for a circuit as that of assigning gate sizes such that the variance of the arrival times at the circuit outputs is minimized.

Our full statistical analysis engine is based on [15]. This approach discretizes pdfs at a user-controlled sampling rate. We used 10-15 samples per pdf as a reasonable tradeoff between accuracy and speed. The operations sum and max are performed on discrete pdfs using shifting, scaling, and min/max reduction. In addition to propagating pdfs, we also calculate the mean and variance at every node and store these values for use in the fast timing engine (FASSTA). This component of our algorithm can be updated as needed to track the latest emerging research in statistical timing analysis, and it represents the outer loop of our iterations.
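To make the discrete-pdf operations concrete, here is a minimal sketch that represents a pdf as a list of (value, probability) points and computes its mean, its variance, and the sum and max of two independent pdfs. It enumerates all point pairs by brute force and does no resampling, so it is a stand-in for illustration only and not the shifting and scaling scheme of [15]; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def mean(pdf):
    """Mean of a discrete pdf given as [(value, probability), ...]."""
    return sum(x * p for x, p in pdf)


def variance(pdf):
    """Variance of a discrete pdf given as [(value, probability), ...]."""
    m = mean(pdf)
    return sum(p * (x - m) ** 2 for x, p in pdf)


def _combine(pdf_a, pdf_b, op):
    """Apply op to every pair of points, assuming the two pdfs are independent."""
    acc = defaultdict(float)
    for (xa, pa), (xb, pb) in product(pdf_a, pdf_b):
        acc[op(xa, xb)] += pa * pb  # merge points that land on the same value
    return sorted(acc.items())


def pdf_sum(pdf_a, pdf_b):
    """Pdf of A + B (arrival time plus arc delay)."""
    return _combine(pdf_a, pdf_b, lambda x, y: x + y)


def pdf_max(pdf_a, pdf_b):
    """Pdf of max(A, B) (latest of two arrival times)."""
    return _combine(pdf_a, pdf_b, max)


if __name__ == "__main__":
    a = [(1.0, 0.5), (3.0, 0.5)]
    b = [(2.0, 1.0)]
    s = pdf_sum(a, b)
    m = pdf_max(a, b)
    print(s, mean(s), variance(s))
    print(m, mean(m), variance(m))
```

The number of points grows with every operation in this naive form, which is why a practical engine resamples each result back to a fixed number of points such as the 10-15 samples mentioned above.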
Proposed approach
We studied several deterministic sizing techniques to evaluate their fitness as a basis for statistical sizing. Our preference for accurate gate delay models steered us away from methods [19, 20, 21] that require convex analytical expressions for gate delays. Such models do not adequately capture the nonlinearities in current and foreseeable DSM technologies, where manufacturing variations are prevalent. Our proposed approach is shown in Fig. 2. It builds on the deterministic algorithms presented in [8, 11]. We show next how we deal with new challenges that arise when timing constraints are represented by random variables.
We start with two normally distributed independent random variables A and B with expected values $\mu_A$, $\mu_B$ and standard deviations $\sigma_A$, $\sigma_B$. To calculate the max, we shall expand on the formulation in [22]. The resulting formulae, (1) and (2), cannot be evaluated directly because the integrals do not have analytical expressions and are expensive to compute. We show next how they can be avoided altogether. We reformulate the integral:

$$\int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

where erf denotes the error function. To calculate the error function, we use a quadratic approximation [23] which is accurate to two decimal places. We also note that the error function is odd:

$$\operatorname{erf}(-x) = -\operatorname{erf}(x)$$

These formulae give us a quick method to approximate the error function for any value. We substitute this approximation in (1) and (2). We note that if the means of A and B are sufficiently far apart relative to their standard deviations, the input with the higher mean dominates and the max reduces to that input; these are conditions (5) and (6).

Our justification for taking the partial derivatives with respect to the means of the delays is that the variances have a random component not under our direct control.
We observed that in the vast majority of cases, one of (5) or (6) applies, obviating the need for any max calculation, while in the remaining cases the approximations above provide quick estimates. These formulae assume independence of the random variables, which does not always hold. However, this approach emphasizes speed while retaining a reasonable degree of accuracy for small subcircuits. We stress that this approach is only used for the inner loop of the optimization, while the outer loop relies on the more accurate discrete pdf manipulation approach, which can track correlations due to reconvergent paths using Principal Component Analysis [17] or other methods as long as runtime is managed appropriately.
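For reference, the mean and variance of the max of two independent normal random variables can be written in closed form using the standard normal density and cumulative distribution (the classical moment-matching result usually attributed to Clark). The sketch below assumes this is the kind of formulation being built on, and it evaluates the cumulative distribution with the exact `math.erf` rather than the quadratic approximation of [23], so it shows the quantities being approximated, not the fast approximation itself. The function names are hypothetical.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution expressed through erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_of_normals(mu_a, sigma_a, mu_b, sigma_b):
    """Return (mean, variance) of max(A, B) for independent normals A and B."""
    a = math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    if a == 0.0:
        # Both inputs are deterministic: the max is simply the larger value.
        return max(mu_a, mu_b), 0.0
    alpha = (mu_a - mu_b) / a
    cdf = normal_cdf(alpha)
    pdf = normal_pdf(alpha)
    # First and second moments of max(A, B).
    m1 = mu_a * cdf + mu_b * (1.0 - cdf) + a * pdf
    m2 = ((mu_a ** 2 + sigma_a ** 2) * cdf
          + (mu_b ** 2 + sigma_b ** 2) * (1.0 - cdf)
          + (mu_a + mu_b) * a * pdf)
    # Clamp tiny negative values caused by floating-point cancellation.
    return m1, max(m2 - m1 * m1, 0.0)


if __name__ == "__main__":
    print(max_of_normals(0.0, 1.0, 0.0, 1.0))   # two identical inputs
    print(max_of_normals(10.0, 1.0, 0.0, 1.0))  # first input dominates
```

When one mean is far above the other relative to the combined spread, the result collapses to the moments of the dominant input, which is the situation in which the max calculation can be skipped entirely.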
Statistical critical path identification
As was pointed out in section 2.1, circuit optimization engines typically focus their effort on the critical or WNS path to improve the performance of the circuit. This section describes how we extend this concept to trace the Worst Negative Statistical Slack (WNSS) path in a circuit.
Consider a circuit consisting of 6 gates such as the one shown in Fig. 3. The first number in the parentheses represents the statistical mean of the delay for that arc, while the second one represents the standard deviation. We wish to determine the critical path with the biggest contribution to the variance at the output of node X. We note that, unlike the deterministic case, one cannot simply pick the input with the higher mean or variance to determine which input is most responsible for the variance at the output. This is due to the non-linearity of the statistical max operation, where all inputs contribute to the output max.
We proceed to solve this problem by considering the sensitivity of the variance at the output of a node with respect to its inputs as follows. Starting from a given gate, we compare its inputs pair-wise. If either of (5) or (6) is satisfied, then we pick the input with the higher mean as clearly having the dominant influence on the output of this gate. If neither of these equations is satisfied, we compare the sensitivities of the output variance to the means of the two inputs, $\partial \sigma_X^2 / \partial \mu_A$ and $\partial \sigma_X^2 / \partial \mu_B$. One approach to obtaining these sensitivities is to differentiate (3) directly. We found the resultant expressions to be complex, requiring expensive floating-point computations. Instead, we chose to use an approximation for differentiation as follows. Rewriting the output variance as a function $\sigma_X^2 = f(\mu_A, \sigma_A, \mu_B, \sigma_B)$, we use a forward finite-difference formula to approximate the partial derivative:

$$\frac{\partial \sigma_X^2}{\partial \mu_A} \approx \frac{f(\mu_A + h,\ \sigma_A + g,\ \mu_B,\ \sigma_B) - f(\mu_A,\ \sigma_A,\ \mu_B,\ \sigma_B)}{h}$$

We used values for h of the order of 1% of the mean. It should be noted that $\mu$ and $\sigma$ along a given path are correlated, and one cannot expect to change one value without the other being impacted. The change in $\sigma_A$ that can result from altering $\mu_A$ is indicated by g. We also note that it is impossible in general to determine g accurately, as the relationship between $\mu$ and $\sigma$ along a given path is governed by a combination of gate performance variations inversely proportional to their dimensions as well as unsystematic random variations that are unpredictable. For purposes of ranking inputs, the following linear approximation linking these two was found to be adequate:

$$g = c \cdot h$$

We used values for c equal to those assumed to relate mean delay through a gate to its variance.
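The finite-difference ranking described above can be sketched as follows. Several choices here are assumptions made for illustration: the output variance is computed with the closed-form moments of the max of two independent normals using the exact `math.erf`, the step is h = 1% of the perturbed mean (which must therefore be positive), the coupled change in standard deviation is g = c * h, and the two inputs are ranked by the magnitude of their sensitivities. All function names are hypothetical.

```python
import math


def max_variance(mu_a, sigma_a, mu_b, sigma_b):
    """Variance of max(A, B) for independent normals (closed-form moments)."""
    a = math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    if a == 0.0:
        return 0.0
    alpha = (mu_a - mu_b) / a
    cdf = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)
    m1 = mu_a * cdf + mu_b * (1.0 - cdf) + a * pdf
    m2 = ((mu_a ** 2 + sigma_a ** 2) * cdf
          + (mu_b ** 2 + sigma_b ** 2) * (1.0 - cdf)
          + (mu_a + mu_b) * a * pdf)
    return max(m2 - m1 * m1, 0.0)


def sensitivity_to_mean(mu_a, sigma_a, mu_b, sigma_b, c, rel_step=0.01):
    """Forward finite difference of the output variance w.r.t. the first mean.

    The standard deviation of the perturbed input moves with its mean: g = c * h.
    """
    h = rel_step * mu_a  # assumes mu_a > 0
    g = c * h
    f0 = max_variance(mu_a, sigma_a, mu_b, sigma_b)
    f1 = max_variance(mu_a + h, sigma_a + g, mu_b, sigma_b)
    return (f1 - f0) / h


def dominant_input(a, b, c):
    """Return "A" or "B" for the input whose mean most affects output variance.

    a and b are (mean, standard deviation) pairs; max is symmetric, so the
    sensitivity for B is obtained by swapping the argument order.
    """
    s_a = sensitivity_to_mean(a[0], a[1], b[0], b[1], c)
    s_b = sensitivity_to_mean(b[0], b[1], a[0], a[1], c)
    return "A" if abs(s_a) >= abs(s_b) else "B"


if __name__ == "__main__":
    print(dominant_input((10.0, 1.0), (5.0, 0.2), c=0.1))
    print(dominant_input((10.0, 1.0), (9.5, 2.0), c=0.1))
```

Each sensitivity costs two evaluations of the variance function, which keeps the pair-wise comparison cheap enough to apply at every gate while tracing the WNSS path.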
Subcircuit extraction and ranking
For every gate being evaluated for resizing, our algorithm extracts a subcircuit around this gate based on a user-controlled depth. We have found that using two levels of transitive fanins and fanouts is sufficiently accurate without being too costly to evaluate. For every available size for this gate, we use FASSTA to calculate the mean and variance of delay at the outputs of this subcircuit. In order to rank the relative merits of gate sizing in this subcircuit quickly, we use the following cost function. For all outputs of the subcircuit $O_1, \dots, O_n$, we calculate a weighted sum of mean and standard deviation:

$$\mathrm{Cost}(O_i) = \mu_{O_i} + \lambda\, \sigma_{O_i}$$

where $\lambda$ is a user-specified weight multiplier that ranks the relative importance of minimizing standard deviation against mean of delay. By choosing higher values for $\lambda$, the user can place more emphasis on variance reduction. We provide more analysis on the effect of varying $\lambda$ in the conclusions section at the end of the paper. The cost of the subcircuit is given by the maximum of $\mathrm{Cost}(O_i)$ across all outputs. We then pick the gate size that minimizes subcircuit cost across all gate sizes for the candidate gate.
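The ranking step can be sketched in a few lines. In this example the per-output (mean, standard deviation) pairs for each candidate size are hypothetical numbers standing in for values that the fast timing engine would supply, and the size names `X1` and `X4` are placeholders.

```python
def output_cost(mean, std, lam):
    """Weighted sum of mean and standard deviation for one subcircuit output."""
    return mean + lam * std


def subcircuit_cost(outputs, lam):
    """Cost of a subcircuit: the worst cost over all of its outputs."""
    return max(output_cost(mean, std, lam) for mean, std in outputs)


def best_size(candidates, lam):
    """Pick the gate size whose subcircuit cost is lowest.

    candidates maps a size name to the list of (mean, std) pairs observed at
    the subcircuit outputs when the gate is set to that size.
    """
    return min(candidates, key=lambda size: subcircuit_cost(candidates[size], lam))


if __name__ == "__main__":
    # Hypothetical timing results for two sizes of the same gate.
    candidates = {
        "X1": [(10.0, 2.0), (9.0, 2.5)],   # faster on average, wider spread
        "X4": [(11.0, 0.5), (10.5, 0.6)],  # slower on average, tighter spread
    }
    print(best_size(candidates, lam=0.0))
    print(best_size(candidates, lam=3.0))
```

With the weight set to zero the choice reduces to a mean-delay comparison, and raising it shifts the choice towards the size with the tighter spread, which is the user-driven tradeoff the cost function is meant to expose.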
Experimental results
The proposed approach was implemented in Java and run on an Intel PC running at 2.53 GHz. We tested the algorithm on various circuits from the ISCAS benchmarks and various sized ALU circuits. The circuits were first synthesized using Design Compiler [24] with an industrial 90nm lookup-table based standard cell library with 6-8 sizes per gate type. In line with other researchers, we added variations to the gate delays based on [25, 26]. Two variation components were added to the gate delays: one proportional to the delay through the gate and another random source corresponding to unsystematic manufacturing variations.

Several observations can be made from these results. Our algorithm consistently reduces the standard deviation while increasing mean delay and area. This behavior is expected since our algorithm favors bigger gate sizes that reduce the variance of delay across them. The algorithm's focus on minimizing variance also causes it to upsize gates near the outputs to reduce the overall variance at the circuit's output. This is done even if that path does not have the highest mean delay, in contrast to a worst mean-delay optimizer, which would not upsize such gates. This increases overall delay because the higher loading slows the predecessor gates.
Another important observation is that the number of gates along a timing path is inversely related to the variance along that path and to the ability to optimize it away. Paths with fewer gates tend to be more susceptible to variations. The smaller ALU circuits exhibit significant variations as a percentage of their mean. Our algorithm can reduce this variation substantially, but at a higher increase in area. On the other hand, circuit C6288, which is a 16x16 bit multiplier, has the longest depth of any of the circuits in the table. We note that it has the lowest improvement due to its already low $\sigma$ to $\mu$ ratio.
Concluding Remarks
We introduced a new concept of a worst negative statistical slack path and derived a procedure for tracing and optimizing such paths. In the process, we also derived a new approximation for the max operation on random variables for use in circuit optimization. Our approach allows us to steer the optimization process towards different mean-variance goals. The significance of this work is that it can be used during the design cycle to increase tolerance to the effects of manufacturing variations by trading off circuit delay and area requirements for reduced timing variance with user-controlled weights. We demonstrated the fidelity of our approach on ISCAS benchmarks, with consistent variance reduction in exchange for moderate increases in area and low increases in mean delay.
