With increasing process variations, low-VT swapping is an effective technique that can be used to improve timing yield without having to modify a design following placement and routing. Gate criticality, defined as the probability that a gate lies on a critical path, forms the basis for existing low-V T swapping techniques. This paper presents a simulation-based study that challenges the effectiveness of low-V T swapping based on the conventional definition of gate criticality, especially as random process variations increase with technology scaling. We introduce dominant gate criticality to address the drawbacks of the conventional definition of gate criticality, and formulate dominant critical gate ranking in the presence of process variations as an optimization problem. Simulation results for 12 benchmark circuits from the ISCAS and OpenSPARC suites to achieve timing yields of 95% and 98% indicate that low-V T swapping based on dominant gate criticality reduces leakage power overhead by 61% and 42% for independent and correlated process variations, respectively, over low-V T swapping based on conventional gate criticality.
Introduction
Process variations cause significant degradation in the yield of manufactured chips [1] , and these effects are expected to worsen with technology scaling. Process variations consist of a correlated component arising from wafer-to-wafer, die-to-die, and spatially correlated within-die variations, and an independent component arising from random variations. As random variations increase with technology scaling [2] , guard-banding approaches to improve timing yield result in pessimistic designs. Since leakage power is also an important factor in determining the yield [1] , improving timing yield with minimal impact on leakage power is a significant challenge for the future.
Statistical optimization techniques to improve timing yield by optimizing circuit parameters such as gate size, V T, and VDD early in the design cycle have been proposed in the literature (see [3]). However, since the impact of process variations can be predicted more accurately after place-and-route, engineering change order (ECO) techniques based on logic restructuring, buffer insertion, gate resizing, and low-V T swapping have been proposed to improve yield by fine-tuning the design [4, 5]. Since leakage power is also strongly influenced by process variations, these techniques try to enhance timing yield with minimum impact on leakage power.
This research was supported by NSF CAREER Award CCF-0746850. GLSVLSI '10, May 16-18, 2010.
Low-V T swapping is a preferred ECO technique for improving timing yield since it can be applied without modifying a design following placement and routing. Several optimization-based low-V T swapping techniques to improve yield have been proposed in literature [6] [7] [8] . The dynamic programming approach in [6] stores the best low-V T swapping choices, but becomes computationally expensive for circuits with a large number of reconvergent fanout paths. Techniques based on solving a continuous-V T optimization problem, followed by heuristic techniques to discretize the V T assignments [7, 8] , either do not produce good V T assignments or become computationally demanding as more complex discretization strategies are used. Given these limitations, practical techniques for low-V T swapping based on the concept of gate criticality have been proposed [5, 9] . Gate criticality is defined as the probability that a gate lies on a critical path and several techniques for gate criticality computation have been proposed [10] [11] [12] [13] .
Conventional techniques for low-V T swapping use metrics that combine gate criticality with leakage to rank and process candidates for timing yield enhancement. However, the effectiveness of such rank-and-swap techniques diminishes with each swap since the criticality of all the gates in the design changes after every swap. This is because the distribution of critical paths, and hence critical gates, changes after every low-V T swap. Although it is possible to repeat criticality computation after every swap or set of swaps, the need to run a statistical timing and yield analyzer for criticality computation makes this approach computationally exorbitant.
To address these shortcomings, we propose the concept of dominant critical gate ranking in this paper. Dominant critical gate ranking ensures that the set of top-ranked gates is a critical set of gates, i.e., a set of gates that is highly effective in improving the timing yield of the circuit. We formulate dominant critical gate ranking in the presence of process variations as an optimization problem. This optimization problem has to be solved only once to determine a ranking of the critical gates that can be effectively used to improve the timing yield of a circuit. The effectiveness of dominant critical gate ranking is illustrated by considering low-V T swapping of the top-ranked gates to improve the timing yield to 95% and 98%. For 12 benchmarks from the ISCAS and OpenSPARC suites, the results indicate that low-V T swapping based on dominant critical gate ranking requires 57% and 32% fewer swaps than conventional gate criticality for independent and correlated process variations, respectively. The reduced number of low-V T swaps translates to 61% and 42% reductions in leakage power overhead for achieving the same timing yield for independent and correlated process variations, respectively.
This paper is organized as follows. Sec. 2 motivates dominant gate criticality. Sec. 3 describes the optimization for dominant critical gate ranking for independent process variations. Sec. 4 extends dominant critical gate ranking to correlated process variations. Sec. 5 presents results for yield improvement and power reduction using low-V T swapping. Sec. 6 concludes the paper.
Motivation
In this section, we present results and observations for low-V T swapping based on the conventional definition of gate criticality to motivate the problem addressed in this paper. Although we use Monte Carlo simulations to compute the criticality of each gate in the circuit in the presence of process variations, techniques such as [10, 11, 12, 13] can also be used.
Process variations: Our framework considers process variations arising from random dopant fluctuations (RDF), variations in oxide thickness, and variations in gate length. Variations due to RDF and oxide thickness are assumed to be independent, resulting in independent variations in the threshold voltage of the gates. Variations in gate length are assumed to be spatially correlated. The correlation coefficient between gates g_i and g_j is given by the exponential correlation function [14]:
$$\rho_{g_i,g_j} = e^{-\alpha\, d_{g_i,g_j}}$$
where d_{g_i,g_j} is the distance between gates g_i and g_j obtained after placement and α is the decay factor of the correlation function. The value of α determines the degree of spatial correlation, with α = 0 and α = ∞ representing the completely correlated and independent cases, respectively. The 3σ variation of each parameter is assumed to be 25% of its mean value.
Before we present our observations on gate criticality, we examine the effect of process variations and correlations on the delay of a circuit with ten critical paths. Assume that the delay of each path has a Gaussian distribution with a mean of 15 and unit variance. Fig. 1 shows the delay distribution for 100K instances of the circuit in the presence of independent and correlated process variations. The critical path delay of the circuit is the maximum of the delays of the ten paths in the circuit. In the presence of independent process variations, the mean of the delay distribution is greater than 15. As the correlated component of process variations increases, the mean of the delay distribution shifts closer to its nominal value of 15, but the variance of the distribution shows an increasing trend. Hence, as the correlations increase, fewer chips fail to meet timing constraints, but those that fail do so by a larger margin. This is not a new observation and has been noted in previous works, e.g., [14]. This observation will be useful in explaining limiting trends in timing yield improvement for low-V T swapping based on the conventional definition of gate criticality [5, 9].
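The ten-path experiment can be reproduced in miniature. The sketch below is not the authors' setup: it assumes that every pair of paths shares the same correlation `rho` (built from one shared and one private Gaussian term per path), and it uses fewer samples than the 100K instances mentioned above to keep the run short. The function name `max_delay_stats` is hypothetical.

```python
import math
import random
import statistics

def max_delay_stats(rho, n_paths=10, n_chips=20000, mean=15.0, seed=0):
    """Mean and standard deviation of the circuit delay (max over paths).

    Assumed model: each path delay is mean + sqrt(rho)*shared + sqrt(1-rho)*private,
    so every path keeps unit variance and any two paths have correlation rho.
    """
    rng = random.Random(seed)
    shared_w = math.sqrt(rho)
    private_w = math.sqrt(1.0 - rho)
    delays = []
    for _ in range(n_chips):
        shared = rng.gauss(0.0, 1.0)  # component common to all paths on this chip
        worst = max(
            mean + shared_w * shared + private_w * rng.gauss(0.0, 1.0)
            for _ in range(n_paths)
        )
        delays.append(worst)
    return statistics.fmean(delays), statistics.stdev(delays)

for rho in (0.0, 0.5, 0.9):
    mu, sigma = max_delay_stats(rho)
    print(f"rho={rho:.1f}  mean={mu:.2f}  std={sigma:.2f}")
```

Under this assumed model, the mean of the maximum moves toward 15 and its spread grows as `rho` increases, which matches the trend described for Fig. 1.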
Simulation setup: Each circuit is optimized and mapped to a 45nm gate library based on the predictive technology model [15]. Static gate sizing based on geometric programming [16] is used to obtain a power-optimal gate size assignment for a target delay T_spec. Placement of the optimized circuit is performed using CAPO [17]. At this point, the circuit satisfies the target delay constraint T_spec under nominal process conditions. However, in the presence of process variations, the circuit has a low timing yield for target delay T_spec. Techniques such as logic restructuring, buffer insertion, gate resizing, and low-V T swapping have been proposed to improve the yield in the presence of process variations [4, 5]. Low-V T swapping is preferred for improving timing yield since it can be applied without modifying a place-and-routed design. Since low-V T swapping increases leakage power, it is important to minimize the leakage power overhead during low-V T swapping to improve timing yield.
Gate criticality: Gate criticality, defined as the probability that a gate lies on a critical path, has been used in the literature for low-V T swapping [5, 9]. We will show that low-V T swapping based on conventional gate criticality results in wasteful swapping of gates because the criticality of the gates changes after every low-V T swap. For each benchmark circuit, we use Monte Carlo simulations to obtain the critical probability of each gate in the circuit. We then rank the gates in decreasing order of criticality for low-V T swapping. The improvement in yield obtained after each swap is shown in Fig. 2 for two benchmark circuits from the ISCAS benchmark suite and two modules from the OpenSPARC T1 processor. The graph for each benchmark circuit has three yield improvement curves: (i) only independent process variations, (ii) process variations with low spatial correlation, and (iii) process variations with high spatial correlation. We make two observations about the yield improvement curves. First, the improvement in yield occurs in steps, i.e., discrete jumps in yield improvement are interspersed with regions of little or no yield improvement (flat regions). These steps are more prominent for independent process variations. The reason for the step-like improvement in yield lies in the definition of gate criticality. Since the critical probability of a gate is the probability that the gate lies on a critical path, when the criticality of a path p is translated to gate criticality, all the gates on p are affected equally by the critical probability of p. However, to speed up the path and improve the yield using low-V T swapping, only a few dominant gates on the path need to be chosen.
Hence, yield enhancement by low-V T swapping of gates in the order of their criticality leads to wasteful swapping of multiple gates on the same path instead of swapping gates on other critical paths that can lead to better improvements in yield, resulting in steps in the yield improvement curve.
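The first observation can be illustrated with a small Monte Carlo sketch. The circuit, the gate names, and the delay values below are hypothetical, and the sketch assumes independent Gaussian gate delays whose 3σ equals 25% of the mean (the paper applies that figure to process parameters, not directly to delays).

```python
import random

# Hypothetical toy circuit: two disjoint paths, each a list of (gate, mean delay).
PATHS = {
    "A": [("a1", 5.0), ("a2", 5.0), ("a3", 5.0)],
    "B": [("b1", 7.4), ("b2", 7.4)],
}
SIGMA_REL = 0.25 / 3.0  # assumed: 3-sigma of each gate delay is 25% of its mean

def gate_criticality(paths, n_samples=20000, seed=1):
    """Fraction of samples in which each gate lies on the longest path."""
    rng = random.Random(seed)
    hits = {gate: 0 for gates in paths.values() for gate, _ in gates}
    for _ in range(n_samples):
        totals = {
            name: sum(rng.gauss(mu, SIGMA_REL * mu) for _, mu in gates)
            for name, gates in paths.items()
        }
        critical = max(totals, key=totals.get)  # name of the longest path
        for gate, _ in paths[critical]:
            hits[gate] += 1  # every gate on that path is credited equally
    return {gate: count / n_samples for gate, count in hits.items()}

crit = gate_criticality(PATHS)
ranking = sorted(crit, key=crit.get, reverse=True)
print(crit)
print(ranking)
```

Because every gate on a path inherits the same critical probability, the ranking lists all three gates of path A before any gate of path B, which is the kind of ordering that produces flat regions in the yield improvement curve.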
Second, the number of swaps required to achieve the same timing yield increases as the correlated component of process variations increases, i.e., the slope of the yield improvement curve decreases. However, the steps in yield improvement become less prominent as the correlated component of variations increases, i.e., the discrete jumps become smaller and the flat regions become shorter. As we observed in Fig. 1 , when the independent component of variations dominates, many chips violate timing, but only by a small margin. Hence, low-V T swapping of only the dominant gates on a critical path suffices and swapping based on conventional gate criticality leads to more wasteful swapping of gates. As the correlated component of variations increases, fewer chips violate timing, but by a larger margin. Hence, low-V T swapping of a gate on a critical path leads to smaller improvements in yield (smaller jumps) and it becomes necessary to swap multiple gates on a path, resulting in less wasteful swapping of gates.
Dominant critical gates
Sec. 2 described the limitations of ranking the gates of a place-and-routed design for low-V T swapping based on their critical probability. In the following sections, we describe the formulation of the dominant critical gate ranking problem. This section introduces the problem formulation for independent process variations. Sec. 4 generalizes the formulation to handle correlations.
Consider a place-and-routed design such that the nominal critical path delay is equal to the target path delay δ, i.e., the effect of process variations on the timing of the design has not been taken into account. In the presence of variations, to achieve a desired yield γ for a target path delay δ, the nominal critical path delay must be less than or equal to δ/s, where s ≥ 1 is called the speed-up factor. The value of the speed-up factor depends on the desired yield γ and the process variations affecting the gates on the path. The computation of the speed-up factor is discussed at the end of this section. For this discussion, it is assumed that a speed-up factor s is known.
To achieve a speed-up factor of s for the path delay, each gate g_i on the path must be sped-up by a factor of s_i ≥ 1. We propose an optimization problem, dominant critical gate (DCG), for computing the speed-up s_i for all the gates in the circuit. The optimization problem is set up in such a manner that the speed-up values obtained by solving it reflect the dominant criticality of each gate, i.e., by ranking gates in the order of their speed-up for low-V T swapping, wasteful swapping of gates (as explained in Sec. 2) can be eliminated.
We start by analyzing the case of a single path. This will then be generalized to multiple paths and finally to a circuit later in this discussion. Consider a path with n gates and a target path delay δ. Let the mean delays of the gates on the path be δ_1, δ_2, ..., δ_n. The speed-up of each gate g_i is a variable s_i in the optimization problem SP and the desired speed-up of the path is a known value s.
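For reference, one way to write problem SP that is consistent with the description in this section (a product-of-speed-ups objective, a path delay requirement of δ/s, and s_i ≥ 1) is sketched below. This is a reconstruction from the surrounding text, not a verbatim copy of the authors' formulation.

```latex
\begin{aligned}
\text{SP:}\quad \min_{s_1,\dots,s_n}\ & \prod_{i=1}^{n} s_i \\
\text{s.t.}\ & \sum_{i=1}^{n} \frac{\delta_i}{s_i} \le \frac{\delta}{s}, \\
& s_i \ge 1, \qquad i = 1,\dots,n .
\end{aligned}
```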
We define the domination factor, d_i, of a gate g_i as the ratio of the contribution of the gate to reducing the path delay to the increase in the objective function when gate g_i is incrementally sped-up from δ_i/s_i to δ_i/(s_i + ε). Incrementally speeding-up the gate with the largest domination factor takes the solution closest to feasibility per unit increase in the objective function, and thus the optimization algorithm will choose to incrementally speed-up the gate on the path with the largest domination factor. For a single path, the domination factor d_i of gate g_i on the path is
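As a worked sketch of this ratio, assume the objective is the product of the speed-ups and let ε denote the small increment applied to s_i. Expanding the definition then gives the following; the last step drops ε in the denominator and is a first-order approximation.

```latex
d_i
= \frac{\dfrac{\delta_i}{s_i} - \dfrac{\delta_i}{s_i+\epsilon}}
       {(s_i+\epsilon)\prod_{j\ne i} s_j \;-\; \prod_{j=1}^{n} s_j}
= \frac{\dfrac{\epsilon\,\delta_i}{s_i\,(s_i+\epsilon)}}
       {\epsilon \prod_{j\ne i} s_j}
= \frac{\delta_i}{(s_i+\epsilon)\,\prod_{j=1}^{n} s_j}
\approx \frac{\delta_i / s_i}{s_1 s_2 \cdots s_n}
```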
The factor s_1 s_2 ··· s_n is common to the domination factor of all gates on the path, and thus the domination factor d_i of gate g_i is proportional to the delay of the gate with speed-up, i.e., δ_i/s_i. Hence, the optimization problem SP will incrementally speed-up gates with the largest delay until the target speed-up, s, for the path is achieved. This can be argued to be intuitively correct since, in the presence of independent process variations, the only systematic information available is the mean delays of the gates on the path, and hence for maximum improvement in timing yield the gate with the largest delay must be sped-up. Note that the objective function of minimizing the product of speed-ups, ∏_{i=1}^{n} s_i, ensures that a small number of gates are assigned a speed-up value greater than 1 and thus only the gates that dominate critical paths are sped-up.
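The behavior described above can be mimicked with a simple greedy loop. This is only an illustration of the domination-factor argument, not the authors' solver: it assumes the nominal path delay equals the target δ, and the function name, step size, and delay values are hypothetical.

```python
def greedy_speedups(delays, s_target, eps=1e-3):
    """Repeatedly bump the speed-up of the gate with the largest delta_i / s_i
    until the path delay drops to (sum of nominal delays) / s_target."""
    speedups = [1.0] * len(delays)
    budget = sum(delays) / s_target  # required path delay, delta / s

    def path_delay():
        return sum(d / s for d, s in zip(delays, speedups))

    while path_delay() > budget:
        # The gate with the largest sped-up delay has the largest domination factor.
        i = max(range(len(delays)), key=lambda k: delays[k] / speedups[k])
        speedups[i] += eps
    return speedups

delays = [5.0, 4.0, 3.0, 2.0, 1.0]  # hypothetical mean gate delays on one path
speedups = greedy_speedups(delays, s_target=1.2)
print([round(s, 3) for s in speedups])
```

In this example only the two slowest gates end up with a speed-up above 1, which reflects the claim that the product objective concentrates the speed-up on a few dominant gates.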
Next, we generalize this to the case of k paths, p_1, p_2, ..., p_k, each with delay δ, converging at a single gate g. Consider an optimization problem that generalizes Eqn. 3. Now there is one constraint for each path and the delay of the circuit is the maximum delay over all paths. If gate g has the maximum delay on all paths, then g would be chosen for incremental speed-up by the same argument used for a single path. Next, if there is at least one path p_i (but not all paths) on which g has the maximum delay, the optimization algorithm would again choose to incrementally speed-up gate g. This is because any other gate on p_i would have a sub-optimal domination factor, and speeding-up gates on other paths would be wasteful because p_i would dominate the delay of the circuit. Finally, the only case that remains is when each path p_i has a gate g_i (different from gate g) with the maximum delay on path p_i. Similar to Eqn. 3, the domination factor, d, when gates g_1, g_2, ..., g_k are incrementally sped-up by ε_1, ε_2, ..., ε_k is given by Eqn. 4. Note that the numerator, a minimum taken over the k paths, is the incremental reduction in the delay at the output of gate g for an incremental reduction in the delay of the k paths p_1, p_2, ..., p_k. The summation in the denominator represents the first-order ε_i terms of the increase in the objective function; the higher-order ε_i terms are neglected.
Since the factor s_1 s_2 ··· s_n in the denominator is common to the domination factor of all gates, it can be dropped from the expression. Further, the equality in the expression can be converted into an inequality, and the result can be simplified using the scalar product inequality (a·b ≤ |a||b|) for the infinity norm, which yields Eqn. 5. The domination factor, d_g, of gate g, where paths p_1, p_2, ..., p_k converge, is given by δ_g/s_g. Eqn. 5 shows that the largest domination factor among g_1, g_2, ..., g_k must exceed the domination factor of g by at least a factor of k in order for gate g to not be chosen for incremental speed-up. The factor of k arises because the gates g_1, g_2, ..., g_k are topologically inferior to gate g: each of g_1, g_2, ..., g_k lies on only one critical path whereas g lies on k critical paths. Thus, the optimization problem is formulated so that the domination factor of each gate is scaled by the number of critical paths passing through that gate, i.e., by the domination arising from the topology of the circuit. The problem formulation is therefore correctly directed towards speeding-up gates that dominate the critical paths.
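One derivation that is consistent with the steps described above is sketched here. It is a reconstruction, not the authors' Eqns. 4 and 5: it assumes that the min in the numerator is bounded by the mean over the k paths and that the scalar product inequality is applied with the 1-norm paired against the infinity norm. The shorthand a_i and b_i is introduced only for this sketch.

```latex
\begin{aligned}
d &= \frac{\min_{1\le i\le k}\left(\dfrac{\delta_{g_i}}{s_{g_i}}
        - \dfrac{\delta_{g_i}}{s_{g_i}+\epsilon_i}\right)}
        {s_1 s_2 \cdots s_n \displaystyle\sum_{i=1}^{k} \frac{\epsilon_i}{s_{g_i}}}
   && \text{(denominator to first order in } \epsilon_i\text{)} \\[4pt]
d &\propto \frac{\min_i a_i b_i}{\sum_i a_i},
   \qquad a_i = \frac{\epsilon_i}{s_{g_i}},\quad b_i = \frac{\delta_{g_i}}{s_{g_i}}
   && \text{(common factor dropped, numerator to first order)} \\[4pt]
  &\le \frac{\tfrac{1}{k}\sum_i a_i b_i}{\sum_i a_i}
   \;\le\; \frac{\tfrac{1}{k}\,\lVert a\rVert_1\,\lVert b\rVert_\infty}{\lVert a\rVert_1}
   \;=\; \frac{1}{k}\,\max_{1\le i\le k}\frac{\delta_{g_i}}{s_{g_i}}
\end{aligned}
```

Under these assumptions, speeding-up g_1, ..., g_k can beat speeding-up g (whose factor is δ_g/s_g after dropping the same common term) only if the largest δ_{g_i}/s_{g_i} is at least k times δ_g/s_g, which is the factor-of-k statement made in the text.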
Finally, we generalize the optimization problem to a circuit with n gates. Since various paths in the circuit share gates, the structural properties of the circuit will play a crucial role in determining the speed-up of each gate in the optimization problem. The path-based constraints are converted into node-based arrival time constraints [16] . The optimization problem DCG is setup such that a speed-up, s, in the circuit delay is achieved collectively by speeding-up the dominant critical gates in the circuit.
where
1. s_i is the speed-up factor of the i-th gate,
2. δ_i is the delay of the i-th gate,
3. T_i is the arrival time at the output of the i-th gate, and
4. T_spec is a specified circuit delay.
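Using the symbols listed above, a node-based formulation consistent with the description of DCG could look as follows. This is a sketch under stated assumptions, not the authors' exact problem: it assumes the same product objective as the single-path case, arrival times of zero at the primary inputs, and a required circuit delay of T_spec/s.

```latex
\begin{aligned}
\text{DCG:}\quad \min\ & \prod_{i=1}^{n} s_i \\
\text{s.t.}\ & T_j + \frac{\delta_i}{s_i} \le T_i
   && \text{for every fanin } j \text{ of gate } i, \\
& T_i \le \frac{T_{spec}}{s}
   && \text{for every primary output } i, \\
& s_i \ge 1, && i = 1,\dots,n .
\end{aligned}
```

In this form the objective is a monomial and each constraint is a posynomial bounded by a monomial, so the problem can be treated as a geometric program.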
The result of this optimization problem is a speed-up value, s_i, for each gate in the circuit, where s_i is the dominant criticality of gate i. We have observed that setting the speed-up, s, to a value in the range 1.1-1.4 gives the best results in most cases. There are two interesting observations about the problem formulation DCG.
1. DCG is a geometric program (GP) optimization problem in the continuous domain. However, it is used to solve the problem of low-V T swapping, which is inherently a discrete optimization problem. A GP-based problem formulation ensures that the technique is computationally efficient and scalable to full-chip optimization.
2. DCG does not contain the notion of statistical yield or statistical timing. This is because when independent process variations dominate, the only systematic information that can be used during design is the nominal delay. DCG only uses the nominal gate delays in the problem formulation.
The solution to the optimization problem DCG provides a speed-up for each gate in the circuit. The dominant critical gate ranking is obtained by ranking the gates in the decreasing order of speed-up.
To compare algorithm DCG with conventional gate criticality based on Monte Carlo simulations, we plot a histogram of the critical weight for the top-ranked gates obtained using dominant critical gate ranking (see Fig. 3(a)). Critical weight is defined based on conventional gate criticality as the critical probability of a gate normalized by the highest critical probability among all gates. For three circuits, C2670, sparc_ifu_dec, and sparc_lsu_ctl, the top-ranked gates in DCG have a wide range of critical weights. The circuit C499 is an exception because the paths in C499 are well-balanced, resulting in a critical weight close to 1 for most of the gates. A wide range of critical weights arises because algorithm DCG assigns a high rank to only the dominant critical gates on critical paths and then ranks other gates on less critical paths that offer higher potential for timing yield improvement, even though they may have a small critical weight. We support this claim by plotting the timing yield improvement of the circuits using low-V T swapping. This plot is shown in Fig. 3(b) for the four benchmark circuits and a target yield of 98%. The plot marked MC represents the timing yield improvement obtained by ranking gates based on the conventional criticality metric and the plot marked DCG represents the timing yield improvement obtained by ranking gates based on algorithm DCG. On average, algorithm DCG requires only about half the number of low-V T swaps as compared to MC for the same yield. The lower number of low-V T swaps translates to lower leakage power overhead to achieve the same timing yield, as reported in Tables 1 and 2 for all benchmark circuits.
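The two quantities compared here, critical weight and the DCG ranking, are simple to compute once criticalities and speed-ups are available. The sketch below uses made-up values for four hypothetical gates; the function names are illustrative only.

```python
def critical_weights(criticality):
    """Critical probability of each gate divided by the largest critical probability."""
    top = max(criticality.values())
    return {gate: p / top for gate, p in criticality.items()}

def dcg_ranking(speedups):
    """Gates sorted in decreasing order of their speed-up."""
    return sorted(speedups, key=speedups.get, reverse=True)

# Hypothetical values; not taken from the paper's benchmarks.
criticality = {"g1": 0.80, "g2": 0.80, "g3": 0.40, "g4": 0.10}
speedups = {"g1": 1.35, "g2": 1.00, "g3": 1.20, "g4": 1.05}

weights = critical_weights(criticality)
ranking = dcg_ranking(speedups)
print(weights)
print(ranking)
```

In this made-up example, g2 has the highest critical weight but the lowest speed-up, so it falls to the bottom of the speed-up ranking while g3 and g4, with smaller critical weights, rank above it.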
DCG with correlated process variations
In this section, we extend algorithm DCG to handle correlated process variations. The use of nominal gate delays, δ_i, in algorithm DCG is justified when independent variations are the dominant component of the total process variations. This is because gate delay variations are decoupled in the presence of independent process variations and the only systematic information about the post-manufacturing delay of the gates available during design is their nominal delay. However, when the correlated component of process variations is also considered, the nominal gate delays do not accurately capture the post-manufacturing delay distribution of the gates, and thus we would expect algorithm DCG to be less effective. For the exponential spatial correlation model, we observed that DCG was less effective when the correlation between two gates separated by a unit distance accounted for more than 15-20% of the total variations at each gate.
Algorithm DCG can be extended to leverage correlation information in the process variations. This is accomplished by using correlated gate delay distribution samples instead of the nominal gate delay, δ_i, in algorithm DCG. The correlated gate delay distribution samples can be generated from the correlation matrix of the process variations. The speed-up s_i obtained for each gate g_i from algorithm DCG is then averaged over multiple correlated gate delay distribution samples to obtain the speed-up in the presence of correlated process variations. The final speed-up value defines the dominant gate criticality ranking of the gates. In this approach, the quality of the solution depends on the number of correlated gate delay distribution samples used. However, we observed that for the largest benchmark circuits, using more than a few hundred samples offers diminishing returns in solution quality relative to the increase in runtime. In this paper, we average the speed-up of each gate over 1000 correlated delay distribution samples to obtain the final speed-up for each gate.
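The sampling-and-averaging procedure can be sketched for a single path. Several pieces below are assumptions rather than the authors' method: gate delays are perturbed directly by correlated Gaussian noise (3σ = 25% of the mean), correlated samples are drawn with a Cholesky factor of an exponential correlation matrix, and a closed-form single-path solver (`path_speedups`, a stand-in for the GP) caps the largest delays at a common level. All names and numbers are hypothetical.

```python
import math
import random

def exp_correlation(positions, alpha):
    """Correlation matrix with entries exp(-alpha * distance)."""
    return [[math.exp(-alpha * math.dist(p, q)) for q in positions] for p in positions]

def cholesky(a):
    """Lower-triangular L with L * L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low

def path_speedups(delays, budget):
    """Stand-in single-path solver: cap the largest delays at a common level
    chosen so the capped delays sum to the budget; s_i = max(1, delay_i / cap)."""
    if sum(delays) <= budget:
        return [1.0] * len(delays)
    order = sorted(delays, reverse=True)
    rest = sum(order)
    for m, d in enumerate(order, start=1):
        rest -= d  # total delay of the gates that stay uncapped
        cap = (budget - rest) / m
        nxt = order[m] if m < len(order) else 0.0
        if cap >= nxt:
            break
    return [max(1.0, d / cap) for d in delays]

def averaged_speedups(nominal, positions, alpha, s_target, n_samples=200, seed=0):
    """Average the per-gate speed-up over correlated delay samples."""
    rng = random.Random(seed)
    low = cholesky(exp_correlation(positions, alpha))
    sigma_rel = 0.25 / 3.0  # assumed: 3-sigma of each gate delay is 25% of its mean
    budget = sum(nominal) / s_target
    totals = [0.0] * len(nominal)
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in nominal]
        corr = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(z))]
        sample = [d * (1.0 + sigma_rel * c) for d, c in zip(nominal, corr)]
        for i, s in enumerate(path_speedups(sample, budget)):
            totals[i] += s
    return [t / n_samples for t in totals]

nominal = [5.0, 4.0, 3.0, 2.0, 1.0]
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
avg = averaged_speedups(nominal, positions, alpha=0.5, s_target=1.2)
print([round(s, 3) for s in avg])
```

Ranking the gates by `avg` in decreasing order would then give the correlated counterpart of the ranking used for independent variations.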
As noted in Fig. 1 , from the perspective of timing yield, correlated process variations cause fewer chips to fail, but by a larger timing margin. Hence, from a dominant gate criticality perspective, multiple gates on a path would have a high dominant gate criticality rank in order to counter the effect of process variations on the path. This effect is captured by the correlated delay distribution samples and thus incorporated in the final speed-up of each gate. The plot of critical weights shown in Fig. 4(a) illustrates this effect. Although the overall distribution of critical weights is similar to the case of independent process variations shown in Fig. 3(a) , the number of gates chosen for each criticality weight is higher. Fig. 4(b) compares the yield improvement curves for the four benchmark circuits using algorithm DCG and conventional gate criticality.
Results
In this section, we present and compare results for yield improvement using low-V T swapping based on gate criticality ranking obtained using algorithm DCG and the conventional definition of gate criticality. The effectiveness of each criticality ranking will be assessed based on the number of low-V T swaps and leakage power overhead required to achieve a target yield. The comparison will demonstrate the effectiveness of algorithm DCG in identifying small sets of dominant critical gates to achieve the same timing yield with a lower leakage power overhead as compared to ranking them using conventional gate criticality.
The techniques are compared using 12 benchmark circuits from the ISCAS benchmark suites and modules from the OpenSPARC T1 processor. The simulation setup used for comparison was described in Sec. 2. On average over the various gates in the library, low-V T gate cells improve the delay by 20% and increase the leakage power dissipation of a gate by 11X. Tables 1 and 2 present results for independent and correlated process variations, respectively. The name and number of gates for each benchmark circuit are reported in the first two columns of the tables. The critical probabilities of the gates are obtained using 100K Monte Carlo runs for each benchmark circuit. The gates are then ranked in decreasing order of critical probability for performing yield improvement based on low-V T swapping. The results for this technique are reported in the column MC in the tables for target yields of 95% and 98%. Results for yield improvement based on dominant critical gate ranking obtained from algorithm DCG for the same target yields are reported in the column DCG. The number of low-V T swaps and the leakage power overhead over the base design (without any low-V T swaps) are reported in the columns "No. swaps" and "Leakage ovh.", respectively, for each technique and yield combination. The runtime for each technique in seconds is indicated under "Runtime".
Results indicate that timing yield improvement using low-V T swapping of gates based on algorithm DCG requires 57% and 32% fewer swaps than the conventional metric of gate critical probability for 
