Advanced process technologies call for proactive consideration of process variations during design to ensure high parametric timing yield. Despite its widespread use in almost all high-performance IC designs today, buffer insertion has not received enough attention in addressing this issue. In this paper, we propose a novel algorithm for buffer insertion that considers process variations. The major contribution of this work is two-fold: (1) an efficient technique to handle correlated process variations under nonlinear operations; (2) a provable transitive closure pruning rule that makes linear complexity variation-aware pruning possible. The proposed techniques enable an efficient implementation of variation-aware buffer insertion. Compared to an existing algorithm considering process variations, our algorithm achieves more than 25× speed-up. We also show that, compared to the conventional deterministic approach, the proposed buffer insertion algorithm considering correlated process variations improves the parametric timing yield by more than 15%.
INTRODUCTION
Advanced process technologies impose significant challenges on modern IC designers as we move into the ultra-deep submicron era, where manufactured circuits exhibit substantial process variations. Proactive consideration of process variations during the design stage is critical to ensure high parametric timing yield. Studies on process variations include variability modeling [1], statistical static timing analysis (SSTA) [2, 3], and, more recently, statistical optimization such as gate sizing [4] and power reduction [5]. However, although buffer insertion is critical for almost all of today's high-performance IC designs, it has not received enough attention with respect to process variations, despite the fact that the deterministic buffer insertion problem has been studied extensively with different objectives in the literature [6, 7].
One of the difficulties in solving the buffer insertion problem considering process variations is to model buffer solutions as correlated random variables. Such correlations are not only due to the shared global variation or spatial variation as encountered in SSTA, but also due to the way buffer solutions are computed. Because almost all existing buffering algorithms follow, to some degree, the same dynamic programming paradigm [6], where solutions are computed recursively from downstream nodes, solutions from the same subtree are inherently correlated. Moreover, the operations involved in computing different buffer solutions are nonlinear, like the minimum and multiplication operations, which further complicates the handling of correlated variations. Another difficulty is how to define the pruning rule between different solutions in the presence of process variations. Without an efficient pruning rule, a straightforward implementation of the dynamic programming based buffering algorithm would have a complexity that increases exponentially.
To the best of our knowledge, there are four recent publications that have attempted to solve the buffer insertion problem considering process variation [8, 9, 10, 11]. However, none of them has addressed the above difficulties with definite answers. For example, [8] considered only the effect of wire length variation, even though wire length variation is not a typical process variation, and it assumed that there was no correlation between different solutions. Moreover, three heuristic pruning rules were proposed, none of which can bound the complexity of the algorithm. [9] proposed to capture the correlation between solutions by using the joint probability density function (JPDF), and to compute the JPDF numerically to handle the nonlinear operations on the correlated random variables. A two-side threshold based pruning rule was proposed to reduce the runtime complexity. However, the complexity of computing the JPDF numerically is high, and the effectiveness of the two-side threshold based pruning rule is not clear. [10] studied a similar buffer insertion problem but without modeling the interconnect variation as random variables, thus avoiding the problem of handling nonlinear operations on random variables. Under the assumption of an ideal setting (like infinitely long two-pin nets) with simplification, [11] showed that buffer insertion is not sensitive to process variation. However, it is not clear whether or not that statement still holds for designs in a realistic setting (e.g., finite wire length and multiple pins).
The contribution of this work is as follows. We propose a novel algorithm for buffer insertion to consider correlated process variations. We develop an efficient technique to handle the correlated process variations under nonlinear operations. We present a provable transitive closure pruning rule that makes linear complexity variation-aware pruning possible. Equipped with the above techniques, we show an efficient implementation of the variation-aware buffer insertion algorithm. Compared to the existing algorithm considering process variations [9] , our algorithm achieves more than 25× speed-up. We also show that compared to the conventional deterministic approach, our buffer insertion algorithm considering correlated process variations improves the parametric timing yield by more than 15%.
PRELIMINARY

Deterministic Buffer Insertion
For a given buffered routing tree, two figures-of-merit are associated with every legal buffer position t in the tree: i.e., the input loading capacitance (or downstream loading capacitance) Ct and the required arrival time Tt. The basic buffer insertion problem formulation is to find the locations of buffers in the given routing tree such that the required arrival time (RAT) at the root is maximized [6] . In the context of dynamic programming, we first traverse the routing tree in the reverse topological order and propagate solutions bottom up while book-keeping all intermediate solutions. At the end (root), we pick the optimal solution with the largest RAT. By backtracking the chosen optimal solution, we determine the optimal solutions for each sub-tree recursively.
We characterize a device (or buffer) in terms of its gate capacitance (Cb), intrinsic delay (Tb) and output resistance (Rb). For a given interconnect segment in the layout, we characterize it by its lumped resistance Rw and capacitance Cw. When each interconnect segment in the routing tree is modeled by a π model, under the Elmore delay model, Ct and Tt can be computed recursively as follows. By adding a wire at node n, all solutions at n are propagated to the other end of the wire t as follows
$$C_t = C_n + C_w \tag{1}$$
$$T_t = T_n - R_w C_n - \tfrac{1}{2} R_w C_w \tag{2}$$
By adding a buffer at node n, all solutions at n are propagated to the input of the buffer as follows
$$C_t = C_b \tag{3}$$
$$T_t = T_n - T_b - R_b C_n \tag{4}$$
If the solution at node t is obtained by merging two solutions from its two sub-trees rooted at nodes m and n, respectively, then solutions at the merging point are computed by
$$C_t = C_n + C_m \tag{5}$$
$$T_t = \min(T_n, T_m) \tag{6}$$
Knowing the above three key operations, the deterministic dynamic programming based buffer insertion can be solved by recursively applying the above three operations to obtain new solutions as we traverse the routing tree bottom-up.
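As a minimal sketch of the three key operations, assuming the standard Elmore-delay forms for a π-model wire (C grows by Cw and T drops by Rw·Cn + Rw·Cw/2), a buffer (C becomes Cb and T drops by Tb + Rb·Cn), and a merge (capacitances add, RAT is the minimum), the bottom-up propagation could look as follows. The names `Sol`, `add_wire`, `add_buffer` and `merge` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Sol(NamedTuple):
    C: float  # downstream loading capacitance seen at this node
    T: float  # required arrival time (RAT) at this node


def add_wire(s: Sol, Rw: float, Cw: float) -> Sol:
    """Propagate a solution across a wire modeled as a pi segment (Elmore delay)."""
    return Sol(s.C + Cw, s.T - Rw * s.C - 0.5 * Rw * Cw)


def add_buffer(s: Sol, Rb: float, Cb: float, Tb: float) -> Sol:
    """Insert a buffer: upstream now sees only the buffer input capacitance."""
    return Sol(Cb, s.T - Tb - Rb * s.C)


def merge(a: Sol, b: Sol) -> Sol:
    """Join two sub-tree solutions at a branching point."""
    return Sol(a.C + b.C, min(a.T, b.T))


# Illustrative values only (arbitrary units).
sink = Sol(C=1.0, T=10.0)
after_wire = add_wire(sink, Rw=2.0, Cw=4.0)
after_buffer = add_buffer(after_wire, Rb=1.0, Cb=0.5, Tb=1.0)
```

Applying these three functions recursively while walking the routing tree from the sinks to the root yields the candidate solutions that the dynamic program keeps at each node.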
Process Variation Modeling
To model the impact of process variation on both device and interconnect characteristics, we represent these characteristics of interest as random variables, which are complicated functions of the underlying process parameters. We employ a first order approximation to capture the major component of the variation. The rationale is that if the underlying parametric variations are small, any nonlinear relationship can be reasonably captured by a first order approximation. Such an approximation has been well accepted for statistical timing analysis [2, 3]. Mathematically, it can be described as
where A is the characteristic of interest, and Fi are the underlying physical process parameters, which in general are not independent of one another. The nominal value of A is a0, and the sensitivities of A with respect to Fi are given by ai. If the function that relates the characteristic of interest to the process parameters is known, we can obtain (7) analytically via the first order Taylor expansion of the function. Otherwise, SPICE simulation can be used to extract the nominal value a0 and the sensitivities ai. We further identify the following three sources of variation in the process parameter Fi: the inter-chip global variation, the intra-chip spatial variation, and the purely random variation. The three sources of variation are independent by definition, as the mechanisms causing them are different.
where fi,0 is the nominal value of Fi, Xi,g is the inter-chip global variation that is shared among all Fi within the same chip, Yi,s is the intra-chip spatial variation, and Xi,r is the random variation. To capture the spatial correlation between different Yi,s, we represent the spatial variation part as follows:
where Xi,k are independent random variables, and di,k are the coefficients of Xi,k, which determine the amount of correlation between different Fi. We can obtain (9) by using the methods in [12, 13]. The authors of [12] proposed to obtain (9) by using a quad-tree like spatial correlation model, while the authors of [13] proposed to obtain the coefficients di,k by measuring the distance between the physical location of Fi and the location of the modeled Xi,k. By plugging (9) and (8) into (7) we obtain
which is further simplified as
It is easy to see that (10) is a first order canonical form for A, where X is a random vector that includes the inter-chip global source of variation X g , the spatial correlation X i,k , and the random variation X i,r ; and α is the corresponding coefficient vector of X. Because all random variables in X are mutually independent and follow a standard normal distribution, i.e., X ∼ N (0, I), the mean value of A is given by A0 in (10), while the variance of A is given by α T α.
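To make the first order canonical form concrete, the following minimal sketch stores a quantity as a nominal value plus a sensitivity vector over independent standard normal sources. Under that assumption the variance is the squared norm of the sensitivity vector, and the covariance of two quantities is the dot product of their vectors. The class `Canon` and the helper names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Canon:
    mean: float        # nominal value A0
    coef: List[float]  # sensitivities (alpha) to independent N(0, 1) sources


def dot(u: List[float], v: List[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def variance(a: Canon) -> float:
    # Var(A) = alpha^T alpha because the sources are independent N(0, 1).
    return dot(a.coef, a.coef)


def covariance(a: Canon, b: Canon) -> float:
    # Shared sources are what make two canonical forms correlated.
    return dot(a.coef, b.coef)


def correlation(a: Canon, b: Canon) -> float:
    return covariance(a, b) / math.sqrt(variance(a) * variance(b))


def add(a: Canon, b: Canon) -> Canon:
    # Linear operations keep the canonical form exactly.
    return Canon(a.mean + b.mean, [x + y for x, y in zip(a.coef, b.coef)])


# Two quantities that share only the first (for example, global) source.
A = Canon(10.0, [3.0, 4.0, 0.0])
B = Canon(5.0, [3.0, 0.0, 4.0])
```

Because correlation is carried entirely by the shared entries of the sensitivity vectors, no explicit joint density has to be stored.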
VARIATION AWARE BUFFERING
To consider process variations, we model all characteristics of interest, such as Tb, Cb and Rb for devices and Rw and Cw for interconnect, as random variables in canonical form, i.e., Tb = Tb0 + γ
Applying these random variables to (1) to (6), we obtain solutions Ct and Tt that are functions of Tb, Cb, Rb, Rw and Cw; hence they are also random variables. However, because of the nonlinear operators (multiplication and minimum) involved in computing a new solution, the distributions are no longer in first order canonical form. To compute the exact probability density function (PDF) of Ct or Tt after the above nonlinear operations, [9] resorted to a numerical method in which hyper-dimensional integration is performed to obtain the distributions of Ct and Tt exactly. However, such an approach is undesirable for two reasons. First, the exact computation of the PDF after nonlinear operations is both difficult and numerically expensive. Second, computing the exact PDF numerically means operating on a totally different PDF representation at each iteration. Therefore, to make the computation efficient, we propose a novel approximation technique that keeps Ct and Tt in first order canonical form after nonlinear operations without much loss of accuracy.
To proceed, we represent all downstream node solutions by the following first order canonical forms:
At the sink, Cn is the loading capacitance and Tn is the required arrival time, which are known from design specification.
If the solution at node t is obtained by adding a wire at its direct downstream node n, we compute the new solution at node t as follows:
where
(·Rw0·ηw + Cw0·ζw), and Γ = ζw·α
It is obvious that Ct in (13) is already a canonical form, but Tt in (14) is not, due to the quadratic term X^T Γ X. To represent Tt as a canonical form, we have the following theorems:
Theorem 1. Given a random vector X that follows a standard multivariate Gaussian distribution N(0, I), i.e., E(X) = 0 and E(X X^T) = I, for any vector δ and matrix Γ, we have:
where E(·) is the expectation operation of a random variable, and tr(·) is the trace operation of a matrix which takes the sum of diagonal elements of the matrix.
where α, β, ζ, and η are all vectors.
We then compute the first two moments of Tt in (14) as follows:
Knowing the first two moments, we compute the mean and variance of Tt in (14) as follows:
We then approximate (14) by the following canonical form that matches its mean and variance with (20) and (21), i.e.,
The above approximation is justified because the amount of variation is relatively small compared to the nominal value. Therefore, by matching the mean and variance, we only lose accuracy in the higher order terms (third moment and above). Experimental results confirm the effectiveness of this approximation.
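The moment-matching step for the wire operation can be sketched as follows. The sketch assumes the Elmore wire update Tt = Tn − Rw·(Cn + Cw/2) and uses the standard Gaussian identities E[(a·X)(b·X)] = a·b and Var[(a·X)(b·X)] = (a·b)² + |a|²|b|² for X ~ N(0, I). How the matched canonical form is built is an assumption made here: the linear coefficients are rescaled so that the variance agrees, which may differ from the exact form used in (22). `Canon` and `wire_rat` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Canon:
    mean: float
    coef: List[float]  # sensitivities to independent N(0, 1) sources


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def wire_rat(Tn: Canon, Cn: Canon, Rw: Canon, Cw: Canon) -> Canon:
    """Approximate Tt = Tn - Rw * (Cn + Cw / 2) by a canonical form
    with the same mean and variance (moment matching)."""
    # D = Cn + Cw / 2 is linear, so it is still an exact canonical form.
    d0 = Cn.mean + 0.5 * Cw.mean
    u = [a + 0.5 * e for a, e in zip(Cn.coef, Cw.coef)]
    r0, zeta = Rw.mean, Rw.coef
    # Linear part of Tt after expanding the product Rw * D.
    delta = [b - r0 * ui - d0 * zi for b, ui, zi in zip(Tn.coef, u, zeta)]
    zu = dot(zeta, u)  # expectation of the quadratic term (zeta.X)(u.X)
    mean = Tn.mean - r0 * d0 - zu
    # The linear and quadratic parts are uncorrelated (odd Gaussian moments vanish).
    var = dot(delta, delta) + zu * zu + dot(zeta, zeta) * dot(u, u)
    lin = dot(delta, delta)
    # Assumption: rescale the linear coefficients to absorb the extra variance.
    scale = math.sqrt(var / lin) if lin > 0.0 else 0.0
    return Canon(mean, [scale * d for d in delta])


# Illustrative single-source example (arbitrary units).
Tt = wire_rat(Canon(100.0, [2.0]), Canon(4.0, [0.2]),
              Canon(2.0, [0.1]), Canon(2.0, [0.1]))
```

In this example the nominal RAT is 90, the matched mean is slightly lower because the quadratic term has a nonzero expectation, and the variance grows by the small contribution of that term.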
If the solution at node t is obtained by adding a buffer at its direct downstream node n, we compute new solutions as follows:
It is obvious that Ct in (23) is already a canonical form, but Tt in (24) is not, due to the quadratic term X^T Γ X. Using a technique similar to the one discussed above, we approximate (24) via the same canonical form as shown in (22), but with its own Tt0, δt, Γ and X.
If the solution at node t is obtained by merging two solutions from its two sub-trees rooted at nodes m and n, respectively, we compute the new solutions Ct by
which is already a first order canonical form. To express Tt after the minimum operation in first order canonical form as well, we resort to the concept of tightness probability [3]. The tightness probability tn,m is the probability that Tn is less than Tm, and it is computed by
where Φ is the cumulative distribution function (CDF) of a standard normal distribution; and σn,m is given by
According to [15] , the mean and variance of the statistical version of T t =min(T n , T m ) can be computed exactly as follows:
where φ is the probability density function (PDF) of a standard normal distribution, evaluated at the same argument as Φ in (26). We then approximate the statistical Tt in (6) after merging two solutions as follows:
which is exactly the mean value of Tt as shown in (28). To match the variance of (30) to the exact variance as shown in (29), we scale the βt in (30) to obtain the new canonical form as follows:
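The merge step can be sketched with Clark's closed-form moments for the minimum of two jointly normal variables. The sketch assumes that the merged sensitivity vector is the tightness-weighted blend of the two input vectors, rescaled to match the exact variance, which is one reading of the scaling described above. `Canon` and `stat_min` are hypothetical names; `statistics.NormalDist` is from the Python standard library.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist
from typing import List

_STD = NormalDist()  # standard normal, used for both Phi and phi


@dataclass
class Canon:
    mean: float
    coef: List[float]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def stat_min(Tn: Canon, Tm: Canon) -> Canon:
    """Approximate min(Tn, Tm) by a canonical form with matched mean and variance."""
    # Standard deviation of Tn - Tm; shared sources cancel out here.
    theta = math.sqrt(sum((x - y) ** 2 for x, y in zip(Tn.coef, Tm.coef)))
    if theta == 0.0:
        # The two solutions differ only by a constant: the minimum is exact.
        return Tn if Tn.mean <= Tm.mean else Tm
    a = (Tm.mean - Tn.mean) / theta
    t = _STD.cdf(a)    # tightness probability P(Tn < Tm)
    pdf = _STD.pdf(a)
    vn, vm = dot(Tn.coef, Tn.coef), dot(Tm.coef, Tm.coef)
    mean = t * Tn.mean + (1 - t) * Tm.mean - theta * pdf
    second = (t * (Tn.mean ** 2 + vn) + (1 - t) * (Tm.mean ** 2 + vm)
              - (Tn.mean + Tm.mean) * theta * pdf)
    var = max(second - mean * mean, 0.0)
    # Blend the sensitivities by tightness, then rescale to match the variance.
    beta = [t * x + (1 - t) * y for x, y in zip(Tn.coef, Tm.coef)]
    lin = dot(beta, beta)
    scale = math.sqrt(var / lin) if lin > 0.0 else 0.0
    return Canon(mean, [scale * b for b in beta])


# Two independent standard normal RATs with equal nominal values.
merged = stat_min(Canon(0.0, [1.0, 0.0]), Canon(0.0, [0.0, 1.0]))
```

For this symmetric example the exact mean of the minimum is −1/√π and the exact variance is 1 − 1/π, which the sketch reproduces.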
By replacing (1) and (2) with (13) and (22), (3) and (4) with (23) and (22), and (5) and (6) with (25) and (31), respectively, we have replaced all key operations needed in a deterministic buffer insertion algorithm with their respective variation-aware counterparts. Therefore, we obtain a variation-aware buffer insertion algorithm. Moreover, because we always keep solutions in first order canonical form after each operation, we can apply the same technique recursively to compute all new solutions while traversing the routing tree bottom-up.
Note that the same approximation method via moment matching techniques is not restricted to the first-order canonical model, and it can be extended to handle other nonlinear (like quadratic) forms as well.
VARIATION AWARE PRUNING
Review of Deterministic Pruning
The major complexity of dynamic programming based buffer insertion lies in the merging of two sets of solutions obtained from two different sub-trees. In general, the total number of possible combinations for merging is n · m, where n and m are the numbers of solutions from the two sub-trees, respectively. If all combinations are kept at each merging node, the number of solutions will grow exponentially towards the root. To avoid this problem, [6] proposed to define a dominance relationship (or pruning rule) between two solutions such that solution (C1, T1) dominates solution (C2, T2) if the conditions C1 < C2 and T1 > T2 are both satisfied. In other words, solution (C2, T2) is redundant and can be removed. [6] proved that instead of n · m, there would be no more than n + m solutions after pruning. Even though pruning helps reduce the total number of solutions, in general we still have to pay the price of O(n · m) at each node in order to obtain all possible combinations for merging. In deterministic buffer insertion, this cost is further reduced to O(n + m) by using a merge sort like operation on the two sets of already sorted solutions. Based upon the above two linear operations for pruning and merging, [6, 7] proved that by keeping only dominating solutions at every node, the dynamic programming based algorithm can solve the buffer insertion problem in O(N^2) time without losing optimality, where N is the number of possible buffer locations. When there are B types of buffers in the library, [16] proved that the deterministic buffer insertion problem can be solved optimally in O(B · N^2).
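The two linear-time operations reviewed above can be sketched as follows, with solutions represented as (C, T) tuples. This is an illustration rather than the implementation of [6]: ties in T are treated as redundant here, which is slightly stricter than the strict inequalities in the dominance rule, and the function names are hypothetical.

```python
def prune(solutions):
    """Drop dominated (C, T) pairs; the sweep is linear once the list is sorted by C."""
    ordered = sorted(solutions, key=lambda s: (s[0], -s[1]))
    kept = []
    best_T = float("-inf")
    for C, T in ordered:
        # A solution survives only if it improves T over every cheaper solution.
        if T > best_T:
            kept.append((C, T))
            best_T = T
    return kept


def merge_lists(A, B):
    """Merge two pruned lists (sorted by C, hence also by T) in O(n + m)."""
    i = j = 0
    out = []
    while i < len(A) and j < len(B):
        (Ca, Ta), (Cb, Tb) = A[i], B[j]
        out.append((Ca + Cb, min(Ta, Tb)))
        # Advance the side that limits the RAT; pairing it with a larger-C
        # partner from the other side could only add load without raising T.
        if Ta <= Tb:
            i += 1
        if Tb <= Ta:
            j += 1
    return out


pruned = prune([(3, 5), (1, 2), (2, 4), (2.5, 3), (4, 5)])
merged = merge_lists([(1, 2), (2, 4)], [(1, 3), (3, 5)])
```

Both routines rely on the fact that a pruned list is ordered in C and T at the same time, which is exactly the property that the variation-aware pruning rule has to preserve.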
Transitive Closure Based Pruning
We observe that for deterministic buffer insertion, the linear time operations for pruning and merging are made possible because of the following two properties: (1) for any two given solutions, there exists an ordering property between them so that comparing them is always possible, i.e., T 1 is either greater than T 2 or less than T 2 ; (2) there exists a transitive ordering property between solutions, i.e., if T1 > T2 and T2 > T3, then T1 > T3 (similarly, if C1 < C2 and C2 < C3, then C1 < C3). If we ensure that the above two properties hold for solutions considering process variations, we can achieve similar linear time complexity for both pruning and merging. In the following, we propose a new variation aware pruning rule that enables us to keep both merging and pruning operations in linear complexity even in the presence of process variations. We first extend the deterministic dominance relation between (C 1 , T 1 ) and (C 2 , T 2 ) by enforcing:
The physical interpretation of this extension is that solution (C1, T1) results, with 100% probability (almost always), in a larger required arrival time and a smaller loading capacitance than solution (C2, T2). We have the following lemma:
Lemma 1. Given random variables T1, T2 and T3, if P(T1 > T2) = 1 and P(T2 > T3) = 1, then P(T1 > T3) = 1.
Proof: Let X = T1 − T2 and Y = T2 − T3, and let the JPDF of X and Y be f(x, y). From P(T1 > T2) = P(X > 0) = 1, and because f(x, y) ≥ 0 for all x and y, we have f(x, y) = 0 for x < 0. Similarly, from P(T2 > T3) = P(Y > 0) = 1, we have f(x, y) = 0 for y < 0. Therefore, we have
$$P(T_1 > T_3) = P(X + Y > 0) \ge \int_0^{\infty}\!\!\int_0^{\infty} f(x, y)\,dx\,dy = 1.$$
As we know P(T1 > T3) ≤ 1, we must have P(T1 > T3) = 1. ✷
Lemma 1 shows that comparison between T1 and T2 based upon P(T1 > T2) = 1 enforces the transitive ordering property between solutions. However, it is not always possible to compare two given solutions in this way. Moreover, in practice, such a 100% probability requirement is too restrictive. Therefore, we relax the requirement by adding two parameters such that solution (C1, T1) is said to dominate solution (C2, T2) if the following two conditions hold:
where pL and pT are two parameters between 0.5 and 1. Note that, in the context of pruning, it is not interesting to take a value less than 0.5 for either pT or pL. In other words, the probability of C1 being less than C2 is greater than pL, while the probability of T1 being greater than T2 is greater than pT. We call the pruning rule defined by (34) and (35) a transitive closure based pruning rule. As we will see in the following, the values of the two parameters, pL and pT, determine how well the two desired properties (the ordering property and the transitive ordering property between solutions) hold. We have the following lemma:
Lemma 2. Given T1 and T2 as two dependent random variables with arbitrary distributions, we have either P (T1 > T2) ≥ 0.5 or P (T1 < T2) ≥ 0.5.
Proof: The proof follows directly from the fact that P(T1 > T2) + P(T1 < T2) = 1. ✷
Lemma 2 shows that comparison based upon P(T1 > T2) > 0.5 results in a proper ordering between two random solutions T1 and T2. Therefore, the remaining problem is whether or not such a pruning rule preserves the transitive ordering property between solutions. Unfortunately, for arbitrary distributions, we can show that it does not preserve the transitive ordering property in general. In other words, for arbitrary random distributions, when pL = 1 and pT = 1, the transitive ordering property holds but not necessarily the ordering property; when pL = 0.5 and pT = 0.5, the ordering property holds but not necessarily the transitive ordering property. Therefore, for the two properties to hold simultaneously, we may have to impose some restrictions on the type of distributions of those random solutions. In the following, we prove that when the random solutions follow a joint normal distribution, both properties indeed hold simultaneously, as stated in the following lemma:
Lemma 3. Given T1, T2 and T3 as three dependent random variables with joint normal distributions, if P(T1 > T2) > 0.5 and P(T2 > T3) > 0.5, then P(T1 > T3) > 0.5.
Proof: To see this, note that both T1 and T2 are normal, so the probability of T1 > T2 has the following closed form according to [15]:
$$P(T_1 > T_2) = \Phi\!\left(\frac{\mu_{T_1} - \mu_{T_2}}{\sigma_{T_1,T_2}}\right) \tag{36}$$
where Φ is the cumulative distribution function (CDF) of a standard normal distribution; µT1 and µT2 are the mean values of T1 and T2, respectively; and σT1,T2 can be computed by
$$\sigma_{T_1,T_2} = \sqrt{\sigma_{T_1}^2 + \sigma_{T_2}^2 - 2\rho_{T_1,T_2}\,\sigma_{T_1}\sigma_{T_2}} \tag{37}$$
where σ²T1 and σ²T2 are the variances of T1 and T2, respectively, and ρT1,T2 is the correlation coefficient of T1 and T2.
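For solutions kept in first order canonical form, the closed-form comparison and the relaxed dominance test could be evaluated as in the following sketch. It assumes that the standard deviation of T1 − T2 equals the norm of the difference of the two sensitivity vectors, which follows from the independence of the underlying sources, and it uses strict inequalities against pL and pT. `Canon`, `prob_greater` and `dominates` are hypothetical names.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist
from typing import List


@dataclass
class Canon:
    mean: float
    coef: List[float]  # sensitivities to independent N(0, 1) sources


def prob_greater(A: Canon, B: Canon) -> float:
    """P(A > B) for two jointly normal canonical forms."""
    # sigma^2 = var(A) + var(B) - 2 cov(A, B) = |coef_A - coef_B|^2
    sigma = math.sqrt(sum((x - y) ** 2 for x, y in zip(A.coef, B.coef)))
    if sigma == 0.0:
        # Fully correlated with equal spread: the comparison is deterministic.
        return 1.0 if A.mean > B.mean else 0.0
    return NormalDist().cdf((A.mean - B.mean) / sigma)


def dominates(C1, T1, C2, T2, pL=0.5, pT=0.5) -> bool:
    """(C1, T1) dominates (C2, T2) if P(C1 < C2) > pL and P(T1 > T2) > pT."""
    return prob_greater(C2, C1) > pL and prob_greater(T1, T2) > pT


T1, T2 = Canon(10.0, [1.0, 0.0]), Canon(9.0, [0.0, 1.0])
C1, C2 = Canon(1.0, [0.1, 0.0]), Canon(2.0, [0.0, 0.1])
p = prob_greater(T1, T2)  # Phi(1 / sqrt(2)), roughly 0.76
```

With pL = pT = 0.5 the test reduces to comparing mean values, which is why the deterministic sorted sweep carries over unchanged in that case.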
Because Φ is a strictly increasing function with Φ(0) = 0.5, we have Φ(x) > 0.5 if and only if x > 0. Therefore, P(T1 > T2) > 0.5 is equivalent to µT1 > µT2. Likewise, P(T2 > T3) > 0.5 is equivalent to µT2 > µT3. Together these give µT1 > µT3, and hence P(T1 > T3) > 0.5. ✷
Proof of Theorem 3: According to Lemmas 2 and 3, we can compare two random solutions and order them in much the same way as in the deterministic approach. Therefore, following arguments similar to those in [6, 7] and [16], we conclude that our variation aware buffer insertion algorithm under the transitive closure based pruning rule has the same complexity as the deterministic algorithm, which is O(B · N^2). ✷
Next we discuss the extension of the above transitive closure pruning rule to other choices of pL and pT and see how the two desired properties are affected. In fact, we have the following theorem, which proves that for pT (or pL) between 0.5 and 1, the transitive ordering property always holds.
Theorem 4. Given T1, T2 and T3 as three dependent random variables with joint normal distributions, if P(T1 > T2) > PT and P(T2 > T3) > PT, then P(T1 > T3) > PT for any constant PT between 0.5 and 1, i.e., 0.5 ≤ PT ≤ 1.
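Proof: Let X = T1 − T2 and Y = T2 − T3. Because T1, T2 and T3 are joint normal, X and Y are also normal. Denote the PDF of X as N(µx, σx) and the PDF of Y as N(µy, σy), where µx and σx (similarly µy and σy) are the mean and standard deviation of X (similarly Y), respectively. Hence the PDF of X + Y, which is also a normal distribution, is N(µx + µy, √(σx² + σy² + 2ρσxσy)), with ρ being the correlation coefficient between X and Y. We have
$$P(T_1 > T_2) = P(X > 0) = 1 - \Phi\!\left(-\frac{\mu_x}{\sigma_x}\right) \tag{38}$$
where Φ is the CDF of a standard normal distribution. According to the property of the standard normal distribution, we have Φ(−t) = 1 − Φ(t); therefore we have
$$P(T_1 > T_2) = \Phi\!\left(\frac{\mu_x}{\sigma_x}\right) \tag{39}$$
As we already know P(T1 > T2) > PT, we hence have Φ(µx/σx) > PT. Since any CDF is a non-decreasing function, we have
$$\frac{\mu_x}{\sigma_x} > t \tag{40}$$
where Φ(t) = PT. Moreover, for 0.5 ≤ PT ≤ 1, we have t ≥ 0. Similarly, from P(T2 > T3) > PT, we have
$$\frac{\mu_y}{\sigma_y} > t \tag{41}$$
From (40) and (41), we have
$$\mu_x > t\,\sigma_x, \qquad \mu_y > t\,\sigma_y \tag{42}$$
$$\mu_x + \mu_y > t\,(\sigma_x + \sigma_y) \tag{43}$$
Because −1 ≤ ρ ≤ 1, it is easy to show that
$$\sigma_x + \sigma_y \ge \sqrt{\sigma_x^2 + \sigma_y^2 + 2\rho\,\sigma_x\sigma_y} \tag{44}$$
As t ≥ 0, by multiplying both sides of (44) by t and then combining it with (43), we have
$$\mu_x + \mu_y > t\,\sqrt{\sigma_x^2 + \sigma_y^2 + 2\rho\,\sigma_x\sigma_y} \tag{45}$$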
Therefore, by the fact that Φ is a non-decreasing function and (45), we finally have
$$P(T_1 > T_3) = P(X + Y > 0) = \Phi\!\left(\frac{\mu_x + \mu_y}{\sqrt{\sigma_x^2 + \sigma_y^2 + 2\rho\,\sigma_x\sigma_y}}\right) > \Phi(t) = P_T.$$ ✷
Having proved the transitive ordering property, it is tempting to ask whether the ordering property in Lemma 2 also holds for different choices of pL and pT. Unfortunately, such an extension is not true in general. However, in practice, we believe that the ordering property still holds (or approximately holds) for different choices of pL and pT. To see this, we plot the probability of T1 being greater than T2 under three correlation coefficients (0, 0.5 and 0.9) as shown in Figure 1, where the x-axis is the difference between the means of T1 and T2, and the y-axis is P(T1 > T2). The first three curves have σT1 = σT2, and the rest of the curves have σT1 = 3σT2. According to Figure 1, as the difference between µT1 and µT2 becomes larger, the probability of T1 being greater than T2 also increases. For a given required probability, say pL = 0.85, it only requires µT1 to be greater than µT2 by less than 4 time units. In practice, for a general routing tree, such a small delay difference will likely exist among different solutions, either due to the difference in routing or due to the difference in buffering. Moreover, the ordering becomes even better when two solutions have similar variances and higher correlation, which is the case for our variation aware buffer insertion because solutions from the same sub-tree or nearby sub-trees are inherently highly correlated.
Based upon Theorem 4 and the above discussion, we believe that, in practice, the transitive closure based pruning rule with choices of pL and pT other than 0.5 could also ensure the two desired properties, hence enabling us to carry out the variation aware buffer insertion algorithm in much the same way as in the deterministic case. Therefore, a runtime complexity conclusion similar to that of Theorem 3 can be drawn.
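The discussion of how large a mean difference is needed for two solutions to be ordered at a given probability can be made concrete with the inverse normal CDF. For jointly normal T1 and T2, P(T1 > T2) > p holds exactly when µT1 − µT2 exceeds Φ⁻¹(p) times the standard deviation of T1 − T2. The function name and the numeric inputs below are illustrative assumptions and are not taken from Figure 1.

```python
import math
from statistics import NormalDist


def required_mean_gap(p: float, sigma1: float, sigma2: float, rho: float) -> float:
    """Smallest mean difference mu1 - mu2 for which P(T1 > T2) reaches p,
    assuming T1 and T2 are jointly normal with correlation rho (0 < p < 1)."""
    sigma = math.sqrt(sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2)
    return NormalDist().inv_cdf(p) * sigma


# Equal unit spreads, probability level 0.85, for the three correlation values.
gaps = {rho: required_mean_gap(0.85, 1.0, 1.0, rho) for rho in (0.0, 0.5, 0.9)}
```

The required gap shrinks as the correlation grows, which matches the observation that highly correlated solutions from the same or nearby sub-trees are easier to order.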
EXPERIMENT RESULTS
Experiment Setting
Two sets of benchmarks are obtained from the public domain for our experiments [17]. The characteristics of the benchmarks are shown in Table 1.

Bench   Sinks   Buffer Positions
p1      269     537
p2      603     1205
r1      267     533
r2      598     1195
r3      862     1723
r4      1903    3805
r5      3101    6201

Table 1: Characteristics of benchmarks.
Because of the lack of access to real wafer data, we derive the process variation data from the literature that addresses similar process variation issues in the context of statistical timing analysis [18]. In our experiment, the 65nm BSIM technology is assumed. We budget the random device variation, inter-die global variation, intra-die spatial variation, and interconnect variation each to be 5% of the respective nominal value. Moreover, to model the spatial variation, we divide the chip layout into grids with a grid length of 500µm. For devices located in a particular grid, their characteristics are affected by a set of nearby grids. We distribute the budgeted 5% spatial variation into different regions, with the sensitivity of each region forming an isotropic stationary Gaussian process whose value tapers off at a distance of about 2mm.
Comparison Base
Among the reported works [8, 9, 10] in literature, only [9] considered both device and interconnect process variations and provided an exact solution. Therefore, we use [9] as our comparison base in the following.
Because of the complication due to nonlinear operations in computing Ct and Tt, [9] employed a numerical method to compute the distributions of Ct and Tt explicitly. Moreover, to make the numerical method tractable, [9] further assumed that the variations in devices and interconnects are independent, hence ignoring their shared inter-chip global variation and intra-chip spatial correlations. The PDFs of Ct and Tt are computed numerically via hyper-dimensional integration, which is very expensive and slow to attain reasonable accuracy. In contrast, we employ the first order canonical form to represent both device and interconnect variations, so that the correlation between them is implicitly considered. Moreover, we compute the new canonical form for Ct and Tt via the novel approximation technique discussed in section 3, thus avoiding the complexity of numerical integration while maintaining similar accuracy.
[9] also proposed a two-side threshold based pruning rule by relating the dominance relationship of solutions to designers' willingness of accepting uncertainty for a given design. That is, a threshold value π α gives a measure of a designer's preference for certainty in choosing the design parameter x in the presence of variations, such that the final design would have x less than πα with (100α)% certainty. Mathematically, this is given by
where f(x) is the PDF of x. Given two different thresholds for each of Ct and Tt, for example π_αl and π_αu for Ct, and π_βl and π_βu for Tt, such that 0 ≤ αl < αu ≤ 1 and 0 ≤ βl < βu ≤ 1, solution (C1, T1) is said to dominate solution (C2, T2) if the following conditions are satisfied:
In other words, C1's upper threshold π(1)_αu is smaller than C2's lower threshold π(2)_αl, while T1's lower threshold π(1)_βl is larger than T2's upper threshold π(2)_βu. Despite its intuitive definition, such a pruning rule is computationally expensive to use for large designs, because the number of solutions after pruning is not guaranteed to be linear.
Runtime Comparison
We first compare the runtime between our algorithm and [9] in terms of the largest benchmark that each algorithm can handle. As reported in [9], the largest routing tree has only nine (9) sinks. In contrast, our algorithm can easily handle routing trees with more than three thousand sinks (> 3000), and it seems that nothing prevents our algorithm from handling even larger benchmarks. In this sense, we improve the capacity of the algorithm by more than a thousand times.
Furthermore, we note that one of the main reasons preventing [9] from trying larger benchmarks is that their two-side threshold based pruning rule is not very effective in either pruning or merging solutions. To verify this speculation, we obtain the source code from [9] and modify it to use the same process variation models as used in this paper. We denote it as T2P because the two-side threshold based pruning rule is used. In other words, we avoid the high complexity of computing the JPDF numerically and instead use the same first order canonical form to represent the JPDF implicitly. Therefore, the only difference between our variation aware algorithm VAW and the newly implemented T2P algorithm is the pruning rule. In Table 3, we report the runtime for both algorithms on the benchmarks we have tested. According to Table 3, the newly implemented T2P algorithm can now handle much larger benchmarks than what was originally reported in [9], and the largest benchmark it completes is p1 with 269 sinks. This improvement is mainly due to the avoidance of computing the JPDF explicitly. However, we still fail to run larger benchmarks with the improved algorithm of [9]. In fact, for the rest of the tested benchmarks, it fails by exceeding either the memory capacity (2 GB) or the tolerable time limit (4 hours in our setting). This observation is expected, because the two-side threshold based pruning rule only imposes a partial ordering between solutions, rendering the complexity of merging and pruning very high. In contrast, by using the newly proposed transitive closure pruning rule, our VAW algorithm can easily run through all benchmarks, and for the largest benchmark r5 the runtime is about 3 minutes. This significant runtime speedup is achieved because the transitive closure pruning rule discussed in section 4 enforces a relatively strict ordering between solutions, thus enabling an efficient implementation of both merging and pruning.
We also report the runtime for the conventional deterministic worst case buffering algorithm (WORST) in Table 3. We find that our VAW algorithm runs slower than WORST, but this is expected because of the additional computation needed to handle correlated process variations. If we plot the runtime of VAW versus the number of legal buffer positions, however, we can see that the runtime of our VAW algorithm scales almost linearly with respect to the benchmark size.
RAT Optimization
Enabled by the efficient implementation of buffer insertion considering process variations, we run our buffer insertion algorithm on the benchmarks for RAT optimization and study the effect of process variation on buffered interconnect design. We report the experiment results for both the conventional deterministic worst case design and our variation aware buffer insertion in Table 2. The RAT is defined as the 3-sigma RAT in the distribution for both algorithms. The 3-sigma yield of VAW is defined as 100%, which in turn is used to find the timing yield for the worst case design. From Table 2, we can see that compared to the conventional worst case design, our variation aware buffer insertion improves the 3-sigma RAT by 0.6% on average, and the parametric timing yield by more than 15%. This highlights the importance of developing efficient algorithms for IC designs to actively attack process variation effects. Interestingly, we observe that, for some relatively small benchmarks, the improvement in RAT and yield is almost negligible, while for some large benchmarks, the improvement is quite significant. We also run the same set of experiments for different choices of pL and pT in (34) and (35). However, among all tested experiments, we observe almost no difference in the final optimal RAT at the root. These observations to some degree agree with what has been reported in [11] for infinitely long two-pin nets. We are currently looking into a theoretical explanation of the above observations. We also report the number of buffers inserted by both algorithms in Table 2. We see that our variation aware buffering algorithm tends to put more buffers into the design to combat the correlated process variations than the deterministic worst case design. This conclusion is in line with what has been reported in [9].
Finally, we verify the accuracy of our approach in predicting the RAT distribution via Monte Carlo simulation. Given a buffered routing tree with process variations, we run our algorithm to compute the RAT at the root in canonical form. Its PDF is then obtained by computing its mean and variance and approximating it as a normal distribution. We find that our algorithm is reasonably accurate in predicting the PDF of the RAT and tends to be conservative. On average, less than 10% difference is observed for all benchmarks we have tested.
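A Monte Carlo check of this kind can be sketched on a single merge step. The sketch samples the independent standard normal sources, evaluates two RATs given in canonical form, and averages their minimum, so that the sampled mean can be compared with an analytic prediction. It illustrates the verification idea only and is not the experimental setup behind the reported numbers; the tuple layout (mean, coefficients), the function name, the trial count and the seed are assumptions.

```python
import math
import random


def sample_min_mean(Tn, Tm, trials=20000, seed=0):
    """Monte Carlo estimate of E[min(Tn, Tm)] for canonical forms (mean, coef)."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    k = len(Tn[1])
    total = 0.0
    for _ in range(trials):
        # One draw of the shared independent N(0, 1) sources.
        x = [rng.gauss(0.0, 1.0) for _ in range(k)]
        tn = Tn[0] + sum(c * xi for c, xi in zip(Tn[1], x))
        tm = Tm[0] + sum(c * xi for c, xi in zip(Tm[1], x))
        total += min(tn, tm)
    return total / trials


# Two independent standard normal RATs: the exact mean of the minimum is -1/sqrt(pi).
estimate = sample_min_mean((0.0, [1.0, 0.0]), (0.0, [0.0, 1.0]))
exact = -1.0 / math.sqrt(math.pi)
```

Repeating the comparison for the variance, and for the RAT at the root of a full buffered tree, would follow the same pattern.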
CONCLUSION
A novel algorithm for buffer insertion considering process variation has been proposed. We have developed an efficient approximation technique to handle the correlated process variations under nonlinear operations. We have also proposed a provable transitive closure pruning rule that enables efficient implementation of the buffer insertion algorithm considering correlated process variations. We have applied the algorithm for timing optimization and concluded that process variation must be considered to achieve optimal designs for parametric timing yield, and buffer insertion considering correlated variation improves the timing yield by more than 15% on average.
