Abstract - Advanced process technologies pose significant challenges, especially when manufactured circuits exhibit substantial process variations. Accounting for process variations is therefore critical to ensuring high parametric timing yield. During the design stage, fast estimation of the achievable buffered delay can guide more accurate and efficient wire planning and timing analysis in floorplanning or global routing. In this paper, we derive approximated first-order canonical forms for buffered delay estimation that consider the effect of process variations and the presence of buffer blockages. We empirically show that an existing deterministic delay estimation method is over-pessimistic and thus results in unnecessary design rollback. The experimental results also show that our method estimates buffered delay with 4% average error while achieving up to 149 times speedup compared to a state-of-the-art statistical buffer insertion method.
I. Introduction
Buffer insertion has become an essential technique for interconnect optimization in physical synthesis. An industry study [1] predicts that at the 65nm process technology, 35% of the cells on a chip will be buffers. Therefore, one must be able to efficiently and accurately estimate the impact of buffer insertion on a design in earlier stages, such as floorplanning. To this end, Alpert et al. [2] proposed a linear-time algorithm, an extension of Otten's theory [3], for predicting interconnect delays of multi-fanout nets in the presence of buffer blockages. Accordingly, [2] can accurately assess the buffering impact in an early design stage without actually performing buffer insertion for the nets. The estimation results of [2] are within 5% average error under the Elmore delay model when compared to a classic buffer insertion method, i.e., van Ginneken's algorithm [4]. According to [2], one could embed the fast estimation technique into a floorplanning or routing algorithm to estimate the timing cost of a net.
However, as critical dimensions scale faster than the process technology that controls them, technologies beyond 90nm exhibit significant variations [5]. Traditional analysis and optimization methods under nominal circuit parameters, e.g., [2] and [4], become too risky in the presence of process variations. Recently, many statistical static timing analysis (SSTA) approaches [6, 7] have emerged, which greatly increase analysis accuracy by propagating distributions instead of single values. Building on these results, statistical optimization techniques for gate sizing [8], power reduction [9], and even buffer insertion [10, 11, 12, 13, 14, 15] have also appeared to alleviate the impact of circuit parameter deviations. To the best of our knowledge, no recent publication addresses the problem of variation-aware buffered delay estimation. We conducted an experiment to see whether it is necessary to consider process variations in predicting buffered delay. We observed that the deterministic buffered delay estimation (DBDE) method [2] evaluated at the corner μ+3σ, which is the worst-case corner for DBDE, is too pessimistic. Take Figure 1 for example. If we choose as the timing constraint the delay at 99% timing yield computed by a statistical buffer insertion (SBI) algorithm [15] (indicated by the blue straight line in Figure 1), the over-pessimistic result estimated by [2] at the worst-case corner (indicated by the red dotted line in Figure 1) will force a designer to roll back the design without knowing that there is a 99% probability of satisfying the given constraint. The observation is reasonable because in recent technology generations, variability has been dominated by the Back-End-of-the-Line (BEOL), or interconnect metallization, and has thus become more uncorrelated than before, as mentioned in [5].
Consequently, traditional corner-based techniques, which perform optimization at some extreme values, are no longer applicable since the number of cases or corners required for confident coverage has grown tremendously. Therefore, it is necessary to develop a more effective approach for statistical delay estimation, which can give a designer a confidence value for deciding whether to roll back the design under a given constraint. As shown in Figure 1, the delay distribution computed by our statistical buffered delay estimation (SBDE) technique (to be presented in Section III-B) is close to the one computed by SBI, and the timing yield of our distribution is 99.8% under the same timing constraint as mentioned above.
In this paper, we study the problem of variation-aware buffered delay estimation. We build on the deterministic linear-time algorithm [2] and derive first-order canonical forms for the major operations in the algorithm, using approximation techniques that preserve the correlation between random variables. Equipped with these techniques, we develop an efficient implementation of the variation-aware buffered delay estimation algorithm. The experimental results show that our method estimates buffered delay with 4% average error while running up to 149 times faster than a state-of-the-art statistical buffer insertion method [15].
[...] [6], one can always transform the set of correlated random variables into a set of independent ones.
C. Problem Formulation
The input to our problem is a buffer library B, a set Blk of rectangular buffer blockages, a set of process parameters with variations (i.e., $R_w$, $C_w$, $R_b$, $C_b$, $D_b$), and a routed topology of a net, modeled as a routing tree T = (V, E) that may pass through buffer blockages. The problem is to estimate the worst path delay distribution of the net under optimal buffering without actually performing any buffer insertion.
D. Useful Properties in Statistics
Here we present three sets of properties that are used to derive our equations in the next section.
Property 1. We are given two vectors $\theta_1$ and $\theta_2$ in $\mathbb{R}^n$. tr(·) is the trace operation of a square matrix and equals the sum of the diagonal elements of the matrix. We have:
$$\operatorname{tr}(\theta_1 \theta_2^T) = \theta_1^T \theta_2$$
Property 2. Let X and Y be random variables and k be a scalar. E(·) is the expected value of a random variable. We have:
Property 3. We are given a random vector X in $\mathbb{R}^n$ whose elements are mutually independent random variables with the standard Gaussian distribution. E(·) is the expected value of a random variable. For any vector $\theta$ in $\mathbb{R}^n$ and any square matrix A in $\mathbb{R}^{n \times n}$, we have:
The proofs of Property 3(b)(c)(d) can be found in [16], and these properties have also been adopted in [15].
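As a reminder, and only as a sketch, the following standard facts hold under the assumptions of Properties 2 and 3 (X has mutually independent standard Gaussian entries; for the last line A is additionally assumed symmetric). Whether they coincide item by item with the lettered properties of the original is an assumption.

```latex
\begin{align*}
E(X + Y) &= E(X) + E(Y), \qquad E(kX) = k\,E(X) \\
E(\theta^T X) &= 0 \\
E\big((\theta^T X)^2\big) &= \theta^T \theta \\
E(X^T A X) &= \operatorname{tr}(A) \\
E\big(\theta^T X \cdot X^T A X\big) &= 0 \\
E\big((X^T A X)^2\big) &= \operatorname{tr}(A)^2 + 2\operatorname{tr}(A^2) \qquad (A = A^T)
\end{align*}
```

The fourth line is what turns the mean of a quadratic form into a trace, and the last line gives its variance as $2\operatorname{tr}(A^2)$, which is the kind of quantity needed when a quadratic expression is matched to a first-order form.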
III. Buffered Delay Estimation
A. Deterministic Buffered Delay Estimation
The linear-time estimation technique proposed in [2] has been shown to produce only a few percent error when compared to a buffer insertion method [4]. However, we observed that ignoring the intrinsic buffer delay (as is the case in [2]) may lead to 26.42% average error in practice. Therefore, in the following review of the method in [2], we incorporate the intrinsic buffer delay into the estimation and denote the modified version as DBDE (Deterministic Buffered Delay Estimation).
To predict interconnect delays for multi-fanout nets in the presence of blockages quickly and simply, DBDE relies on the following key assumptions, which impose only a small, tolerable error when compared to an actual buffer insertion solution. The reasons for making these assumptions, as listed in Table I, will be explained below. DBDE's main idea is to compute the worst path delay at the source in a single bottom-up tree traversal by decomposing the tree edges into out-blockage and in-blockage ones. Whenever an edge intersects the boundary of a blockage, the edge is broken into two edges. As a result, each edge lies either completely inside blockages (an in-blockage edge) or completely outside them (an out-blockage edge). Similarly, a tree node that lies inside a blockage is defined as an in-blockage node; otherwise, it is defined as an out-blockage one. Note that a tree node that lies on the boundary of a blockage is identified as an out-blockage node.
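The edge decomposition described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it assumes axis-aligned tree edges, represents a blockage as a hypothetical tuple `(xlo, ylo, xhi, yhi)`, and treats an edge that runs exactly along a blockage boundary as out-blockage, by analogy with the rule for boundary nodes. The name `split_edge` is made up for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xlo, ylo, xhi, yhi), hypothetical layout


def split_edge(p: Point, q: Point, blockages: List[Rect]):
    """Split the axis-aligned edge p-q at blockage boundaries.

    Returns (start, end, in_blockage) pieces ordered from p to q.
    """
    (x0, y0), (x1, y1) = p, q
    horizontal = (y0 == y1)
    if not horizontal and x0 != x1:
        raise ValueError("only axis-aligned edges are handled in this sketch")
    # a..b is the coordinate that varies along the edge; 'fixed' is the other one.
    a, b = (x0, x1) if horizontal else (y0, y1)
    fixed = y0 if horizontal else x0
    lo, hi = min(a, b), max(a, b)
    cuts = {lo, hi}
    spans = []  # blocked intervals along the edge
    for (xlo, ylo, xhi, yhi) in blockages:
        f_lo, f_hi = (ylo, yhi) if horizontal else (xlo, xhi)
        v_lo, v_hi = (xlo, xhi) if horizontal else (ylo, yhi)
        # Assumption: an edge lying on the boundary counts as out-blockage,
        # so the fixed coordinate must be strictly inside the rectangle.
        if f_lo < fixed < f_hi:
            s, e = max(lo, v_lo), min(hi, v_hi)
            if s < e:
                spans.append((s, e))
                cuts.update((s, e))
    # Walk the cut points in the direction from p to q.
    pts = sorted(cuts, reverse=(a > b))
    pieces = []
    for s, e in zip(pts, pts[1:]):
        mid = (s + e) / 2
        inside = any(bs < mid < be for bs, be in spans)
        start = (s, fixed) if horizontal else (fixed, s)
        end = (e, fixed) if horizontal else (fixed, e)
        pieces.append((start, end, inside))
    return pieces
```

For example, a horizontal edge from (0, 5) to (10, 5) crossing a blockage that spans x from 2 to 6 is broken into an out-blockage piece, an in-blockage piece, and another out-blockage piece, so every resulting piece lies entirely inside or entirely outside the blockage.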
The delay of an out-blockage edge is determined by a closed-form formula as shown in Equation (2) where Le is the wirelength of the out-blockage edge. Note that the intrinsic buffer delay has been incorporated into Equation (2) .
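As a sketch of where this closed form comes from, one can minimize the Elmore delay per unit length of a single buffered stage. The derivation below is a standard argument and not taken from the original text; it assumes identical buffers evenly spaced at distance $L$ on a two-pin wire, with $R_w$, $C_w$ read as wire resistance and capacitance per unit length and $R_b$, $C_b$, $D_b$ as the buffer output resistance, input capacitance, and intrinsic delay.

```latex
\begin{align*}
D_{\text{stage}}(L) &= D_b + R_b\,(C_w L + C_b) + R_w L\left(\tfrac{1}{2} C_w L + C_b\right) \\
\frac{D_{\text{stage}}(L)}{L} &= \frac{R_b C_b + D_b}{L} + R_b C_w + R_w C_b + \tfrac{1}{2} R_w C_w L \\
\frac{d}{dL}\,\frac{D_{\text{stage}}(L)}{L} = 0
  &\;\Longrightarrow\; L_{opt} = \sqrt{\frac{2\,(R_b C_b + D_b)}{R_w C_w}} \\
\frac{D_{\text{stage}}(L_{opt})}{L_{opt}} &= R_w C_b + R_b C_w + \sqrt{2 R_w C_w (R_b C_b + D_b)}
\end{align*}
```

Multiplying the last line by the edge wirelength gives a delay that is linear in the wirelength, consistent with Equation (2), and the third line is the optimal buffer spacing.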
$$D(e) = L_e \left( R_w C_b + R_b C_w + \sqrt{2 R_w C_w (R_b C_b + D_b)} \right) \qquad (2)$$
One can see that the delay is a linear function of the edge wirelength because the Elmore delay becomes linear in the wirelength after optimal buffering is performed. Note that the more buffer types there are in the library, the more precise the single-buffer-type approximation becomes. Consequently, Equation (2) uses only a single buffer type in predicting the optimal buffered delay, which corresponds to assumption (a). Additionally, Equation (2) is derived for a two-pin net, and thus assumption (b) is needed for a multi-fanout net. In the derivation of Equation (2), one can also derive the optimal spacing $L_{opt}$ between buffers. We have
$$L_{opt} = \sqrt{\frac{2 (R_b C_b + D_b)}{R_w C_w}} \qquad (3)$$
As for the delay of an in-blockage edge, there are two scenarios to be considered.
1) For an edge whose wirelength $L_e$ is smaller than the optimal spacing $L_{opt}$, we can treat the edge as an out-blockage one and use Equation (2) for delay estimation, because buffers can be placed (e.g., the two buffers placed at the front and back of blockage b1 in Figure 2(a)) or potentially sized (e.g., the two buffers placed at the front and back of blockage b2 are downsized in Figure 2(a)) in such a way as to avoid the blockage, and doing so incurs only a negligible delay penalty. This scenario corresponds to assumption (c).
2) For an edge whose wirelength $L_e$ is larger than or equal to the optimal spacing $L_{opt}$, the buffers are best placed right before and right after the blockage (e.g., the two buffers placed at the front and back of blockage b3 in Figure 2(a)) to minimize the quadratic effect of delay; this corresponds to assumption (d).
Based on these assumptions, the original configuration of buffer insertion with multiple buffer types, as depicted in Figure 2(a), can be translated to a simpler one, as depicted in Figure 2.
The pseudo code of DBDE is illustrated in Figure 3, which is adopted from [2] and has been slightly modified for more accurate delay calculation. The modifications are as follows. First, rather than merely recording the in-blockage wire capacitance, the algorithm records the entire downstream capacitance in the variable c(v). In addition, we reset c(v) to the buffer input capacitance when entering or exiting blockages according to assumption (d), as shown in lines 11 and 17 in Figure 3. Finally, we incorporate the intrinsic buffer delay in both the linear and Elmore delay calculations, corresponding to lines 8 and 12 in Figure 3. Note that the buffer intrinsic delay should be subtracted if the source is located outside a blockage, as shown in line 20 in Figure 3. As noted by [2] [...]
If the edge (u, v) lies in a blockage, we use the same idea as in [6] to modify the Elmore delay in line 12 in Figure 3 for process variations as follows:
tching D 2 RwoCwoRbO wO'
$$d(v) = (d_0 + \operatorname{tr}(Q)) + [\ldots]\,X \qquad (11)$$
By using a technique similar to the one discussed above, we can apply the same approximation to all random variables that are not in first-order canonical form. Almost all the operations listed in Figure 3 can be modified in this way to consider process variations, except the computation of the linear delay, which corresponds to line 8 in Figure 3. The calculation of the linear delay needs a more advanced treatment because of the square-root operation involved in calculating a. The linear delay per unit wirelength, a, in statistical terms can be expressed as follows: [...] where
$$f(X) = \sqrt{A + BX + CX^2 + DX^3 + EX^4}$$
$$A = 2 R_{w0} C_{w0} (R_{b0} C_{b0} + D_{b0}) \qquad (12)$$
We want a power series expansion of f(X) as in (13), so as to integrate the expansion with the remaining parts of a.
$$f(X) = a_0 + a_1 X + a_2 X^2 + \cdots \qquad (13)$$
Here we apply the Taylor series expansion to f(X) about X = 0 and truncate it at the second order. We have
$$f(X) \approx [\ldots] \qquad (14)$$
The final step is to apply the moment matching discussed above to a and derive its corresponding first-order canonical form as in (11).
As for the maximum operation in line 14, we appeal to the concept of tightness probability in [7], also called the binding probability in [17, 18]. Given two random variables P and Q in first-order canonical form as follows: [...]
The tightness probability $T_{PQ}$, which represents the probability of P being larger than Q, is computed by
$$T_{PQ} = \Phi\left(\frac{P_0 - Q_0}{\theta}\right)$$
where $\theta = \sqrt{\sigma_P^2 + \sigma_Q^2 - 2\rho\sigma_P\sigma_Q}$, $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution, and $\rho$ is the correlation coefficient of P and Q. The values of $\Phi$ have been tabulated and can be found in [19]. After computing the mean and variance via the moment generating function provided in [20], the first-order canonical form of max(P, Q) can be expressed as
$$\max(P, Q) = T_{PQ} P_0 + T_{QP} Q_0 + \theta\,\phi\left(\frac{P_0 - Q_0}{\theta}\right) + (T_{PQ}\alpha_P + T_{QP}\alpha_Q)^T X \qquad (19)$$
where $\phi$ is the probability density function (PDF) of the standard normal distribution, $P_0$ and $Q_0$ are the means of P and Q, and $\alpha_P$ and $\alpha_Q$ are their coefficient vectors.
The closed form of the operation in line 16 in Figure 3 is as in (11).
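The maximum operation via tightness probability can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a canonical form is represented as a hypothetical pair `(mean, sensitivities)`, where the sensitivities multiply mutually independent standard Gaussian sources, so variances and the covariance are plain dot products. The function name `canon_max` is made up for this example, and the sensitivities are blended exactly as in Equation (19), without any further variance rescaling.

```python
import math
from statistics import NormalDist
from typing import List, Tuple

# Hypothetical representation: (mean, sensitivities to independent N(0, 1) sources)
Canon = Tuple[float, List[float]]


def canon_max(p: Canon, q: Canon) -> Canon:
    """First-order canonical form of max(P, Q) via tightness probability."""
    p0, ap = p
    q0, aq = q
    var_p = sum(a * a for a in ap)
    var_q = sum(b * b for b in aq)
    cov = sum(a * b for a, b in zip(ap, aq))  # equals rho * sigma_P * sigma_Q
    theta = math.sqrt(max(var_p + var_q - 2.0 * cov, 0.0))
    if theta == 0.0:
        # P - Q is deterministic, so the form with the larger mean is the max.
        return p if p0 >= q0 else q
    std = NormalDist()  # standard normal: cdf is Phi, pdf is phi
    z = (p0 - q0) / theta
    t_pq = std.cdf(z)   # probability that P is larger than Q
    t_qp = 1.0 - t_pq
    mean = t_pq * p0 + t_qp * q0 + theta * std.pdf(z)
    sens = [t_pq * a + t_qp * b for a, b in zip(ap, aq)]
    return mean, sens
```

When P and Q have equal means and independent unit-variance sources, both tightness probabilities are 0.5 and the mean of the maximum exceeds the common mean, which is the statistical counterpart of taking a plain maximum of two delays in the deterministic algorithm.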
IV. Experimental Results
We refer to our variation-aware estimation algorithm as Statistical Buffered Delay Estimation (SBDE) and to the deterministic algorithm modified from [2], shown in Figure 3, as Deterministic Buffered Delay Estimation (DBDE). We also implemented the variation-aware buffer insertion algorithm in [15], called Statistical Buffer Insertion (SBI), to verify the accuracy of our SBDE algorithm. All the algorithms were implemented in C++ and compiled with g++ version 3.4.2 on a Linux x86_64 machine with a 2G processor and 4GB RAM.
We ran the three algorithms (DBDE, SBDE, SBI) on two sets of testcases reported in [2]. One set of testcases includes eight groups of randomly generated nets. We summarize their results in Table II. [...] Table II and Table III summarize the results of each algorithm. The columns "#sink", "wirelength", and "%blk" give the number of sinks, the wirelength, and the percentage of the net that is blocked. For each algorithm (DBDE, SBDE, SBI), the corresponding columns "delay", "delay-sd", "#buf", and "runtime" give the mean delay, the standard deviation of delay, the number of buffers inserted, and the CPU time, respectively. We report the comparison of SBDE with SBI in the last three columns, in which the "delay" and "delay-sd" of SBI are normalized to 100% and the "runtime" of SBDE is normalized to 1.
From Tables II and III [...] (ranging from 99.99% to 90%) from the results reported by SBI to get different timing constraints. Note that the timing constraint becomes more relaxed as the timing yield increases. In Figure 5, the X-axis gives the timing yield and the Y-axis shows the number of testcases passing the given timing constraint with a probability no less than the corresponding timing yield. Undoubtedly, all testcases estimated by SBDE can pass the given timing constraints.
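The yield check described above can be sketched as follows. This is an illustration only: it assumes each delay distribution is Gaussian with a known mean and positive standard deviation (as a first-order canonical form implies), and the function names are hypothetical rather than taken from the implementation evaluated here.

```python
from statistics import NormalDist


def constraint_for_yield(mean: float, sd: float, target_yield: float) -> float:
    """Delay bound that a Gaussian delay (sd > 0) meets with probability target_yield."""
    return NormalDist(mean, sd).inv_cdf(target_yield)


def timing_yield(mean: float, sd: float, constraint: float) -> float:
    """Probability that a Gaussian delay (sd > 0) does not exceed the constraint."""
    return NormalDist(mean, sd).cdf(constraint)


def passes(est_mean: float, est_sd: float,
           ref_mean: float, ref_sd: float, target_yield: float) -> bool:
    """Check an estimated distribution against a reference one.

    The constraint is taken from the reference distribution at target_yield;
    the estimate passes if it meets that constraint with probability at least
    target_yield.
    """
    bound = constraint_for_yield(ref_mean, ref_sd, target_yield)
    return timing_yield(est_mean, est_sd, bound) >= target_yield
```

Here the reference distribution plays the role of the SBI result and the estimated distribution the role of the SBDE result. A deterministic corner estimate, by contrast, is a single number that either falls below the bound or does not, which is why it cannot report the probability of meeting the constraint.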
V. Conclusions
In this paper, we have derived approximated first-order canonical forms for buffered delay estimation that consider the effect of process variations and the presence of buffer blockages. We have empirically shown that deterministic buffered delay estimation at the worst-case corner, i.e., μ+3σ, is over-pessimistic.
