Abstract-This work develops an analytic framework for clock tree analysis considering process variations that is shown to correspond well with Monte Carlo results. The analysis framework is used in a new algorithm that constructs deterministic nominal zero-skew clock trees that have reduced sensitivity to process variation. The new algorithm uses a sampling approach to perform route embedding during a bottom-up merging phase, but does not select the best embedding until the top-down phase. This results in clock trees that exhibit a mean skew reduction of 32.4% on average and a standard deviation reduction of 40.7% as verified by Monte Carlo. The average increase in total clock tree capacitance is less than 0.02%.
I. INTRODUCTION
Integrated circuit performance is largely determined by the efficiency and quality of synchronous clock distribution. A poor clock distribution can consume an inordinate amount of power, limit circuit performance or prevent correct functionality at all speeds. Many high performance designs use clock distribution grids to improve performance, but these consume significantly more power than traditional clocks. Symmetric clock trees such as the H-tree are robust, but they cannot be perfectly balanced since sink loads and locations are rarely uniform and sinks may have differing clock edge requirements.
Many other works have presented algorithms to minimize the deterministic skew of a clock tree during physical design [4, 6, 7, 8, 9] . These algorithms typically use wire jogs that balance the sink delays with minimal routing overhead. However, these works only consider that the delay to balance is the nominal value. Process variability of clock delay elements can have a significant impact on the skew in what is called process induced skew. Other works have started to address this issue by improving correlation among sinks [10] or tuning the clock tree after manufacturing [13] . However, reference [10] complicates circuit simulation with cyclic RC graphs and has the potential for DC power consumption. Reference [13] requires improved at-speed testing methodologies and increased testing time.
Nominal, deterministic clock skew can be defined as the largest difference between a fastest and slowest clock sink arrival time:
where AT i is the clock arrival time at sink i and S is the set of clock sinks. Statistical clock tree analysis adds the problem that different manufactured parts will have different slow and fast clock sinks depending on how the sources of variation manifest themselves. Therefore, analysis must be done in a probabilistic manner. Previous works have done this using discretized, numerical integration for only intra-die variation [2] or have focused on non-clock interconnect analysis using a first-order parameterized model [1, 3] . The most similar work [11] to that described in this paper uses a first-order parameterized model but uses heuristic worst-case bounding to minimize skew. This can be overly pessimistic. This work uses a first-order parameterized model to perform statistical skew analysis and construct robust clock trees. In Section II, we describe the sources of variation and how they are modeled in our experiments. In Section III, we propose an improved statistical analysis framework based on a firstorder parameterized Elmore delay computation. In Section IV, deterministic zero-skew clock tree construction is briefly reviewed. Then a new technique is presented for minimizing process-induced skew such that the clock trees remain zeroskew according to traditional deterministic analysis. Experimental results are presented in Section V and conclusions are made in Section VI.
II. PROCESS VARIABILITY
Process variation can be of the inter-die or intra-die type. Both types of variation present challenges to physical design. Inter-die variation is fully correlated for a given die, but separate dies can have different amounts of variation. This requires examining an exponential number corners since it is unclear which corner is the worst for interconnect delay. On the other hand, intra-die variation can lead to process spread within a single die. This work models both inter-and intra-die variation using correlated and independent sources. All variation is assumed to be Gaussian, but this is not a requirement since numerical integration can be used as shown in [2] . The nominal process values are from a generic 90nm process and the parameters vary according to [5] .
This work considers two physical parameters for metal variation: height (H) and width (W). The correlation of CMP variation is on the order of 3.5mm [12] and the tolerance for the range of interconnect density during the fill process is userdefined. Due to these considerations, the interconnect variation is assumed to be fully correlated in this paper. This does not exclude other more advanced models from being used.
A simple extraction model is used for wire resistance and capacitance. Unit resistance is calculated as
where ρ is the resistivity of aluminum (2.8 × 10 −8 Ω · m) and A is the cross sectional area. Capacitance estimation uses a parallel plate model for inter-layer and sidewall capacitance and a term for fringe effects. Unit capacitance can then be described as
where = 3.9 0 is the permitivity of SiO 2 , S is the space between wires, and T is the space between layers. The method can be easily extended with more accurate 3-D extraction models.
III. STATISTICAL ANALYSIS
Statistical timing analysis performs timing analysis of a circuit considering process variations. An Elmore delay model is used for interconnect modeling. In order to retain sensitivities to the sources of variation, a first-order parameterized form is used to represent all probabilistic quantities including delays, arrival times, skews, resistances and capacitances.
A. First-order Parameterized Form
The fundamental model used in this work is based on a weighted sum of Gaussian distributions which can be expressed as
where X r and X i 's are random variables normally distributed with mean zero and variance one, N (0, 1), that represent individual process parameters. The n random variables, X 1 to X n , represent process parameters that are globally shared among all gates or wires while the X r random variable represents local, independent variation. The magnitude of the variation is given by the absolute value of coefficients, a i or a r . a 0 is the nominal value. For clarity, a capital letter refers to either a parameterized form (e.g., A) or a random variable (e.g., X i ) throughout this paper. We now cover the basic manipulations of parameterized forms that are needed for analytic statistical timing analysis of clock trees. Addition/Subtraction -The addition operation is
and subtraction is similar. It should be noted that the root sumof-squares of the independent terms is used in both addition and subtraction.
Multiplication -The multiplication operation is
As can be seen, there are many second order terms. If the signals are not correlated, all cross terms are zero and can be dropped without loss of accuracy as done in [3] . However, in most situations, there will be high levels of correlation. For example, when calculating Elmore delay, the R and C corresponding to a given segment will have highly negative correlation and, therefore, the non-linear terms must be considered. The work in [1] proposed to linearize the squared terms (X 2 i ) into a mean and a normally distributed portion (X i ) such that the original first and second moments are preserved. However, they ignore the non-squared terms (X i X j when i = j) and the terms with the random component (X r ). In addition, they adjust the individual sensitivities which can be misleading as they are larger than those in Monte Carlo simulations. This work considers all cross terms so the multiplication operation is more accurate than previous methods. The mean term for only the squared components is added to the mean and the other nonlinear components are linearized and added to the random sensitivity to prevent corrupting individual sensitivities. The product (C) is summarized with the following coefficients
where α and β are the linear approximation coefficients in X i X j ≈ α + βX r . As in VITA [1] , these coefficients are α = 1 and β = √ 2 for normal distributions. The results using the new procedure are more accurate than VITA, or dropping non-linear terms, in every test case examined.
Maximum/Minimum -The maximum (minimum) operation is calculated as a weighted combination of inputs while matching the analytically calculated first and second moments as in [14] . Fast Monte Carlo with linear regression can be substituted for wide maximum (minimum) operations or in situations where the accuracy is questionable.
B. Process Variation-Induced Skew
Statistical analysis of the clock tree starts by analytically computing the sensitivities of each resistive and capacitive element to obtain the parameterized forms, R and C, respectively. The maximum and minimum delay of each clock subtree and total capacitance are then computed in a bottom-up manner. Given that the addition, multiplication, maximum, and minimum operations were defined in the previous section, the maximum (D i ) and minimum (D i ) delay for a node, i, can be calculated in first-order parametric form using the Elmore delay metric and its fanout delays, j,
where R (i,j) is the resistance and C (i,j) is the capacitance from node i to node j. C j is the total subtree capacitance in firstorder parameterized form that is also calculated bottom-up as a sum of parameterized forms as in
At the root of the tree, the bottom-up traversal results in the maximum and minimum delays for all clock branches in the parameterized form. Taking the difference of the maximum and minimum delays when i is the root gives the statistical skew of the clock tree in parameterized form
To demonstrate the analytic method, zero-skew trees were generated using the approach in [4, 6, 8] . These trees all have zero skew according to nominal, deterministic Elmore delay. The proposed statistical analysis shows that the trees actually have poor skew tolerance to process variations. To verify the statistical analysis, the results are compared with Monte Carlo simulations in Table I . The analytic methodology predicts the mean skew value within 1.6% on average, but the standard deviation is systematically underestimated by an average of 45.3%. This is due to the non-Gaussian nature of the skew distribution. Since skew is always positive, the distribution is truncated at zero and is highly asymmetric in situations with near zero skew. The underestimation of the standard deviation may result in less accuracy, but it does not undermine the fidelity of the analytic analysis capability -it is still able to determine which tree is best. In addition, the mean and sensitivities of the maximum (D root ) and minimum (D root ) sink delay match Monte Carlo very well. The fidelity of the analytic analysis is demonstrated by Figure 1 which compares several deterministic zero-skew trees using both the analytic method and Monte Carlo. The relative ordering of the expected skew and standard deviation is identical. Similar results are seen in other test cases.
IV. CLOCK TREE CONSTRUCTION
A. Review of DME One method of constructing nominal zero-skew clock trees is the Deferred-Merge Embedding (DME) method that was independently proposed in [4, 6, 8] . The DME method requires as input the abstract topology of the clock tree to define the merge order of the sinks. The objective is to produce a clock tree routing such that all sinks have equal delay while consuming the minimum amount of wire routing to reduce clock power. To do this, it defers choosing where to merge two subtrees until the parent merging point is decided. This allows each zero-skew merging location to be selected such that the overall minimum wire length is used. Figure 2 shows an example with four sinks (A-D) during the bottom-up merging and top-down embedding phase of DME. During the bottomup phase, the sinks are merged according to the input topology. The merging determines the valid locations where the two subtrees can be joined such that the new subtree has zero skew and minimum wire length. The locus of points that define the area of the zero-skew merge is called the merging segment (MS).
With rectilinear routing and equal resistance and capacitance on the layers, the MS is an intersection of two Manhattan circles (diamonds). This intersection is always a line with slope of +1 or -1. In Figure 2 , sinks A and B are merged into MS1, sinks C and D are merged into MS2, and then MS1 and MS2 are merged into MS3. If the subtrees to be merged have vastly dissimilar delays, extra wire jogs may be required to achieve zero skew. After the bottom-up phase is completed, a single MS remains such as MS3 in Figure 2 . The root of the clock tree can be placed anywhere on this MS or the point on the MS closest to the desired root location. In a top-down manner, the location of each child MS that is closest to the parent location is selected. The actual wire route is determined or embedded at this stage. The procedure is repeated until the sinks are reached. The resulting tree has minimal wire length and zero skew in nominal process conditions.
B. Sampled Zero-Skew Trees
The method that is presented in the next section relies on a sample-based implementation of traditional DME. Instead of computing the intersection of arbitrary Manhattan arcs as in Figure 3a , a collection of merging points (MPs) is maintained that represents the MS. This set of merging points is referred to as a merging point set (MPS). To calculate a MPS, the traditional MS is computed for all closest pairs, like MP1 and MP2 in Figure 3b , and sampled. When there are more than two closest MPs, the union of all the resulting zero-skew MPs is taken and pruned. The pruning divides the area into a proximity grid of a user-specified resolution and each MP is allocated to the appropriate grid location as in Figure 3c . Any single MP in a grid can replace all other MP for nominal optimization since they each represent a zero-skew merging location with minimal wire length. The pruning maintains a linear number of samples for each MS to provide a variety of candidate merge locations for further merging.
Property 1 All sample-based trees have zero skew in nominal conditions.
Property 1 is induced from the fact that all MP are created from two nominal zero-skew MP. The two zero-skew MP may not be in the "correct" location according to traditional DME, but the new MP location will be adjusted appropriately to maintain zero skew. Therefore, the resulting tree may not have minimum power, but it is always zero skew. The grid size determines the sub-optimality in power. Fig. 3 . Example of (a) DME merging, (b) sampled DME closest pair merging, (c) sampled DME pruning, and (d) sampled DME merging with multiple embeddings. Sample Cap. Sample Time DME Cap. DME Time Fig. 4 . Resolution impact on optimality and run-time of sample-mode DME compared to traditional DME.
Property 2 As the grid resolution approaches zero, the sample-based method will obtain the same power zero-skew solution as traditional DME.
An empirical demonstration of Property 2 can be seen in Figure 4 . At the expense of a few seconds of CPU time, the power of the solution quickly approaches the optimal capacitance produced by traditional DME.
C. Reduced Statistical Skew (RSS) Trees
The proposed method for Reduced Statistical Skew (RSS) clock trees is similar to the well known DME approach, but incorporates the previous sample-based technique. The sampledbased technique allows the route embeddings to be stored during the bottom-up merging process. Each MP remembers from which MPs it was constructed so that it can be appropriately embedded in the top-down phase. The actual wires (embedded mergings) allow the process sensitivity to be considered during clock tree construction. In traditional DME, the embedding does not occur until the top-down phase. In the new method, MPs are embedded during the bottom-up phase, but the selection of which MP and corresponding embedding to use is delayed until the root location is known. The actual merging location is chosen according to the deterministic, nominal skew, so the trees generated with the new approach have zero skew in the nominal case, but also exhibit significantly reduced statistical skew.
Consider Figure 3d that shows the merging of MPS1 and MPS2 into MPS3. According to the sample method proposed in Subsection B, all closest pairs of MPs are merged to form a new MPS. Before the results are pruned, there may be several equally viable zero-skew embeddings for a given location. As an example, the dotted and dashed embeddings in Figure 3d are equally viable candidates. In deterministic nominal analysis, each of these solutions is identical and all but one can be dropped from consideration without degrading solution quality. In statistical analysis, however, this is not true because the solutions have different expected skews and process sensitivities. Keeping all embedded MPs in a grid is not viable since the number of MPs can be as much as m×n for parallel Manhattan arcs with m MPs in MPS1 and n MPs in MPS2. A few properties allow us to determine which criteria are most desirable in choosing which MP embedding to keep; a MP embedding can affect our clock tree skew through its capacitive load, its subtree delay, or its subtree skew.
Subtree Capacitance -With the Elmore delay model and correlated or independent parameters, all sampled DME subtree embeddings rooted at the same location have the same total capacitance with the same parametric sensitivities. This is true because each subtree uses the same (minimum) total amount of horizontal and vertical wire. In the Elmore delay model, a subtree capacitance is a simple sum of these capacitances as in (7) 1 . Since each MP that we wish to compare is in the same grid, they are physically close and have very similar parametric capacitances. The consequences of this are that the choice of an embedding does not significantly effect subsequent mergings through the load capacitance.
Subtree Delay -Typical subtree embeddings show that the delays of the minimum wire length, zero-skew subtrees tend to be highly correlated with similar means. Therefore, the delay does not provide a good differentiating metric.
Subtree Skew -The worst skew of a clock tree is determined by the worst skew of all subtrees. This can be proven by contradiction, but is omitted due to space limitations. The intuition provided by the statement is equally true for clock trees while considering process variation. In addition, since subtree capacitance and subtree delay are not viable for pruning, subtree skew is the most reasonable candidate for differentiating solutions.
Our optimization algorithm, RSS, retains the MP candidate (with embedding) in each grid location that exhibits the least expected skew during bottom-up merging. The merging locations during the bottom-up phase are calculated using deterministic nominal timing analysis. L-shape routes connect the MPs. The order of segments on an L-shape route has an 1 If a higher-order model of interconnect were used, resistive shielding would invalidate this property. However, clock trees tend to use upper layers of metal and wide wires which reduces the effect of resistive shielding. almost negligible effect on variability so segment order is randomly selected. During the bottom-up phase, the maximum and minimum delays are calculated at each MP as presented in Section III. The bottom-up phase results in a MPS where each MP corresponds to a fully embedded clock tree. The top-down phase of RSS selects MP with the minimum statistical skew tree and uses the corresponding minimal wire-length embedding to construct the tree.
V. EXPERIMENTAL RESULTS
The above algorithm was implemented in C++ and executed on a 3.0Ghz Xeon running Linux. The results of the RSS algorithm are compared with a nominal DME solution by running Monte Carlo simulations on the final trees. The Monte Carlo mean and standard deviation of the final clock tree skew is shown in Table II . The mean skew is reduced by 32.4% on average and the standard deviation is reduced by 40.7% on average. The pruning heuristics preserve the O(n) run-time behavior of the traditional DME algorithm, but the coefficient of the linear run-time is determined by the sample resolution. This tradeoff between run-time and and power optimality was selected to favor power optimality as can be seen by less than 0.02% sub-optimality in total tree capacitance on average. The sample resolution was chosen independently for each benchmark so that there are at least 500 grids in both the horizontal and vertical direction. The largest benchmark, r5, with 3101 sinks has only a 16.3% increase in run-time compared to nominal DME. Figure 5 compares two trees on a tiny benchmark to illustrate the differences between RSS and traditional DME. The top tree, created by DME, arbitrarily selects the MS endpoint as the clock tree root. This results in a tree that has high sensitivity to metal skew. One branch of the root is primarily on horizontal metal while the other branch is on vertical metal. Picking the mid-point would be a better choice, but not the best. In the bottom of Figure 5 , RSS picks a point that minimizes the process induced skew. For clarity, note that the vertical dimension is 2× the horizontal dimension in the figures.
VI. CONCLUSION
This work developed an analytic framework for clock tree analysis that has high fidelity with Monte Carlo simulations. In addition, the expected skew is very accurate with a typical error of only 1.6%. The analytic analysis framework was used in the implementation of a Reduced Statistical Skew (RSS) clock tree method that minimizes the statistical skew in nominal zero-skew clock trees. The new approach performs route embedding during the bottom-up phase, but does not select the best embedding until the top-down phase. This results in nominal zero-skew trees that exhibit significantly improved statistical properties. On a set of benchmarks, the mean skew was reduced by 32.4% on average and the standard deviation was reduced by 40.7% on average as verified by Monte Carlo. Runtime is preserved through an efficient proximity pruning technique that ensures O(n) run-time and remains efficient for even the largest benchmark. Less than 0.02% power was added on average due to the sub-optimality of the sampling technique. Fig. 5 . Traditional DME (top) and RSS DME (bottom) trees.
VII. ACKNOWLEDGMENTS

