Abstract: Academic clock routing research has often had limited impact on industry practice, since practical considerations such as hierarchical buffering, rise-time and overshoot constraints, obstacle- and legal-location checking, varying layer parasitics and congestion, and even the underlying design flow are often ignored. This paper explores directions in which traditional formulations can be extended so that the resulting algorithms are more useful in production design environments. Specifically, the following issues are addressed: (i) clock routing for varying layer parasitics with nonzero via parasitics; (ii) obstacle-avoidance clock routing; (iii) a new topology design rule for prescribed-delay clock routing; and (iv) predictive modeling of the clock routing itself. We develop new theoretical analyses and heuristics, and present experimental results that validate our new approaches.
The edge between a parent node $p$ and its child $v$ may be identified with the child node, i.e., we denote this edge as $e_v$. If $d_i$ denotes the signal delay from clock source $s_0$ to sink $s_i$, then the skew of clock tree $T$ is given by $skew(T) = \max_{s_i, s_j \in S} |d_i - d_j|$. The BST problem is formally stated as follows.
Minimum-Cost Bounded Skew Routing Tree (BST) Problem:
Given a set $S = \{s_1, \ldots, s_n\} \subset \mathbb{R}^2$ of sink locations and a skew bound $B$, find a routing topology $G$ and a minimum-cost clock tree $T_G(S)$ that satisfies $skew(T_G(S)) \le B$.
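The skew definition and the feasibility test $skew(T) \le B$ can be made concrete with a small sketch. It assumes the Elmore delay model (used later in the paper) and a tree given by child lists and edge lengths; the function names, the data layout, and the toy unit values are illustrative assumptions, not part of the original formulation.

```python
def elmore_sink_delays(children, length, sink_cap, r, c, root=0):
    """Return {sink: Elmore delay from root} for an RC tree.

    children[v] lists the children of node v; length[v] is the wirelength of
    edge e_v (parent -> v); sink_cap[v] is the load at sink v; r and c are
    per-unit wire resistance and capacitance (hypothetical toy units).
    """
    cap = {}  # cap[v]: total capacitance at or below node v

    def subtree_cap(v):
        total = sink_cap.get(v, 0.0)
        for u in children.get(v, []):
            total += c * length[u] + subtree_cap(u)
        cap[v] = total
        return total

    subtree_cap(root)
    delays = {}

    def walk(v, d):
        kids = children.get(v, [])
        if not kids:
            delays[v] = d  # v is a sink
        for u in kids:
            wire_r, wire_c = r * length[u], c * length[u]
            # Distributed wire: half of its own capacitance plus everything below it.
            walk(u, d + wire_r * (wire_c / 2.0 + cap[u]))

    walk(root, 0.0)
    return delays


def skew(delays):
    """max over sink pairs of |d_i - d_j| equals max delay minus min delay."""
    values = list(delays.values())
    return max(values) - min(values)


# Toy instance: source 0 drives sinks 1 and 2 over edges of length 10 and 20.
delays = elmore_sink_delays({0: [1, 2]}, {1: 10.0, 2: 20.0},
                            {1: 1.0, 2: 1.0}, r=1.0, c=1.0)
B = 200.0  # hypothetical skew bound
print(delays, skew(delays), skew(delays) <= B)
```

In this toy instance the two sink delays differ by 160 units, so the tree is feasible for the bound chosen here but would violate any bound below 160.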
The BST problem has been previously addressed in [12, 4, 3]. The basic Extended DME (Ex-DME) approach extends the DME algorithm [2, 5] via the concept of a merging region, which is a set of embedding points with feasible skew and minimum merging cost if no detour wiring occurs. For a fixed tree topology, Ex-DME follows the two-phase approach of the DME algorithm in constructing a bounded-skew tree: (i) a bottom-up phase that constructs a binary tree of merging regions, which represent the loci of possible embedding points of the internal nodes, and (ii) a top-down phase that determines the exact locations of the internal nodes. We now review necessary concepts from [4, 12, 3]. [...] Region $SDR(L_a, L_b)$, where the points with skew $\le B$ (i.e., feasible skew) in turn form the merging region $mr(v)$. It is observed in [3] that under Elmore delay each line segment $l = \overline{p_1 p_2} \subseteq SDR(L_a, L_b)$ is well-behaved, in that the skew values along $l$ are either constant (when $L_a$ and $L_b$ are Manhattan arcs) or piecewise-linear decreasing, then constant, then piecewise-linear increasing along $l$. This important property enables the merging region $mr(v) \subseteq SDR(L_a, L_b)$ to be constructed in $O(n)$ time [3]. The resulting merging region is a convex polygon bounded by at most 2 Manhattan arcs and 2 horizontal/vertical segments when $L_a$ and $L_b$ are Manhattan arcs, or otherwise a convex polygon bounded by at most $4n$ segments (with arbitrary slopes), where $n$ is the number of sinks.
Since each merging region is constructed from the closest boundary segments of its child regions, the method for constructing the merging region is called Boundary Merging and Embedding (BME). When the topology is not prescribed, [12] proposes the Extended Greedy-DME algorithm (ExG-DME), which combines merging region computation with topology generation, following the Greedy-DME approach of [6]. ExG-DME allows merging at non-root nodes, whereas Greedy-DME always merges two subtrees at their roots.
Non-Uniform Layer Parasitics
Consider the practical scenario where per-unit resistance and capacitance values differ between the V-layer (vertical routing layer) and H-layer (horizontal routing layer) (footnote 1). We first assume that vias have no resistance and capacitance, then extend our analysis to nonzero via parasitics. Let node $v$ be a node in the topology with children $a$ and $b$, and let merging region $mr(v)$ be constructed from joining segments $L_a \subseteq mr(a)$ and $L_b \subseteq mr(b)$. When both $L_a$ and $L_b$ are rectilinear segments or are two single points sitting on a vertical or horizontal line, only one routing layer is needed for merging $mr(a)$ and $mr(b)$. Thus, the original BME construction rules [3] still apply in these cases. Corollary 1 below shows that for non-uniform layer parasitics, joining segments will never be Manhattan arcs of nonzero length. Thus we need consider only the possible modification of BME construction rules for the case where the joining segments are two single points not on the same horizontal or vertical line. In this case, both routing layers have to be used for merging $mr(a)$ and $mr(b)$. One problem with routing under non-uniform layer parasitics is that different routing patterns between two points will result in different delays, even if the wirelengths on both layers are the same. However, if we can prescribe the routing pattern for each edge of the clock tree, the ambiguity of delay values between two points can be avoided. Fig. 1 shows the two simplest routing patterns between two points, which we call the HV and VH routing patterns. Other routing patterns can be considered, but may result in more vias and more complicated computation of merging regions. In [16], we prove the following theorems. Notice that at the beginning of the construction, each node $v$ is a [...]

Footnote 1: We assume that there are only two routing layers. Our approach easily extends to multiple routing layers.
Experiments and Discussion
Table 1 compares the total wirelength of routing solutions under non-uniform and uniform layer parasitics for standard test cases in the literature. Let $c_1$, $r_1$ and $c_2$, $r_2$ be the per-unit capacitance and per-unit resistance for the H-layer and V-layer, respectively. For the uniform layer parasitics, we set $c_1 = c_2 = 0.027$ fF and $r_1 = r_2 = 16.6$ mΩ. For the non-uniform layer parasitics, we set $c_2 = 2.0\,c_1$ and $r_2 = 3.0\,r_1$. For simplicity, we use only the HV routing pattern and ignore via resistance and capacitance.
We see that solutions under non-uniform layer parasitics average 2% more total wirelength than those under uniform layer parasitics. This may be due to merging regions under non-uniform layer parasitics being smaller (and thus having higher merging cost at the next higher level), since the joining segments cannot be Manhattan arcs of nonzero length. Note that when the skew bound is infinite, all joining segments are rectilinear, and thus the routing solutions under non-uniform and uniform layer parasitics have identical total wirelength. Separately, detailed experiments on benchmark r1 have compared the total wirelength of zero-skew routing for different ratios $r_2/r_1$ and $c_2/c_1$. Even as $r_2 c_2 / (r_1 c_1)$ changes from 1 to 10, the total wirelength of solutions varies only between +4% and -1% from that obtained for uniform layer parasitics (i.e., $r_2 c_2 / (r_1 c_1) = 1$). Hence, our new BME method has routing costs that are insensitive to changes in the ratio of H-layer/V-layer RC values.
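To see why the routing pattern must be prescribed under non-uniform layer parasitics, the following sketch evaluates the Elmore delay of the HV and VH patterns between two points. It uses the parasitic values quoted in the experiments and ignores vias, as the experiments do. The function name, the 0.5 pF load, the 1000-unit offsets, and the convention that the first letter names the segment nearest the driver are assumptions made for illustration.

```python
import math

R1, C1 = 16.6e-3, 0.027e-15   # H-layer: ohms and farads per unit length
LOAD = 0.5e-12                # assumed sink load (0.5 pF)


def l_route_delay(dx, dy, first, r_h, c_h, r_v, c_v, load=LOAD):
    """Elmore delay through a two-segment route that drives `load`.

    first="H" gives the HV pattern (horizontal segment nearest the driver),
    first="V" gives the VH pattern. Via parasitics are ignored.
    """
    h = (r_h * dx, c_h * dx)
    v = (r_v * dy, c_v * dy)
    (ra, ca), (rb, cb) = (h, v) if first == "H" else (v, h)
    # The first segment sees its own capacitance, the second segment, and the load.
    return ra * (ca / 2.0 + cb + load) + rb * (cb / 2.0 + load)


dx = dy = 1000.0

# Uniform parasitics: both patterns give the same delay.
hv_u = l_route_delay(dx, dy, "H", R1, C1, R1, C1)
vh_u = l_route_delay(dx, dy, "V", R1, C1, R1, C1)

# Non-uniform parasitics: c2 = 2.0 * c1 and r2 = 3.0 * r1 on the V-layer.
hv_n = l_route_delay(dx, dy, "H", R1, C1, 3.0 * R1, 2.0 * C1)
vh_n = l_route_delay(dx, dy, "V", R1, C1, 3.0 * R1, 2.0 * C1)

print(hv_u, vh_u, hv_n, vh_n)
```

Under these assumptions the difference between the two patterns is $dx \cdot dy \cdot (r_1 c_2 - r_2 c_1)$, which vanishes for uniform parasitics and is negative for the non-uniform values above, so the HV pattern is the faster of the two in this setting.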
Routing in the Presence of Obstacles
This section proposes new merging region construction rules when there are obstacles in the routing plane. Without loss of generality, we assume that all obstacles are rectangular. We also assume that an obstacle occupies both the V-layer and H-layer. We first present the analysis for uniform layer parasitics along with experimental results, then extend our method to non-uniform layer parasitics.
Analysis for Uniform Layer Parasitics
Given two merging regions $mr(a)$ and $mr(b)$, the merging region [...] Fig. 2(a)

Case III (expand as in Fig. 2(b)). Both joining segments are vertical segments, possibly of zero length.

In Cases I and II, an expanded obstacle $O$ can intersect another obstacle, which must then also expand in the same direction via a sort of "chain reaction". With these obstacle expansion rules, we construct $pmr(v)$ from child regions $mr(a)$ and $mr(b)$ as follows.

1. Apply the obstacle expansion rules to expand obstacles, and [...]

Notice that the purpose of Step 4 is to maximize the total area of $pmr(v)$. As shown in Fig. 3, if we divide subpath $P_2 = (y, z, t)$ into two smaller subpaths $(y, z)$ and $(z, t)$, region pmr [...] method of [3]) (footnote 4). Our current implementation simply keeps at most $k$ regions with lowest tree cost for each internal node.
Recall that in the top-down phase of Ex-DME each node $v$ is embedded at a point $q \in L_v$ closest to $l(p)$, where $p$ is the parent node of $v$ and $L_v \subseteq mr(v)$ is one of the joining segments used to construct $mr(p)$. However, when $L_v$ is a Manhattan arc and there are obstacles intersecting $SDR(l(p), L_v)$, some of the embedding points $q \in L_v$ closest to $l(p)$ may become infeasible because the shortest path from $q$ to $l(p)$ is blocked by some obstacle (Fig. 4(a)). These infeasible embedding points can be removed from $L_v$ by applying the obstacle expansion rules with $l(p)$ and $L_v$ being the joining segments (Fig. 4(b)). The remaining points of $L_v$ left uncovered by the expanded obstacles are the feasible embedding locations for $v$.
Experimental Results
Our obstacle-avoiding BST routing algorithm was tested on four examples having 50, 100, 150, and 555 sinks, respectively, with uniformly random locations in a 100 by 100 layout region; all four examples have the same 40 randomly generated obstacles shown in Fig. 5. For comparison, we run the same algorithm on the same test cases without any obstacles. Details of the experiment are as follows. Parasitics are taken from MCNC benchmarks Primary1 and Primary2, i.e., all sinks have identical 0.5 pF loading capacitance and the per-unit wire resistance and wire capacitance are 16.6 mΩ and 0.027 fF. For each internal node, we maintain at most $k = 5$ merging regions with lowest tree cost. We use the procedure Find-Shortest-Planar-Path of the Elmore-Planar-DME algorithm [14] to find shortest planar s-t paths. The current implementation uses Dijkstra's algorithm in the visibility graph $G(V, E)$ (e.g., [10]), where $V$ consists of the source and destination points $s$, $t$ along with detour points around the corners of obstacles. The weight $|e|$ of edge $e = (p, q) \in E$ is computed on the fly: if $e$ intersects any obstacle, then $|e| = \infty$; otherwise $|e| = d(p, q)$. The $\Omega(n^2)$ worst-case running time, where $n$ is the total number of vertices in the obstacle polygons, can be reduced to $O(n \log^2 n)$ using techniques in [11]. Table 2 shows that the wirelengths of routing solutions with obstacles are very close to those of routing solutions without obstacles (typically within a few percent). The higher runtimes (reported for a Sun 85 MHz Sparc-5) in the presence of obstacles are due to the current naive implementation of obstacle detection and path-finding.

Footnote 4: Moreover, it may be better to construct and maintain planar merging regions along several shortest planar paths, since the planar merging regions along the shortest planar path will not guarantee minimum tree cost at the next higher level, as stated in the Elmore-Planar-DME algorithm [14].
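The visibility-graph search described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes axis-aligned rectangular obstacles given as `(xmin, ymin, xmax, ymax)`, uses the four corners of each obstacle as detour points, and treats an edge as usable when at least one of its two L-shaped (HV or VH) routes avoids every obstacle interior. That edge test, and all function names, are assumptions introduced here.

```python
import heapq


def blocked(p, q, rect):
    """True if the axis-aligned segment p-q crosses the open interior of rect."""
    xmin, ymin, xmax, ymax = rect
    lo_x, hi_x = min(p[0], q[0]), max(p[0], q[0])
    lo_y, hi_y = min(p[1], q[1]), max(p[1], q[1])
    return lo_x < xmax and hi_x > xmin and lo_y < ymax and hi_y > ymin


def edge_weight(p, q, obstacles):
    """Manhattan distance d(p, q) if an HV or VH route is clear, else infinity."""
    for bend in ((q[0], p[1]), (p[0], q[1])):
        if not any(blocked(p, bend, o) or blocked(bend, q, o) for o in obstacles):
            return abs(p[0] - q[0]) + abs(p[1] - q[1])
    return float("inf")


def shortest_path_length(s, t, obstacles):
    """Dijkstra over s, t, and obstacle corners; edge weights computed on the fly."""
    nodes = {s, t}
    for xmin, ymin, xmax, ymax in obstacles:
        nodes |= {(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)}
    dist = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in nodes:
            if v == u:
                continue
            w = edge_weight(u, v, obstacles)
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")


print(shortest_path_length((0, 0), (10, 0), [(4, -2, 6, 2)]))
```

Because every pair of nodes is examined and each edge test scans all obstacles, this sketch is quadratic or worse in the number of corners, which mirrors the naive behavior the text attributes to the current implementation.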
Extension to Non-Uniform Layer Parasitics
When the layer parasitics are non-uniform, no joining segment can be a Manhattan arc, so Cases I.2 and II.2 of the obstacle expansion rules are inapplicable. In Cases III and IV, only one routing layer will be used to merge the child regions, so the construction of planar merging regions will be the same as with uniform layer parasitics. Hence, the construction of planar merging regions changes only for Cases I.1 and II.1, i.e., when the joining segments $L_a$ and $L_b$ are two single points not on the same vertical or horizontal line. We construct planar merging regions for Cases I.1 and II.1 as follows. First, we divide $SDR(L_a, L_b)$ into a set of disjoint rectangles $R_i$ that contain no obstacles, as shown in Fig. 6(a) [...] Fig. 6(b). After expansion, "redundant" rectangles contained in the expansions of other rectangles (e.g., rectangles $R_2$ and $R_5$ in Fig. 6 are contained in the union of expansions of $R_1$, $R_3$, $R_4$, $R_6$ and $R_7$) can be removed to simplify the computation. The merging region construction for Cases I.1 and II.1 with non-uniform layer parasitics is summarized as follows.
A New Prescribed-Delay Topology Rule
Prescribed-delay routing is motivated by hierarchical clock tree constructions used with clock gating, building-block design, and the general trend to lower fanouts in deep-submicron technologies. The prescribed-delay formulation is also useful in that it allows existing zero-skew routing algorithms to address the prescribed (local) skew problem, as follows. Let $skew(i, j) = d_i - d_j$ denote the prescribed local skew for sinks $s_i$ and $s_j$. By rearranging skew constraint equations, we can express each sink delay relative to the delay of sink $s_1$, i.e., [...]
By adding a pseudo-delay element with delay $D - D_i$ to each sink $s_i$ and performing zero-skew routing, the resulting clock tree will satisfy the prescribed skew constraints after we remove the pseudo-delay elements (footnote 5). Note that the useful skew problem addressed in [8, 17] generalizes the prescribed skew problem, in that the useful skew specifies a "range" of allowed skew values (rather than an "exact" skew value) for each pair of sinks. (For example, the range can be $(-\infty, \infty)$ if there is no skew constraint between a certain pair of sinks.) However, the prescribed skew formulation is still very useful for cases where "negative skew" (or "signed skew") is desired [17]. With prescribed delays, the BST topology construction must take greater care with temporal (as opposed to spatial) compatibilities. While our goal is to minimize the total wirelength, ignoring the balance of subtree delays during the construction can lead to a great deal of detour wiring. Our studies of small instances show that our original BST merging rule, while generally quite effective, can yield topologies that have rank 752 (out of 945), with 65% cost suboptimality, even for instances as small as 6 sinks (see Table 3). We have studied a new topology construction rule that merges two subtrees so as to minimize $\alpha \cdot MC + (1 - \alpha) \cdot MD$, where $MC$ and $MD$ are respectively the total wirelength and maximum source-sink delay of the newly merged subtree. We study this new rule in the context of a meta-heuristic that takes the better of the tree costs according to the original and new merging rules, i.e., we attempt to address the difficult cases for the original rule, rather than find a completely new and general-purpose rule (footnote 6). The parameter $\alpha$ depends on technology and units, since it captures a tradeoff between wirelength and delay. For our technology parameters, we found $\alpha = 0.67$ to be most effective; all our experimental data reflect this value.
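The weighted merging rule and the surrounding meta-heuristic can be summarized in a few lines. The sketch below is illustrative only: it assumes each candidate merge has already been evaluated to give its $MC$ and $MD$ values, and the function names and the candidate tuple layout are hypothetical.

```python
ALPHA = 0.67  # value the text reports as most effective for its technology parameters


def merge_score(mc, md, alpha=ALPHA):
    """Weighted objective alpha * MC + (1 - alpha) * MD for one candidate merge."""
    return alpha * mc + (1.0 - alpha) * md


def pick_merge(candidates, alpha=ALPHA):
    """candidates: iterable of (pair, MC, MD); return the pair with the lowest score."""
    return min(candidates, key=lambda cand: merge_score(cand[1], cand[2], alpha))[0]


def metaheuristic_cost(cost_original_rule, cost_new_rule):
    """Keep the better (lower-cost) of the trees built by the two merging rules."""
    return min(cost_original_rule, cost_new_rule)


candidates = [(("a", "b"), 100.0, 30.0), (("a", "c"), 90.0, 60.0)]
print(pick_merge(candidates))             # weighted rule
print(pick_merge(candidates, alpha=1.0))  # pure wirelength rule
```

With these made-up numbers the weighted rule prefers merging `a` with `b` despite its longer wire, because the alternative has a much larger maximum delay, while a pure wirelength rule ($\alpha = 1$) picks `a` with `c`. Since $MC$ and $MD$ carry different units, the useful value of $\alpha$ depends on how both are scaled, as the text notes.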
Table 3 (footnote 7) shows that our metaheuristic substantially improves both average-case and worst-case suboptimality for small instances; Table 4 shows that even for larger instances the metaheuristic can offer large improvements over the original BST-DME topology construction.

Footnote 5: For example, suppose we have $skew(s_1, s_2) = -10$ and $skew(s_2, s_3) = +50$ for a 3-sink clock net. Then we have [...] The pseudo-delay elements with delays 10, 0, and 50 are added to sinks $s_1$, $s_2$, and $s_3$, respectively.

Footnote 6: Edahiro [7] proposed similar greedy min-wirelength and min-delay based topology generation heuristics in the context of wire sizing. Here, we in some sense extend the application of his two heuristics for uniform wire width. We find that neither heuristic is overly strong by itself, and that the combination of both is superior.

Footnote 7: Our technology parameters are again taken from MCNC benchmarks; see Section 3.1. All experiments were for 1000 random instances with random sink locations in bounding box area = 500000 square units, skew bound $B$ uniformly random in $[1, 20]$ ps, and prescribed delays uniformly random in the range $[0, max\_delay]$, where max_delay is itself random in $[1, 15]$ ps. Our ongoing work studies the subtle relationships that unify sink placement, skew assignment, clock tree topology construction, and bounded-skew embedding.
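The pseudo-delay reduction in the 3-sink example can be checked mechanically. The sketch below propagates the prescribed skews outward from sink $s_1$ to obtain each sink delay relative to $s_1$, then assigns each sink the gap to the largest relative delay. It assumes that $D$ denotes the largest relative sink delay (which reproduces the values 10, 0, and 50 in the example) and that the constraints are consistent and connect all sinks; the function name and input format are hypothetical.

```python
from collections import defaultdict, deque


def pseudo_delays(n, skews):
    """Return pseudo-delay values for sinks 1..n.

    skews: list of (i, j, value), each meaning d_i - d_j = value.
    """
    adj = defaultdict(list)
    for i, j, value in skews:
        adj[i].append((j, -value))  # d_j = d_i - value
        adj[j].append((i, value))   # d_i = d_j + value

    rel = {1: 0.0}  # delay of each sink relative to sink 1
    queue = deque([1])
    while queue:
        u = queue.popleft()
        for v, offset in adj[u]:
            if v not in rel:
                rel[v] = rel[u] + offset
                queue.append(v)

    if len(rel) != n:
        raise ValueError("skew constraints do not connect all sinks")

    largest = max(rel.values())  # assumed meaning of D
    return [largest - rel[i] for i in range(1, n + 1)]


print(pseudo_delays(3, [(1, 2, -10), (2, 3, 50)]))  # [10.0, 0.0, 50.0]
```

Here the relative delays are 0, 10, and -40, so the sink with the largest prescribed delay ($s_2$) receives no pseudo-delay, and zero-skew routing of the padded sinks yields the prescribed skews once the padding is removed.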
Predictive Modeling
Finally, our work has explored the issue of predictive modeling. The combined effects of deep-submicron physics and constraint-dominated designs have led to a recent trend of "constructive estimation" in place of analytic or empirical estimation. Nevertheless, efficient design optimization will always require estimators that are less expensive than actual constructions. A case in point is the clustering objective for hierarchical buffered clock tree synthesis: how can such an objective capture the actual performance of the bounded-skew clock routing algorithm that will be invoked after the buffer hierarchy has been determined?
We have recently implemented a generic model-building capability in our group, and have applied it to model prescribed-delay BST routing cost. Our package uses a slight enhancement of the Levenberg-Marquardt global optimization iteration from Numerical Recipes in C [15], and for the present experiment we use a simple "sum of powers" model, i.e., $cost_{BST} = c_1$ [...] Table 5 shows that very reasonable model accuracy can be easily obtained. Furthermore, using even three inexpensive parameters (center-of-gravity star cost, skew bound $B$, and maximum prescribed delay max_delay) to supplement the traditional MST cost can significantly improve model accuracy over using the MST cost alone. While the BST construction is actually quite efficient, [...]

Table 5: Average and worst-case relative errors for fitted sum-of-powers prescribed-delay BST cost models, taken over 1000 trials. Default model is for original BST topology construction; *opt models are for optimal-cost topology construction. f(MSTcost) = model based on MST cost only. f(MSTcost++) = model based on MST cost, center-of-gravity star cost, skew bound $B$, and maximum sink delay.
Conclusions
We have extended the bounded-skew routing methodology to address a number of practical clock routing issues. Specifically, we have extended the BST-DME construction to handle non-uniform layer parasitics, nonzero via parasitics, and large obstacles on the clock distribution layers. We have also addressed hierarchical clock routing applications via a new prescribed-delay topology construction rule; our experiments indicate that this rule nicely complements the original ExG-DME topology rule for BST construction. The complementary nature of the two rules is particularly useful for small instances, since most clock subtrees are small in a buffered clock tree. Finally, we have proposed a predictive modeling methodology that can allow close integration of a given BST routing algorithm with a higher-level clock topology generation (sink and buffer clustering) tool. We are continuing to develop further practical clock routing extensions, while also pursuing integration within a commercial cell-based layout tool.
Acknowledgments
We are greatly indebted to Kenneth Yan for performing the experimental analyses of Sections 4 and 5.
