Abstract: Academic clock routing research has often had limited impact on industry practice, since such practical considerations as hierarchical buffering, rise-time and overshoot constraints, obstacle- and legal-location checking, varying layer parasitics and congestion, and even the underlying design flow are often ignored. This paper explores directions in which traditional formulations can be extended so that the resulting algorithms are more useful in production design environments. Specifically, the following issues are addressed: (i) clock routing for varying layer parasitics with nonzero via parasitics; (ii) obstacle-avoidance clock routing; (iii) a new topology design rule for prescribed-delay clock routing; and (iv) predictive modeling of the clock routing itself. We develop new theoretical analyses and heuristics, and present experimental results that validate our new approaches.
The edge between a parent node $p$ and its child $v$ may be identified with the child node, i.e., we denote this edge as $e_v$. If $d_i$ denotes the signal delay from clock source $s_0$ to sink $s_i$, then the skew of clock tree $T$ is given by $\mathrm{skew}(T) = \max_{s_i, s_j \in S} |d_i - d_j|$. The BST problem is formally stated as follows.
Minimum-Cost Bounded-Skew Routing Tree (BST) Problem:
Given a set $S = \{s_1, \ldots, s_n\} \subset \Re^2$ of sink locations and a skew bound $B$, find a routing topology $G$ and a minimum-cost clock tree $T_G(S)$ that satisfies $\mathrm{skew}(T_G(S)) \le B$.
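The skew definition and the feasibility test of the BST problem can be illustrated with a minimal sketch. The function names and the sample delays are assumptions made for illustration only; the delays would in practice come from a delay model such as Elmore delay.

```python
def skew(delays):
    """Skew of a clock tree: the largest |d_i - d_j| over all sink pairs.

    For a finite list of source-to-sink delays this equals max - min.
    """
    return max(delays) - min(delays)


def skew_pairwise(delays):
    """Same quantity, written directly from the pairwise definition."""
    return max(abs(a - b) for a in delays for b in delays)


def satisfies_bound(delays, bound):
    """True if the tree with these sink delays meets the skew bound B."""
    return skew(delays) <= bound


# Hypothetical source-to-sink delays (arbitrary time units).
sample_delays = [3.0, 5.5, 4.0]
```

With these sample delays the skew is 2.5, so the tree is feasible for B = 2.5 but not for B = 2.0.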
The BST problem has been previously addressed in [12, 4, 3]. The basic Extended DME (Ex-DME) approach extends the DME algorithm [2, 5] via the concept of a merging region, which is a set of embedding points with feasible skew and minimum merging cost if no detour wiring occurs. For a fixed tree topology, Ex-DME follows the 2-phase approach of the DME algorithm in constructing a bounded-skew tree: (i) a bottom-up phase to construct a binary tree of merging regions which represent the loci of possible embedding points of the internal nodes, and (ii) a top-down phase to determine the exact locations of the internal nodes. We now review necessary concepts from [4, 12, 3].
For a node $v \in G$ with children $a$ and $b$, its merging region, denoted mr(v), is constructed from the so-called "joining segments" $L_a \subseteq \mathit{mr}(a)$ and $L_b \subseteq \mathit{mr}(b)$, which are the closest boundary segments of mr(a) and mr(b). In practice, $L_a$ and $L_b$ are either a pair of parallel Manhattan arcs (i.e., segments with possibly zero length having slope $+1$ or $-1$) or a pair of parallel rectilinear segments (i.e., horizontal or vertical line segments). The set of points with minimum sum of distances to $L_a$ and $L_b$ form a Shortest Distance Region $SDR(L_a, L_b)$, where the points with skew $\le B$ (i.e., feasible skew) in turn form the merging region mr(v). It is observed in [3] that under Elmore delay each line segment $l = \overline{p_1 p_2} \in SDR(L_a, L_b)$ is well-behaved, in that the skew values along $l$ can be either a constant (when $L_a$ and $L_b$ are Manhattan arcs) or piecewise-linear decreasing, then constant, then piecewise-linear increasing along $l$. This important property enables the merging region $\mathit{mr}(v) \subseteq SDR(L_a, L_b)$ to be constructed in $O(n)$ time [3]. The resulting merging region is a convex polygon bounded by at most 2 Manhattan arcs and 2 horizontal/vertical segments when $L_a$ and $L_b$ are Manhattan arcs, or a convex polygon bounded by at most $4n$ segments (with arbitrary slopes), where $n$ is the number of sinks. Since each merging region is constructed from the closest boundary segments of its child regions, the method for constructing the merging region is called Boundary Merging and Embedding (BME). When the topology is not prescribed, [12] proposes the Extended Greedy-DME algorithm (ExG-DME), which combines merging region computation with topology generation, following the Greedy-DME approach of [6]. ExG-DME allows merging at non-root nodes, whereas Greedy-DME always merges two subtrees at their roots.
Non-Uniform Layer Parasitics
Consider the practical scenario where per-unit resistance and capacitance values differ between the V-layer (vertical routing layer) and H-layer (horizontal routing layer).¹ We first assume that vias have no resistance and capacitance, then extend our analysis to nonzero via parasitics. Let $v$ be a node in the topology with children $a$ and $b$, and let merging region mr(v) be constructed from joining segments $L_a \subseteq \mathit{mr}(a)$ and $L_b \subseteq \mathit{mr}(b)$. When both $L_a$ and $L_b$ are rectilinear segments or are two single points sitting on a vertical or horizontal line, only one routing layer is needed for merging mr(a) and mr(b). Thus, the original BME construction rules [3] still apply in these cases.
Corollary 1 below shows that for non-uniform layer parasitics, joining segments will never be Manhattan arcs of nonzero length. Thus we need consider only the possible modification of BME construction rules for the case where the joining segments are two single points not on the same horizontal or vertical line. In this case, both routing layers have to be used for merging mr(a) and mr(b). One problem with routing under non-uniform layer parasitics is that different routing patterns between two points will result in different delays, even if the wirelengths on both layers are the same. However, if we can prescribe the routing pattern for each edge of the clock tree, the ambiguity of delay values between two points can be avoided. Fig. 1 shows the two simplest routing patterns between two points, which we call the HV and VH routing patterns. Other routing patterns can be considered, but may result in more vias and more complicated computation of merging regions. [...] Notice that at the beginning of the construction, each node $v$ is a sink with mr(v) being a single point. Thus, no merging region can have boundary segments which are Manhattan arcs with constant delays, and we have

Corollary 1: For non-uniform layer parasitics, each pair of joining segments will be either (i) parallel rectilinear line segments or (ii) two single points.

¹ We assume that there are only two routing layers. Our approach easily extends to multiple routing layers.
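The dependence of delay on the routing pattern can be checked numerically with a small sketch. It assumes Elmore delay, zero via parasitics, a single lumped load at the far end, and illustrative parameter values; the function names are hypothetical. Under these assumptions the HV and VH delays differ by $h \cdot v \cdot (r_1 c_2 - r_2 c_1)$, which vanishes exactly when the two layers have the same RC ratio.

```python
import math


def wire_delay(r, c, length, load):
    """Elmore delay of a uniform wire (per-unit r, c) driving a lumped load."""
    return r * length * (c * length / 2.0 + load)


def hv_delay(h, v, r1, c1, r2, c2, load):
    """Horizontal segment first (H-layer r1, c1), then vertical (V-layer r2, c2)."""
    downstream = c2 * v + load  # the H segment also drives the whole V segment
    return wire_delay(r1, c1, h, downstream) + wire_delay(r2, c2, v, load)


def vh_delay(h, v, r1, c1, r2, c2, load):
    """Vertical segment first, then horizontal."""
    downstream = c1 * h + load  # the V segment also drives the whole H segment
    return wire_delay(r2, c2, v, downstream) + wire_delay(r1, c1, h, load)


# Illustrative values only (ohm and farad per unit length, farad load).
r1, c1 = 16.6e-3, 0.027e-15
r2, c2 = 3.0 * r1, 2.0 * c1
h, v, load = 1000.0, 500.0, 0.5e-12

hv = hv_delay(h, v, r1, c1, r2, c2, load)
vh = vh_delay(h, v, r1, c1, r2, c2, load)
hv_uniform = hv_delay(h, v, r1, c1, r1, c1, load)
vh_uniform = vh_delay(h, v, r1, c1, r1, c1, load)
```

For these values the HV pattern is faster than VH, while the two patterns coincide when both layers share the same parasitics. This is why a routing pattern must be prescribed per edge before delays at merging points are well-defined.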
From the observation in Fig. 1 [...]
Experiments and Discussion
Table 1 compares the total wirelength of routing solutions under non-uniform and uniform layer parasitics for standard test cases in the literature. Let $c_1, r_1$ and $c_2, r_2$ be the per-unit capacitance and per-unit resistance for the H-layer and V-layer, respectively. For the uniform layer parasitics, we set $c_1 = c_2 = 0.021$ fF and $r_1 = r_2 = 16.6$ mΩ. For the non-uniform layer parasitics, we set $c_2 = 2.0 \cdot c_1$ and $r_2 = 3.0 \cdot r_1$. For simplicity, we use only the HV routing pattern and ignore via resistance and capacitance.
We see that solutions under non-uniform layer parasitics average 2% more total wirelength than those under uniform layer parasitics. This may be due to merging regions under non-uniform layer parasitics being smaller (and thus having higher merging cost at the next higher level) since the joining segments cannot be Manhattan arcs of nonzero length. Note that when the skew bound is infinite, all joining segments are rectilinear, and thus the routing solutions under non-uniform and uniform layer parasitics have identical total wirelength. Separately, detailed experiments on benchmark r1 have compared the total wirelength of zero-skew routing for different ratios of $r_2/r_1$ and $c_2/c_1$. Even as $(r_2 c_2)/(r_1 c_1)$ changes from 1 to 10, the total wirelength of solutions only varies between +4% and -1% from that obtained for uniform layer parasitics (i.e., $(r_2 c_2)/(r_1 c_1) = 1$). Hence, our new BME method has routing costs that are insensitive to changes in the ratio of H-layer/V-layer RC values.
Routing in the Presence of Obstacles
This section proposes new merging region construction rules when there are obstacles in the routing plane. Without loss of generality, we assume that all obstacles are rectangular. We also assume that an obstacle occupies both the V-layer and H-layer.² We first present the analysis for uniform layer parasitics along with experimental results, then extend our method to non-uniform layer parasitics.
Analysis for Uniform Layer Parasitics
Given two merging regions mr(a) and mr(b), the merging region mr(v) of parent node $v$ is constructed from joining segments $L_a \subseteq \mathit{mr}(a)$ and $L_b \subseteq \mathit{mr}(b)$. Obviously, points $p \in \mathit{mr}(v)$ covered by an obstacle are not feasible merging points. Also, points $p, p' \in$ [...]

In Cases I and II an expanded obstacle $O$ can intersect with another obstacle, which must then also expand in the same direction via a sort of "chain reaction". With these obstacle expansion rules, we construct pmr(v) from child regions mr(a) and mr(b) as follows.³

Apply the obstacle expansion rules to expand obstacles, and [...]
If $\mathit{pmr}(v) \neq \emptyset$ then stop; otherwise, restore the sizes of all the expanded obstacles and continue with the next step.
Compute the shortest planar path $P = s \leadsto t$, where $s \in \mathit{mr}(a)$ and $t \in \mathit{mr}(b)$.
Divide path $P$ into [...]

Notice that the purpose of Step 4 is to maximize the total area of pmr(v). As shown in Fig. 3, if we divide subpath $P_2 = y \leadsto t$ into two smaller subpaths $y \leadsto z$ and $z \leadsto t$, region $\mathit{pmr}^*(v)$ in the figure will shrink to be within the shortest distance region $SDR(y, z)$. As we can see from Fig. 3, pmr(v) actually consists of several convex polygonal regions. So the number of regions per node may grow exponentially during the bottom-up construction of merging regions (this is the difficulty encountered by the IME [...]

[...] (Fig. 4(a)). These infeasible embedding points can be removed from $L_a$ by applying the obstacle expansion rules with $l(p)$ and $L_a$ being the joining segments (Fig. 4(b)). The remaining points of $L_a$ left uncovered by the expanded obstacles are the feasible embedding locations for $v$.

² If some obstacle occupies only one routing layer, then the pre-routed wires over the obstacle become the obstacles for the later routing. In other words, the routing over the obstacle has to be planar. Indeed, our obstacle-avoidance routing was originally applied to improve planar clock routing [16].

³ Strictly speaking, there can be joining segments with slopes other than $\pm 1$, $0$, and $\infty$, although they are not encountered in practice. For joining segments having slopes $m$ with $|m| > 1$ ($|m| < 1$), we expand obstacles as in Case III (IV).
Experimental Results
Our obstacle-avoiding BST routing algorithm was tested on four examples having 50, 100, 150 and 555 sinks, respectively, with uniformly random locations in a 100 by 100 layout region; all four examples have the same 40 randomly generated obstacles shown in Fig. 5. For comparison, we run the same algorithm on the same test cases without any obstacles. Details of the experiment are as follows. Parasitics are taken from MCNC benchmarks Primary1 and Primary2, i.e., all sinks have identical 0.5 pF loading capacitance and the per-unit wire resistance and wire capacitance are 16.6 mΩ and 0.027 fF. For each internal node, we maintain at most $k = 5$ merging regions with lowest tree cost. We use the procedure Find-Shortest-Planar-Path of the Elmore-Planar-DME algorithm [14] to find shortest planar $s$-$t$ paths. The current implementation uses Dijkstra's algorithm in the visibility graph $G(V, E)$ (e.g., [10]) where $V$ consists of the source and destination points $s$, $t$ along with detour points around [...]
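The shortest-path step can be sketched as Dijkstra's algorithm over a visibility graph with Manhattan edge lengths. This is an illustrative sketch and not the Find-Shortest-Planar-Path procedure of [14]: the visibility edges are assumed to be given as input (no visibility or planarity test is performed), and the single obstacle, its corner points and all names are hypothetical.

```python
import heapq


def manhattan(p, q):
    """Rectilinear distance between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def shortest_path(edges, source, target):
    """Dijkstra over an undirected graph whose edges are weighted by Manhattan length."""
    adj = {}
    for u, v in edges:
        w = manhattan(u, v)
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Hypothetical instance: one obstacle spanning x in [4, 6], y in [-2, 3];
# the detour points are its four corners.
s, t = (0, 0), (10, 0)
edges = [
    (s, (4, 3)), (s, (4, -2)),
    ((4, 3), (6, 3)), ((4, -2), (6, -2)),
    ((6, 3), t), ((6, -2), t),
]
length, path = shortest_path(edges, s, t)
```

In this instance the detour below the obstacle has length 14, against 16 for the detour above it, so the path passes through the two lower corners.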
Extension to Non-Uniform Layer Parasitics
When the layer parasitics are non-uniform, no joining segment can be a Manhattan arc, so Cases I.2 and II.2 of the obstacle expansion rules are inapplicable. In Cases III and IV, only one routing layer will be used to merge the child regions, so the construction of planar merging regions will be the same as with uniform layer parasitics. Hence, the construction of planar merging regions changes only for Cases I.1 and II.1, i.e., when the joining segments $L_a$ and $L_b$ are two single points not on the same vertical or horizontal line. We construct planar merging regions for Cases I.1 and II.1 as follows. First, we divide $SDR(L_a, L_b)$ into a set of disjoint rectangles $R_i$ that contain no obstacles, as shown in Fig. 6(a). Let $c \in R_i$ and $d \in R_i$ be the corner points closest to joining segments $L_a$ and $L_b$. If prescribed routing patterns are assumed for the shortest planar paths from $c$ to $L_a$ and from $d$ to $L_b$, delays at $c$ and $d$ are well-defined. Since there are no obstacles inside $R_i$, the planar merging region can be constructed from points $c$ and $d$ for non-uniform layer parasitics using the methods of Section 2. Since larger merging regions will result in smaller merging costs at the next higher level, we further maximize the size of the merging region constructed within each rectangle $R_i \subseteq SDR(L_a, L_b)$, by expanding $R_i$ as shown in Fig. 6(b). After expansion, "redundant" rectangles contained in the expansions of other rectangles (e.g., rectangles $R_2$ and $R_5$ in Fig. 6 are contained in the union of expansions of $R_1$, $R_3$, $R_4$, $R_6$ and $R_7$) can be removed to simplify the computation. The merging region construction for Cases I.1 and II.1 with non-uniform layer parasitics is summarized as follows.
1. Divide $SDR(L_a, L_b)$ into a set of disjoint rectangles $R_i$ by extending horizontal boundary segments of the obstacles in $SDR(L_a, L_b)$, and then expand each rectangle $R_i$ until blocked by obstacles.
2. Remove rectangles $R_i$ that are completely contained by other rectangles.
3. For each remaining rectangle $R_i$ do: Let $c \in R_i$ and $d \in R_i$ be the corner points which are closest to joining segments $L_a$ and $L_b$. Apply prescribed routing patterns from $c$ to $L_a$ and from $d$ to $L_b$. Calculate delays at $c$ and $d$, and construct the merging region from points $c$ and $d$ as described in Section 2.
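The pruning of redundant rectangles in the summary above can be sketched as follows. The representation of a rectangle as a tuple (x1, y1, x2, y2) and the function names are assumptions for illustration. The sketch removes only rectangles contained in a single other rectangle; detecting containment in a union of several rectangles, as in the Fig. 6 example, would need additional work.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies entirely within rectangle `outer`.

    Rectangles are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    """
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def prune_contained(rects):
    """Drop duplicates and any rectangle completely contained in another one."""
    unique = list(dict.fromkeys(rects))  # remove exact duplicates, keep order
    return [r for r in unique
            if not any(o != r and contains(o, r) for o in unique)]


# Hypothetical expanded rectangles.
expanded = [(0, 0, 10, 4), (2, 1, 5, 3), (0, 0, 10, 4), (8, 2, 12, 6)]
remaining = prune_contained(expanded)
```

Here the second rectangle is dropped because it lies inside the first, the duplicate is merged, and the last rectangle survives because it only partially overlaps the first.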
A New Prescribed-Delay Topology Rule
Prescribed-delay routing is motivated by hierarchical clock tree constructions used with clock gating, building-block design, and the general trend to lower fanouts in deep-submicron technologies. The prescribed-delay formulation is also useful in that it allows existing zero-skew routing algorithms to address the prescribed (local) skew problem, as follows. Let $\mathrm{skew}(i, j) = d_i - d_j$ denote the prescribed local skew for sinks $s_i$ and $s_j$. By rearranging skew constraint equations, we can express each sink delay relative to the delay of sink $s_1$, i.e., $d_i = d_1 + D_i$, with $D_1 = 0$. Let $D = \max_{i=1}^{n} D_i$. By adding a pseudo-delay element with delay $D - D_i$ to each sink $s_i$ and performing zero-skew routing, the resulting clock tree will satisfy the prescribed skew constraints after we remove the pseudo-delay elements.⁵
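The reduction from prescribed skew to zero-skew routing can be checked with a small numeric sketch. It assumes idealized pseudo-delay elements (pure delays whose removal does not perturb the rest of the tree) and a hypothetical common arrival time produced by the zero-skew router; the offsets below are illustrative and correspond to skew(1, 2) = -10 and skew(2, 3) = +50.

```python
def pseudo_delays(offsets):
    """Pseudo-delay D - D_i for each sink, where D = max_i D_i.

    offsets[i] is D_i, the prescribed delay of sink i relative to sink 1
    (so offsets[0] == 0). All returned values are non-negative.
    """
    d_max = max(offsets)
    return [d_max - d for d in offsets]


def sink_delays_after_removal(offsets, common_arrival):
    """Sink delays once the pseudo-delay elements are removed.

    Zero-skew routing equalizes (sink delay + pseudo-delay) at `common_arrival`,
    so each real sink delay is the common arrival minus its pseudo-delay.
    """
    return [common_arrival - p for p in pseudo_delays(offsets)]


# d_1 - d_2 = -10 gives D_2 = 10; d_2 - d_3 = +50 gives D_3 = -40.
offsets = [0, 10, -40]
padding = pseudo_delays(offsets)
delays = sink_delays_after_removal(offsets, common_arrival=100)
```

The resulting delays differ pairwise by exactly the prescribed amounts, independent of the common arrival time chosen by the zero-skew router.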
Note that the useful skew problem addressed in [8, 17] is a generalization of the prescribed skew problem, in that the useful skew specifies a "range" of allowed skew values (rather than an "exact" skew value) for each pair of sinks. (For example, the range can be $[-\infty, +\infty]$ if there is no skew constraint between a certain pair of sinks.) However, the prescribed skew formulation is still very useful for cases where the "negative skew" (or "signed skew") is desired [17]. With prescribed delays, the BST topology construction must take greater care with temporal (as opposed to spatial) compatibilities. While our goal is to minimize the total wirelength, ignoring the balance of subtree delays during the construction can lead to a great deal of detour wiring. Our studies of small instances show that our original BST merging rule, while generally quite effective, can yield topologies that have rank 752 (out of 945), with 65% cost suboptimality, even for instances as small as 6 sinks (see Table 3). We have studied a new topology construction rule that merges two subtrees so as to minimize $\alpha \times MC + (1 - \alpha) \times MD$, where $MC$ and $MD$ are respectively the total wirelength and maximum source-sink delay of the newly merged subtree. We study this new rule in the context of a meta-heuristic that takes the better of the tree costs according to the original and new merging rules, i.e., we attempt to address the difficult cases for the original rule, rather than find a completely new and general-purpose rule.⁶ The parameter $\alpha$ depends on technology and units, since it captures a tradeoff between wirelength and delay.
For our technology parameters, we found $\alpha = 0.67$ to be most effective; all our experimental data reflect this value.⁷ Table 3 shows that our metaheuristic substantially improves both average-case and worst-case suboptimality for small instances; Table 4 shows that even for larger instances the metaheuristic can offer large improvements over the original BST-DME topology construction.

⁵ For example, suppose we have $\mathrm{skew}(s_1, s_2) = -10$ and $\mathrm{skew}(s_2, s_3) = +50$ for [...]

⁶ Edahiro [7] proposed similar greedy min-wirelength and min-delay based topology generation heuristics in the context of wire sizing. Here, we in some sense extend the application of his two heuristics for uniform wire width. We find that neither heuristic is overly strong by itself, and that the combination of both is superior.

⁷ Our technology parameters are again taken from MCNC benchmarks; see Section 3.1. All experiments were for 1000 random instances with random sink locations in bounding box area = 500000 square units, skew bound $B$ uniformly random in [...]
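The weighted merging rule and the surrounding meta-heuristic can be sketched as follows. The candidate merges, their MC and MD values, and all function names are hypothetical; in a real flow MC and MD would be computed by the BST construction, and their units must be made commensurate before a given $\alpha$ is meaningful.

```python
def merge_score(mc, md, alpha):
    """Weighted objective alpha * MC + (1 - alpha) * MD for one candidate merge."""
    return alpha * mc + (1.0 - alpha) * md


def pick_merge(candidates, alpha):
    """Choose the candidate (name, MC, MD) with the smallest weighted score."""
    return min(candidates, key=lambda cand: merge_score(cand[1], cand[2], alpha))


def metaheuristic_cost(cost_original_rule, cost_new_rule):
    """Keep the better (cheaper) of the trees built under the two merging rules."""
    return min(cost_original_rule, cost_new_rule)


# Hypothetical candidate merges: (name, wirelength MC, max source-sink delay MD).
candidates = [("A", 100.0, 50.0), ("B", 90.0, 80.0)]
chosen_balanced = pick_merge(candidates, alpha=0.67)
chosen_wirelength_only = pick_merge(candidates, alpha=1.0)
```

With $\alpha = 0.67$ candidate A is preferred despite its larger wirelength, because its subtree delay is much smaller; with $\alpha = 1$ the rule degenerates to pure wirelength minimization and picks B.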
Predictive Modeling
Finally, our work has explored the issue of predictive modeling. The combined effects of deep-submicron physics and constraint-dominated designs have led to a recent trend of "constructive estimation" in place of analytic or empirical estimation. Nevertheless, efficient design optimization will always require estimators that are less expensive than actual constructions. A case in point is the clustering objective for hierarchical buffered clock tree synthesis: how can such an objective capture the actual performance of the bounded-skew clock routing algorithm that will be invoked after the buffer hierarchy has been determined?
We have recently implemented a generic model-building capability in our group, and have applied it to model prescribed-delay BST routing cost. Our package uses a slight enhancement of the Levenberg-Marquardt global optimization iteration from Numerical Recipes in C [15], and for the present experiment we use a simple "sum of powers" model, i.e., $\mathrm{cost}(BST) = c_1 p_1^{e_1} + c_2 p_2^{e_2} + \cdots + c_k p_k^{e_k}$ for $k$ parameters, along with a simple bottom-up approach for variable identification. Table 5 shows that very reasonable model accuracy can be easily obtained. Furthermore, using even three inexpensive parameters (center-of-gravity star cost, skew bound $B$, and maximum prescribed delay max_delay) to supplement the traditional MST cost can significantly improve model accuracy over using the MST cost alone. While the BST construction is actually quite efficient, Table 5 also shows that accurate models can be found for optimal BST costs (which can be obtained only with exponential runtimes).
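To make the fitting step concrete, the following is a minimal sketch of a Levenberg-Marquardt iteration for a single-term power model $c \cdot p^{e}$. It is a hand-written illustration, not the Numerical Recipes routine or the authors' package: restricting to one term (so the damped normal equations are 2 by 2), the synthetic data, the starting point and all names are assumptions.

```python
import math


def model(params, p):
    """Single-term power model c * p**e (requires p > 0)."""
    c, e = params
    return c * p ** e


def sse(params, data):
    """Sum of squared residuals over (p, y) samples."""
    return sum((y - model(params, p)) ** 2 for p, y in data)


def lm_fit(data, params, lam=1e-3, iters=100):
    """Levenberg-Marquardt with Marquardt's diagonal scaling; returns (params, sse)."""
    c, e = params
    best = sse((c, e), data)
    for _ in range(iters):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for p, y in data:
            pe = p ** e
            r = y - c * pe                      # residual
            j1, j2 = pe, c * pe * math.log(p)   # d(model)/dc, d(model)/de
            a11 += j1 * j1
            a12 += j1 * j2
            a22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        d11, d22 = a11 * (1.0 + lam), a22 * (1.0 + lam)  # damped diagonal
        det = d11 * d22 - a12 * a12
        if det == 0.0:
            break
        dc = (g1 * d22 - a12 * g2) / det        # Cramer's rule for the 2x2 system
        de = (d11 * g2 - a12 * g1) / det
        trial = (c + dc, e + de)
        trial_sse = sse(trial, data)
        if trial_sse < best:                    # accept step, trust the model more
            c, e = trial
            best = trial_sse
            lam /= 10.0
        else:                                   # reject step, increase damping
            lam *= 10.0
    return (c, e), best


# Synthetic samples generated from c = 2, e = 1.5 (illustrative only).
data = [(float(p), 2.0 * float(p) ** 1.5) for p in range(1, 6)]
start = (1.0, 1.0)
fitted, final_sse = lm_fit(data, start)
```

The accept/reject logic guarantees that the squared error never increases between iterations. A full sum-of-powers model with $k$ terms would replace the 2 by 2 solve with a general linear solver over $2k$ unknowns.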
Conclusions
layer parasitics, nonzero via parasitics, and large obstacles on the clock distribution layers. We have also addressed hierarchical clock routing applications via a new prescribed-delay topology construction rule; our experiments indicate that this rule nicely complements the original ExG-DME topology rule for BST construction. The complementary nature of the two rules is particularly useful for small instances, since most clock subtrees are small in a buffered clock tree. Finally, we have proposed a predictive modeling methodology that can allow close integration of a given BST routing algorithm with a higher-level clock topology generation (sink and buffer clustering) tool. We are continuing to develop further practical clock routing extensions, while also pursuing integration within a commercial cell-based layout tool.
Acknowledgments
We are greatly indebted to Kenneth Yan for performing the experimental analyses of Sections 4 and 5.
