Abstract-Minimizing clock skew is important in the design of high performance VLSI systems. We present a general clock routing scheme that achieves very small clock skews while still using a reasonable amount of wirelength. Our routing solution is based on the construction of a binary tree using geometric matching. For cell-based designs, the total wirelength of our clock routing tree is on average within a constant factor of the wirelength in an optimal Steiner tree, and in the worst case is bounded by $O(\sqrt{l_1 l_2} \cdot \sqrt{n})$ for n terminals arbitrarily distributed in the $l_1 \times l_2$ grid. The bottom-up construction readily extends to general cell layouts, where it also achieves essentially zero clock skew within reasonably bounded total wirelength. We have tested our algorithms on numerous random examples and also on layouts of industrial benchmark circuits. The results are promising: our clock routing yields near-zero average clock skew while using total wirelength competitive with previously known methods.
I. INTRODUCTION
Circuit speed is a major consideration in the design of high-performance VLSI systems. In a synchronous VLSI design, limitations on circuit speed are determined by two factors: the delay on the longest path through combinational logic, and the maximum clock skew among the synchronizing components. With advances in VLSI fabrication technology, the switching speed of combinational logic increases dramatically. Thus, the clock skew induced by non-symmetric clock distribution has become a more significant limitation on circuit performance.
Minimization of clock skew has been studied by a number of researchers in recent years. H-tree constructions have been used extensively for clock routing in regular systolic arrays [2], [11], [15], [34]. Although the H-tree structure can significantly reduce clock skew [11], [34], it is applicable primarily when all of the synchronizing components are identical in size and are placed in a symmetric array. Ramanathan and Shin [23] proposed a clock distribution scheme for building block design where all blocks are organized in a hierarchical structure. They assume that all clock entry points are known at each level of the hierarchy and, moreover, that the number of blocks at each level is small since an exhaustive search algorithm is used to enumerate all possible routes. Fishburn [14] gave methods to maximize the margin of error in clocking constraints, and to minimize the clock period while avoiding clock hazards, or race conditions. This is accomplished via a linear programming formulation. However, the approach assumes that the entire clock tree topology is already known. Jackson, Srinivasan, and Kuh [18] presented a clock routing scheme for circuits with many small cells. Their algorithm recursively partitions a circuit into two equal parts, and then connects the center of mass of the whole circuit to the centers of mass of the two sub-circuits. Although it was shown that the maximum difference in path length from the root to different synchronizing components is bounded by $O(1/\sqrt{n})$ in the average case, small examples exist for which the wirelengths between clock source and clock pins can vary by as much as half the chip diameter.
In this paper, we first study the problem of high-performance clock routing for cell-based designs, i.e., circuits with many small cells, such as with standard-cell or sea-of-gates design styles. We then extend our method to general cell (also known as building-block) layouts, where the wiring is restricted to specific channels. In either of these scenarios, the H-tree approach cannot be used since synchronizing components may be of different sizes and may be in arbitrary locations in the layout. The method of [23] cannot be applied either, since there is no natural hierarchical structure associated with the design and the number of synchronizing components is typically too large to allow exhaustive examination of all possible routes. The algorithm of [18] is not completely satisfactory since large skews may result even for small examples, while the approach of [14] does not construct an actual clock routing topology. With this in mind, the goal of our present work is to develop a clock routing methodology which minimizes skew while incurring little added wiring expense.
We present a basic algorithm and several variants, which minimize skew by constructing a clock tree that is balanced with respect to root-leaf pathlengths in the tree (these notions will be formalized below). The approach is based on geometric matching: we start with a set of trees, each containing a single terminal of the clock signal net. At each level, we combine the trees into bigger trees using the edges of a geometric matching. The end result is a binary tree whose leaves are the terminals in the clock signal net and whose root is the clock entry point. Our method is particularly suitable for designs which employ a single large buffer to drive the entire clock tree, rather than a buffer hierarchy. There are a number of reasons for such a design choice, as discussed in [2]. We note that the recently announced DEC Alpha processor uses such a single-buffer design style [12].
0278-0070/93$03.00 © 1993 IEEE
In the cell-based design regime, our algorithm guarantees perfect pathlength balanced trees for inputs of four or fewer pins. Extensive experimental results indicate that even for large clock signal nets, the maximum difference of pathlengths in the clock tree constructed by our algorithm remains essentially zero. This performance is obtained without undue sacrifice of wirelength: we prove that on average the total wirelength in our clock tree construction is within a constant factor of the wirelength in the optimal Steiner tree. Furthermore, our worst-case clock tree cost is bounded by $O(\sqrt{l_1 l_2} \cdot \sqrt{n})$ for n terminals in the $l_1 \times l_2$ grid, which is the same bound as for the worst-case cost of the optimal Steiner tree.
Since the work in [18] addresses minimum-skew clock routing for cell-based designs, we implemented the algorithm of [18] for comparison purposes. For uniformly random sets of up to 1024 pins in the $l_1 \times l_2$ grid, our method produced clock routings with near-zero clock skew both in the average case and worst case, with total wirelength of the clock tree significantly lower than that produced by the method of [18]. In addition, our routing results for layouts of the MCNC Primary1 and Primary2 benchmarks are significantly better than those reported by [18]; we obtain perfectly balanced root-leaf pathlengths in the clock tree using several percent less total wire than the method of [18]. Actual clock skews for our benchmark routings, as determined by SPICE simulation, are reasonable.
We then apply our method to general cell design, by extending the notion of matching to arbitrary weighted graphs. In this scenario our algorithm produces a clock routing tree that is embedded in the channel intersection graph [10] of an arbitrary building-block layout. The clock routing trees produced by our method attain almost zero skew with only modest wirelength penalty. Experimental results show that the pathlength skew of our routing tree is less than 2% of the skew for a heuristic Steiner tree. This is achieved on average with less than 50% increase in wiring cost over the Steiner tree.
The remainder of this paper is organized as follows. Section II defines a number of basic concepts and gives a precise formulation of our skew minimization problem. In Section III, we present the clock routing algorithm in detail for cell-based designs; Section IV extends the algorithm to general cell layouts. Experimental results of our algorithm and comparisons with previous methods are presented in Section V, and Section VI concludes with possible extensions of the method. Early versions of this paper were presented in [19] and [8].
II. PRELIMINARIES
A synchronous VLSI circuit consists of two types of elements, synchronizing elements (such as registers) and combinational logic gates (such as NAND gates and NOR gates). The synchronizing elements are connected to one or more system-wide clock signals. Every closed path in a synchronous circuit contains at least one synchronizing element (Fig. 1). The speed of a synchronous circuit is mainly determined by the clock periods. It is well known [1], [18] that the clock period $C_p$ of each clock signal net satisfies the inequality:

$$C_p \ge t_d + t_{skew} + t_{su} + t_{ds}$$

where $t_d$ is the delay on the longest path through combinational logic, $t_{skew}$ is the clock skew, $t_{su}$ is the setup time of the synchronizing elements (assuming that the synchronizing elements are edge triggered), and $t_{ds}$ is the propagation delay within the synchronizing elements.
The term $t_d$ itself can be further decomposed into $t_d = t_{d\_interconnect} + t_{d\_gates}$, where $t_{d\_interconnect}$ is the delay associated with the interconnect of the longest path through combinational logic, and $t_{d\_gates}$ is the delay through the combinational logic gates on this path. As VLSI feature sizes become smaller, the terms $t_{su}$, $t_{ds}$, and $t_{d\_gates}$ all decrease significantly. Therefore, as noted above, $t_{d\_interconnect}$ and $t_{skew}$ become more dominant factors in determining circuit performance. It was noted in [1] that $t_{skew}$ may account for 10% or more of the system cycle time. The objective of this paper is to minimize $t_{skew}$, while we have subsequently addressed the problem of minimizing $t_{d\_interconnect}$ in a different work [9].
Given a routing solution for a clock signal net, the clock skew is defined to be the maximum difference among the delays from the clock entry point (CEP) to synchronizing elements in the net. The delay from the CEP to any synchronizing element depends on the wirelength from the CEP to the synchronizing element, the RC constants of wire segments in the routing, and the overall topology of the routing solution. Usually, the clock routing may be described as an RC tree [24], and we commonly use the first-order moment of the impulse response (also called Elmore delay) to approximate delay in an RC tree. The formulas derived by Rubinstein, Penfield and Horowitz [24] give both upper and lower bounds on delay in an RC tree.
However, although both the formula for Elmore delay [...]
A clock routing solution is represented by a rooted (Steiner) tree in the layout whose root is the CEP and whose leaves are synchronizing elements in the clock signal net. The length, or cost, of an edge in the tree is the Manhattan distance between the two endpoints of the edge, and the tree cost is the sum of all edge costs in the tree.
Definition: The pathlength skew of a tree is the maximum difference of the pathlengths in the tree from the root to any two leaves.
A tree is called a perfect pathlength balanced tree if its pathlength skew is zero. It is not difficult to construct a perfect pathlength balanced tree if we are allowed to use an arbitrary amount of wire. However, a routing tree with very high cost may distort the clock signal due to longer signal rise and fall times. Thus, we wish to construct a clock routing tree whose pathlength skew is as small as possible, without making the total tree cost too large. With this in mind, we formulate the clock routing problem as follows:
The Pathlength Balanced Tree (PBT) Problem: Given a set of n terminals, N, and real numbers B and S, find a clock routing tree connecting N such that the pathlength skew of the tree is bounded by S and the tree cost is bounded by B.
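The two quantities in the PBT formulation can be computed directly from a rooted tree. The following is a minimal sketch, assuming the tree is given as a child-list dictionary with node coordinates; the names `tree_cost_and_skew` and `satisfies_pbt` and the data layout are illustrative assumptions, not notation from the paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def manhattan(p: Point, q: Point) -> float:
    """Manhattan distance, used as the cost of a tree edge."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def tree_cost_and_skew(root: str,
                       children: Dict[str, List[str]],
                       pos: Dict[str, Point]) -> Tuple[float, float]:
    """Return (tree cost, pathlength skew) of a rooted routing tree.

    `children` maps each node to its child nodes (hypothetical layout);
    nodes without children are treated as leaves (clock terminals).
    """
    cost = 0.0
    leaf_lengths = []
    stack = [(root, 0.0)]                 # (node, pathlength from root)
    while stack:
        node, dist = stack.pop()
        kids = children.get(node, [])
        if not kids:
            leaf_lengths.append(dist)
        for kid in kids:
            edge = manhattan(pos[node], pos[kid])
            cost += edge
            stack.append((kid, dist + edge))
    return cost, max(leaf_lengths) - min(leaf_lengths)


def satisfies_pbt(root, children, pos, B: float, S: float) -> bool:
    """Check the two PBT bounds: tree cost <= B and pathlength skew <= S."""
    cost, skew = tree_cost_and_skew(root, children, pos)
    return cost <= B and skew <= S
```

For two terminals at (0, 0) and (4, 0), placing the root at the midpoint (2, 0) gives cost 4 and skew 0, while placing it at (1, 0) keeps the cost at 4 but raises the skew to 2.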
The following is immediately evident:
Theorem 1: The PBT problem is NP-hard.
Proof: Set $S = \infty$ so that the PBT problem simplifies to the minimum rectilinear Steiner tree problem, which is known to be NP-complete. □
Our objective is to give a heuristic algorithm for the PBT problem. For cell-based design methodologies, we wish to construct a clock tree with pathlength skew as small as possible, using wirelength as close as possible to that in an optimal Steiner tree. Specifically, we would like to obtain a clock routing solution in the $l_1 \times l_2$ grid which [...]
III. A CLOCK ROUTING ALGORITHM FOR CELL-BASED DESIGN
For cell-based design, point-to-point interconnection cost is closely approximated by (Manhattan) geometric distance. Thus, in developing our clock routing algorithm for cell-based layouts, we introduce the notion of a geometric matching:
Definition: Given a set of 2k terminals, a geometric matching on this set consists of k line segments between terminals, with no two of the k segments sharing an endpoint.
Each line segment in the matching defines a matching edge. The cost of a geometric matching is the sum of the costs of its matching edges. A geometric matching on a set of terminals is optimal if its cost is minimum among all possible geometric matchings. An example of an optimal geometric matching over four terminals is shown in Fig. 2.
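To make the definition concrete, the sketch below finds an optimal geometric matching by exhaustive search under the Manhattan metric. It is only practical for a handful of terminals and is not one of the matching algorithms the paper relies on; the function name `optimal_matching` is an assumption made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def manhattan(p: Point, q: Point) -> float:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def optimal_matching(points: List[Point]) -> Tuple[float, List[Tuple[int, int]]]:
    """Exhaustive minimum-cost perfect matching on an even number of points.

    Returns (cost, list of index pairs). The running time grows
    factorially, so this is for illustration on tiny inputs only.
    """
    if len(points) % 2:
        raise ValueError("need an even number of terminals")

    def solve(remaining: Tuple[int, ...]):
        if not remaining:
            return 0.0, []
        first, rest = remaining[0], remaining[1:]
        best_cost, best_pairs = float("inf"), []
        # The first unmatched terminal must be paired with someone:
        # try every possible partner and recurse on what is left.
        for k, partner in enumerate(rest):
            sub_cost, sub_pairs = solve(rest[:k] + rest[k + 1:])
            cost = manhattan(points[first], points[partner]) + sub_cost
            if cost < best_cost:
                best_cost, best_pairs = cost, [(first, partner)] + sub_pairs
        return best_cost, best_pairs

    return solve(tuple(range(len(points))))
```

For the four terminals (0, 0), (0, 1), (5, 0), (5, 1), the search pairs the two left terminals and the two right terminals, for a total cost of 2.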
To construct a tree by iterative matching, we begin with a forest of n isolated terminals (for convenience, assume that n is a power of 2), each of which is considered to be a tree with CEP equal to the location of the terminal itself. The minimum-cost geometric matching on these n CEPs yields n/2 segments, each of which defines a subtree with two nodes. The optimal CEP into each subtree of two nodes is the midpoint of the corresponding segment, i.e., such that the clock signal will have zero skew between the segment endpoints.
In general, the matching operation will pair up the CEPs (roots) of all trees in the current forest. At each level, we choose the root of each new merged tree to be the balance point which minimizes pathlength skew to the leaves of its two subtrees (see Fig. 3). The balance point is the point p along the "straight line" connecting the roots of the two subtrees, such that the maximum difference in pathlengths from p to any two leaves in the combined tree is minimized. Computing the balance point requires constant time if we know the minimum and maximum root-leaf pathlengths in each of the two subtrees, and these values can be maintained incrementally using constant time per node added to the clock tree.
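The constant-time balance point computation can be sketched as follows. This is one possible formulation, assuming each subtree is summarized by its minimum and maximum root-leaf pathlengths (`lo1`, `hi1`, `lo2`, `hi2`, names chosen here for illustration) and that `d` is the wirelength of the connection between the two roots. The skew as a function of the position `t` along the connection is piecewise linear and convex, so it suffices to test the two endpoints and the two breakpoints.

```python
def balance_point(d, lo1, hi1, lo2, hi2):
    """Pick the tapping position on a connection of length d.

    Position t is the distance from the root of subtree 1 (0 <= t <= d).
    Returns (t, skew, lo, hi): the chosen position, the resulting
    pathlength skew, and the min/max root-leaf pathlengths of the
    merged tree, which the next merge level needs.
    """
    def spread(t):
        hi = max(t + hi1, d - t + hi2)   # longest pathlength from p
        lo = min(t + lo1, d - t + lo2)   # shortest pathlength from p
        return hi - lo, lo, hi

    def clamp(t):
        return min(max(t, 0.0), d)

    # Breakpoints: where the two longest (resp. shortest) paths are equal.
    candidates = [0.0, float(d),
                  clamp((d + hi2 - hi1) / 2),
                  clamp((d + lo2 - lo1) / 2)]
    t = min(candidates, key=lambda c: spread(c)[0])
    skew, lo, hi = spread(t)
    return t, skew, lo, hi
```

With two single terminals 10 units apart, `balance_point(10, 0, 0, 0, 0)` returns the midpoint `t = 5.0` with zero skew. With `balance_point(1, 4, 4, 0, 0)` the connection is too short to compensate for the pathlength difference, so the best choice is `t = 0.0` with a residual skew of 3; this is the situation in which sliding the CEP alone cannot balance the tree.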
Notice that at each level of the recursion, we only have to match half as many nodes as in the previous level. Thus, after $\lceil \log n \rceil$ matching iterations, we obtain the complete clock tree topology. In practice, we actually compute min-cost maximum cardinality matchings, i.e., if there are 2m + 1 nodes, we find the optimal m-segment matching and match m + 1 CEPs at the next level. Fig. 4 describes our clock routing algorithm ALG1 for cell-based design.
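The bottom-up construction described above can be sketched in a few dozen lines. This is a simplified illustration, not the authors' implementation: it substitutes a quadratic-time greedy matching for the matching subroutine, routes each connection as a horizontal-then-vertical path, and omits the H-flipping refinement. All names (`Node`, `alg1_sketch`, and so on) are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Node:
    pos: tuple            # clock entry point (CEP) of this subtree
    lo: float = 0.0       # shortest root-leaf pathlength
    hi: float = 0.0       # longest root-leaf pathlength
    kids: tuple = ()


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def point_along(p, q, t):
    """Point at wirelength t from p on the horizontal-then-vertical route to q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    if t <= abs(dx):
        return (p[0] + math.copysign(t, dx), p[1])
    return (q[0], p[1] + math.copysign(t - abs(dx), dy))


def greedy_matching(nodes):
    """Repeatedly match the closest pair of still-unmatched roots."""
    order = sorted(combinations(range(len(nodes)), 2),
                   key=lambda ij: manhattan(nodes[ij[0]].pos, nodes[ij[1]].pos))
    used, pairs = set(), []
    for i, j in order:
        if i not in used and j not in used:
            used.update((i, j))
            pairs.append((i, j))
    leftover = [i for i in range(len(nodes)) if i not in used]
    return pairs, leftover


def merge(a, b):
    """Join two subtrees at the balance point of the connection between roots."""
    d = manhattan(a.pos, b.pos)

    def spread(t):
        return max(t + a.hi, d - t + b.hi) - min(t + a.lo, d - t + b.lo)

    def clamp(t):
        return min(max(t, 0.0), d)

    cands = [clamp((d + b.hi - a.hi) / 2), clamp((d + b.lo - a.lo) / 2), 0.0, d]
    t = min(cands, key=spread)
    lo = min(t + a.lo, d - t + b.lo)
    hi = max(t + a.hi, d - t + b.hi)
    return Node(point_along(a.pos, b.pos, t), lo, hi, (a, b)), d


def alg1_sketch(terminals):
    """Return (root node, total wirelength) of the matching-based clock tree."""
    forest = [Node(tuple(p)) for p in terminals]
    cost = 0.0
    while len(forest) > 1:
        pairs, leftover = greedy_matching(forest)
        nxt = [forest[i] for i in leftover]   # an odd node waits for the next level
        for i, j in pairs:
            node, d = merge(forest[i], forest[j])
            nxt.append(node)
            cost += d
        forest = nxt
    return forest[0], cost
```

For the four terminals (0, 0), (0, 2), (6, 0), (6, 2), the sketch pairs the left and right terminals, merges the two midpoints, and returns a root at (3.0, 1.0) with total wirelength 10 and pathlength skew `root.hi - root.lo` equal to zero.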
The following two results show that ALG1 indeed uses a reasonable amount of wirelength. We prove that our clock tree cost grows at the same asymptotic rate as the worst-case optimal Steiner tree cost over n terminals; we also show that our tree cost is on average within a constant factor of the optimal Steiner tree cost.
Theorem 2: For n terminals arbitrarily distributed in the $l_1 \times l_2$ grid, the maximum total wirelength of $T_{ALG1}$ is $O(\sqrt{l_1 l_2} \cdot \sqrt{n})$.
Proof: For n terminals in the $l_1 \times l_2$ grid, the worst-case cost of an optimal matching is $O(\sqrt{l_1 l_2} \cdot \sqrt{n})$ [31]. Since the clock tree is formed by the edges of a matching on n terminals, plus the edges of a matching on n/2 terminals, etc., the total edgelength in the tree is

$$\sum_{i=0}^{\lceil \log n \rceil} O\left(\sqrt{l_1 l_2} \cdot \sqrt{n/2^i}\right) = O\left(\sqrt{l_1 l_2} \cdot \sqrt{n}\right).$$ □

This is of the same order as the maximum possible total edge length for the optimal Steiner tree on n terminals [28]. Note that Theorem 2 does not directly relate the cost of our clock routing construction to the cost of the optimal Steiner tree; this is partially addressed by the following.
Theorem 3: For random sets of terminals chosen from a uniform distribution in the $l_1 \times l_2$ grid, the total edgelength of the ALG1 clock tree will be on average within a constant factor of the total edgelength of the optimal Steiner tree.
Proof: The minimum Steiner tree cost for n terminals randomly chosen from a uniform distribution in the $l_1 \times l_2$ Manhattan grid grows as [...] any level of the construction [31].
The balancing operation to determine the CEP of a merged tree is necessary because the root-leaf pathlength might vary between subtrees at a given stage of the construction. In general, when we merge subtrees $T_1$ and $T_2$ into a higher level subtree T, the optimal entry point of T will not be equidistant from the entry points of $T_1$ and $T_2$ (this can be seen in the example of Fig. 3). Intuitively, balancing entails "sliding" the CEP along the "bar of the H." However, it might not always be possible to obtain perfectly balanced pathlengths in this manner (see Fig. 5).
We therefore use a further optimization, which we call H-flipping: for each edge e added to the layout which matches CEPs on edges $e_1$ and $e_2$, replace the "H" formed by the three edges e, $e_1$, and $e_2$ by the "H" over the same four terminals which (i) minimizes pathlength skew, and (ii) to break ties, minimizes tree cost. We now prove that for four terminals it is always possible to find an "H" orientation which achieves zero clock skew, and we also bound the increase in wirelength caused by H-flipping for nets of size four. As discussed below, extensive empirical tests confirm that even for very large inputs, the H-flipping refinement almost always yields perfectly path-balanced trees with essentially no increase in wirelength.
If a net is of size two, ALG1 selects the midpoint of the segment connecting the two terminals as the balance point, and this clearly yields a perfect pathlength balanced tree. Now we show that for nets of size four, ALG1 with the H-flipping refinement also yields perfect pathlength balanced trees (a net of size three can be treated as a net of size four in which two terminals coincide).
Let a, b, c, and d be the terminals in a net of size four. Without loss of generality, assume that ab and cd are the edges in an optimal matching and $ab \ge cd$ (for convenience, we use ab to denote both the segment ab and also its length). Let $m_1$ and $m_2$ be the midpoints of ab and cd, respectively. According to ALG1, $m_1$ is chosen to be the root of the subtree for a and b, and $m_2$ is chosen to be the root of the subtree for c and d. Then, the algorithm tries to choose the balance point p on segment $m_1 m_2$ such that

$$\frac{ab}{2} + p m_1 = \frac{cd}{2} + p m_2. \tag{1}$$

It is easy to see that if $m_1 m_2 \ge (ab - cd)/2$, we can always choose p satisfying (1). In this case, the pathlengths from p to all four terminals are the same, so that we have a perfect pathlength balanced tree. However, if $m_1 m_2 < (ab - cd)/2$, we perform H-flipping and replace ab and cd by ad and bc. Then the midpoint $n_1$ on bc is the root of the subtree for b and c, and the midpoint $n_2$ on ad is the root of the subtree for a and d. We then seek $p'$ on $n_1 n_2$ such that

$$\frac{bc}{2} + p' n_1 = \frac{ad}{2} + p' n_2. \tag{2}$$
According to the following lemma, we are guaranteed to find $p'$ on $n_1 n_2$ satisfying (2).
Lemma 1: [...]
Proof: [...] Let x be the midpoint of bd. Using similar triangles and the triangle inequality, we obtain

$$\frac{ab}{2} = x n_2 \le n_1 n_2 + x n_1 = n_1 n_2 + \frac{cd}{2}$$

so that [...]
Lemma 1 implies that we can always choose the balance point $p'$ on $n_1 n_2$ after H-flipping. Therefore, ALG1 always constructs a perfect pathlength balanced tree for a net of size four. The following lemma shows that when we replace ab and cd by ad and bc in the H-flip, the wirelength increase is bounded by a constant factor.
[...] Thus, from (4) and (5) we have [...] or $bc + ad < 3(ab + cd)$.
Together, these lemmas imply:
Theorem 4: It is always possible to find an "H" orientation over four terminals which achieves zero clock skew, using at most a constant factor extra wirelength.
We now briefly discuss complexity issues and the requirement of an efficient implementation. Since our method is based on geometric matching, its time complexity depends on that of the matching subroutine. A well-known algorithm for general weighted matching requires time $O(n^3)$ [16], [21]. By taking advantage of the planar geometry, the algorithmic complexity can be reduced to $O(n^{2.5} \log n)$ [33]. However, even this may be excessive for large problem instances.
In order to solve problems of practical interest, and since there is no clear relationship between the optimality of the matching and the magnitude of the skew of the resulting clock tree, we may choose to speed up the implementation by using efficient geometric matching heuristics [3], [29], [30]. Although most of these methods were designed for the Euclidean plane, they also perform well in the Manhattan metric, especially if their output is further improved by uncrossing pairs of intersecting edges in the heuristic matching (in any metric, this reduces the matching cost due to the triangle inequality; to this end, note that k intersections of n line segments may be found efficiently in time $O(n \log n + k)$ [7]).
We shall later discuss empirical results from implementation of ALG1 based on three matching methods which require time $O(n)$, $O(n \log n)$ and $O(n \log^2 n)$, respectively. Each of these three matching heuristics yields very good clock routing solutions.
The basic approach of ALG1 thus consists of $\lceil \log n \rceil$ applications of the matching algorithm. H-flipping requires constant time per node, and therefore does not add to the asymptotic time complexity. If the underlying [...] i.e., the time complexity of ALG1 is asymptotically equal to the time complexity of the underlying matching algorithm.
IV. A CLOCK ROUTING ALGORITHM FOR GENERAL CELL DESIGN
The same idea of bottom-up iterative matching which we developed in the preceding section may be easily generalized to clock routing in block layouts. In this section, we extend our method to such general cell designs, where a circuit is partitioned into a set of arbitrarily-sized rectangular cells (also referred to as blocks). Blocks may be of widely varying sizes, and are not necessarily placed in any regular arrangement. The routing is carried out in the channels between blocks, with routing over blocks prohibited. For this design style, the approximation of routing cost by geometric distance, which we used for cell-based design in the previous section, does not apply. The feasible routing regions are represented by the channel intersection graph (CIG) [10], which represents the available routing channels induced by a module layout. To capture the locations of clock pins within channels, we use the augmented channel intersection graph (ACIG), which is constructed as follows: for each pin incident to a routing channel, introduce a new node into the channel intersection graph which breaks the channel edge into two new edges (see the top left of Fig. 9).
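The ACIG construction can be illustrated with a small sketch. It assumes, purely for the example, that the channel intersection graph is stored as a dictionary mapping an edge `(u, v)` to its length and that each pin is described by the channel edge it lies on together with its offset from `u`; the function name `augment_channel_graph` is hypothetical.

```python
def augment_channel_graph(edges, pins):
    """Insert one node per pin, splitting the channel edge the pin lies on.

    edges: {(u, v): length} for the channel intersection graph.
    pins:  {pin_name: ((u, v), offset_from_u)}.
    Returns the edge dictionary of the augmented graph.
    """
    by_edge = {}
    for name, (edge, offset) in pins.items():
        by_edge.setdefault(edge, []).append((offset, name))

    # Channel edges without pins are copied unchanged.
    result = {e: w for e, w in edges.items() if e not in by_edge}

    for (u, v), items in by_edge.items():
        prev, prev_off = u, 0.0
        # Walk along the channel from u to v, cutting at each pin in turn.
        for offset, name in sorted(items):
            result[(prev, name)] = offset - prev_off
            prev, prev_off = name, offset
        result[(prev, v)] = edges[(u, v)] - prev_off
    return result
```

For a channel edge of length 10 between intersections `c1` and `c2` with a pin `p` at offset 3, the augmented graph contains the edges `(c1, p)` of length 3 and `(p, c2)` of length 7 in place of the original edge.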
Our goal is still to construct a clock signal tree with both skew and total wirelength as small as possible, except that routing of tree edges is now restricted to lie within prescribed routing channels. Given a graph G with positive edge costs, we let $minpath_G(x, y)$ denote the minimum cost path between nodes x and y, and use $dist_G(x, y)$ to denote the cost of $minpath_G(x, y)$. The notion of a matching may be extended to arbitrary weighted graphs as follows: a generalized matching on a set of nodes $N \subseteq V$ in G pairs up nodes of N, with each matched pair (x, y) connected by $minpath_G(x, y)$; its cost is the sum of $dist_G(x, y)$ over the matched pairs.
Lemma 3: Each edge of G belongs to at most one shortest path in an optimal complete generalized matching on $N \subseteq V$ in G.
Proof: Let M be an optimal complete generalized matching on N. Suppose that edge e appears in both $minpath_G(x_i, y_i)$ and $minpath_G(x_j, y_j)$, where $(x_i, y_i)$ and $(x_j, y_j)$ are in M and $i \ne j$ (see Fig. 8). Because $(x_i, y_i)$ and $(x_j, y_j) \in M$ are shortest paths in G, we have [...] □
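One way to carry out the exchange argument behind Lemma 3 is sketched below. It is offered as a plausible reading under stated assumptions (G undirected, $e = (u, v)$ with cost $c(e) > 0$, and the pairs labeled so that both shortest paths traverse e from u to v), not as the authors' own derivation.

```latex
% Both shortest paths pass through e = (u, v), and subpaths of
% shortest paths are themselves shortest paths:
\begin{align*}
dist_G(x_i, y_i) &= dist_G(x_i, u) + c(e) + dist_G(v, y_i), \\
dist_G(x_j, y_j) &= dist_G(x_j, u) + c(e) + dist_G(v, y_j).
\end{align*}
% Re-pair the four nodes as (x_i, x_j) and (y_i, y_j) and apply the
% triangle inequality through u and through v:
\begin{align*}
dist_G(x_i, x_j) + dist_G(y_i, y_j)
  &\le dist_G(x_i, u) + dist_G(u, x_j) + dist_G(y_i, v) + dist_G(v, y_j) \\
  &=   dist_G(x_i, y_i) + dist_G(x_j, y_j) - 2\,c(e) \\
  &<   dist_G(x_i, y_i) + dist_G(x_j, y_j).
\end{align*}
% The re-paired matching is strictly cheaper, contradicting the
% optimality of M.
```

The same re-pairing step is what allows a heuristic matching to be improved whenever two of its shortest paths share an edge.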
Henceforth, we will assume that there are b blocks in the design. G is the underlying augmented channel intersection graph and we assume that the n clock terminals are embedded on edges of G.
Lemma 4: The routing cost between any two clock terminals in G is bounded by $l_1 + l_2$.
Proof: Let x and y be two clock terminals in G. Let $P_1$ be any monotone (staircase) path passing through x and connecting two opposite corners w and w' of the layout grid. Clearly, cost($P_1$) = $l_1 + l_2$. Similarly, let $P_2$ be a monotone path passing through y and connecting w and w'. Then, cost($P_1$) + cost($P_2$) = $2(l_1 + l_2)$. Since at least one of w or w' can be reached from both x and y with total cost at most $l_1 + l_2$, the shortest path between x and y has cost no more than $l_1 + l_2$. □
It is clear from Lemma 4 that an optimal complete generalized matching on the clock terminals in G has cost no more than $(l_1 + l_2) \cdot \lfloor n/2 \rfloor$.
As in the previous section, our basic strategy is to construct a clock tree by computing a sequence of generalized matchings on the clock terminals. We begin with a forest of n isolated clock terminals in G (again for convenience, we assume that n is a power of 2), each of which is a degenerate tree with CEP being the terminal itself. The optimal complete generalized matching on these n terminals yields n/2 paths, each of which defines a subtree. The optimal CEP into each subtree is the midpoint of the corresponding path, so that the clock signal will have zero skew between the two terminals. At each level, we compute an optimal generalized matching on the set of CEPs (roots) of all subtrees in the current forest and merge each pair of subtrees into a larger subtree. As before, the root (CEP) of the resulting tree is chosen to be the balance point on the path connecting the two subtrees such that the pathlength skew in the resulting tree is minimized (see Fig. 9). Notice that at each level of the recursion, we only have to match half as many nodes as at the previous level. Thus, in $\lceil \log n \rceil$ matching iterations, we obtain a complete clock tree topology. If n is not a power of 2, then as noted in the discussion of ALG1, there will be an odd number 2m + 1 of nodes to match at some level. For such cases, we compute an optimal maximum-cardinality generalized matching on 2m nodes, and then match m + 1 nodes at the next level. Fig. 10 gives a formal description of our clock routing algorithm ALG2 for general cell design.
The worst-case clock tree cost produced by the algorithm can be bounded as follows:
Theorem 5: Given b blocks in the $l_1 \times l_2$ grid and n terminals of a clock signal net, the cost of the clock tree created by ALG2 is at most $(l_1 + l_2) \cdot n$.
Proof: By Lemma 4, the cost of a generalized matching on n terminals is bounded by $(l_1 + l_2) \cdot \lfloor n/2 \rfloor$. After each iteration, the number of nodes to be matched is reduced by half. Therefore, the total clock tree cost is bounded by

$$\sum_{i=1}^{\lceil \log n \rceil} (l_1 + l_2) \cdot \frac{n}{2^i} \le (l_1 + l_2) \cdot n.$$ □

[...] We may then apply an $O(n^3)$ algorithm for computing an optimal complete matching in general graphs [21]. However, this complexity will result in long runtimes for large problem instances. Therefore, in order to achieve an efficient implementation, we use the greedy matching heuristic [26]. Such a heuristic matching may be improved by removing overlapping edges of shortest paths, as described in the proof of Lemma 3, so that no edge is used in more than one shortest path. The time complexity of each iteration of ALG2 is dominated by the $O(b^2)$ all-pairs shortest paths computation, which we invoke $\lceil \log n \rceil$ times, so that the overall time complexity of ALG2 is $O(b^2 \cdot \log n)$. This complexity is reasonable since the number of blocks is typically not large.
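One matching iteration on a weighted graph can be sketched as an all-pairs shortest paths computation followed by a greedy pairing of terminals by graph distance. The sketch below uses Floyd-Warshall purely for brevity, which is a stand-in and does not attain the running time quoted above; the function names and the edge-dictionary representation are assumptions made for illustration.

```python
from itertools import combinations


def all_pairs_shortest_paths(nodes, edges):
    """Floyd-Warshall on an undirected graph given as {(u, v): cost}."""
    inf = float("inf")
    dist = {u: {v: (0.0 if u == v else inf) for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def greedy_generalized_matching(terminals, dist):
    """Pair terminals greedily, shortest graph distance first."""
    order = sorted(combinations(terminals, 2),
                   key=lambda p: dist[p[0]][p[1]])
    used, pairs = set(), []
    for u, v in order:
        if u not in used and v not in used:
            used.update((u, v))
            pairs.append((u, v))
    return pairs
```

On a four-node cycle with edge costs A-B = 1, B-C = 5, C-D = 1 and A-D = 4, the shortest A-to-C route goes through D at cost 5, and the greedy matching pairs A with B and C with D.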
V. EXPERIMENTAL RESULTS
Both ALGl and ALG2 were implemented in ANSI C for the Sun-4, Macintosh, and IBM 3090 environments.
This section summarizes the simulation results.
Empirical Data for Cell-Based Designs
We have implemented three basic heuristic variants of ALG1, corresponding to different matching subroutines. The first heuristic variant (SP) uses the linear-time space partitioning heuristic of [30] to compute an approximate matching; the second variant (GR) uses an $O(n \log^2 n)$ greedy matching heuristic [29]; and the third variant (SFC) uses an $O(n \log n)$ space-filling curve-based method [3]. We have further tested these three variants by running each both with and without two refinements: (1) removing all edge crossings in the heuristic matching, and (2) performing "H-flipping" as necessary. Either of these optimizations can be independently added to any of the three variants, yielding a total of twelve distinct versions of the basic algorithm. The variants of the algorithm are denoted and summarized as follows:
SP: Use the space-partitioning matching heuristic of [30], which induces the matching through recursive bisection of the region (rather than bisection of the set of terminal locations).
GR: Use a greedy matching heuristic, which always adds the shortest edge between unmatched terminals [29].
SFC: Use a space-filling curve to map the plane to a circle, then choose the better of the two embedded matchings (i.e., either all odd edges or all even edges in the induced Hamiltonian cycle through the terminal locations) [3].
SP+E, GR+E, SFC+E: Same as SP, GR, and SFC, respectively, except that the heuristic matching cost is further improved by edge-uncrossing.
SP+H, GR+H, SFC+H: Same as SP, GR, and SFC, respectively, except that pathlength skew is further reduced by H-flipping.
SP+E+H, GR+E+H, SFC+E+H: Same as SP, GR, and SFC, respectively, except that both edge-uncrossing and H-flipping are performed.
For comparison, we also implemented
MMM: The method of means and medians, similar [...]
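The edge-uncrossing refinement used by the +E variants can be approximated by a simple local improvement pass. The sketch below does not detect intersections geometrically; instead it tests every pair of matching edges and re-pairs their four endpoints whenever that strictly lowers the Manhattan cost, which is a quadratic-per-pass stand-in chosen for brevity. The function name `improve_matching` is an assumption.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Pair = Tuple[int, int]


def manhattan(p: Point, q: Point) -> float:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def improve_matching(points: List[Point], pairs: List[Pair]) -> List[Pair]:
    """Re-pair endpoints of two matching edges while that reduces the cost."""
    def cost(pair: Pair) -> float:
        return manhattan(points[pair[0]], points[pair[1]])

    pairs = list(pairs)
    improved = True
    while improved:                      # stops: the cost strictly decreases
        improved = False
        for x in range(len(pairs)):
            for y in range(x + 1, len(pairs)):
                (a, b), (c, d) = pairs[x], pairs[y]
                current = cost(pairs[x]) + cost(pairs[y])
                # The only two other ways to pair up a, b, c, d.
                for p1, p2 in (((a, c), (b, d)), ((a, d), (b, c))):
                    if cost(p1) + cost(p2) < current - 1e-12:
                        pairs[x], pairs[y] = p1, p2
                        improved = True
                        break
    return pairs
```

Starting from the two crossing diagonals of the square with corners (0, 0), (4, 4), (0, 4), (4, 0), the pass replaces them with two sides of the square, reducing the matching cost from 16 to 8.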
The algorithms were tested on random sets of up to 1024 terminals generated from a uniform distribution in the 1000 × 1000 grid (i.e., $l_1 = l_2 = 1000$). Results for a sample run with 50 random terminal sets at each cardinality are summarized here: Table I compares the average tree costs and Table II the average pathlength skews for all heuristics. The data in the tables is given in grid units.
The computational results indicate that both optimizations (edge-uncrossing and H-flipping) significantly improve both skew and total wirelength. When the refinements are combined, average pathlength skew essentially vanishes, and the wirelength of several variants is superior to the output of MMM. The best variant appears to be GR+E+H, which is based on the greedy matching heuristic together with edge-uncrossing and H-flipping.
Note that the cost of the greedy matching is asymptotically as good as that of the optimal matching [26]. Tables III and IV highlight the contrast between GR+E+H and MMM for both total wirelength and skew. Figs. 11 and 12 depict these same comparisons graphically.

Table data (grid units); each row gives the number of terminals followed by two groups of three values:
   4     656   1197   1823     555   1125   1668
   8    1089   2136   2943    1123   1979   2810
  16    2841   3506   4221    2793   3322   3993
  32    4813   5598   6216    4695   5273   5866
  64    7624   8377   9266    7372   7982   8556
 128   11439  12276  13136   11052  11697  12243
 256   17220  17874  18549   16379  16955  17543
 512   25093  25666  26291   23866  24465  25325
1024   36126  36765  37561   34231  34965  36179

As noted in [20], any set of approximation heuristics induces a meta-heuristic which returns the best solution found by any heuristic in the set; we also implemented this (denoted as "Meta"), which returns the minimum-skew result from all of the other variants. Interestingly, in our experience Meta always returns a perfect pathlength balanced tree, i.e., for each problem instance, at least one of the other heuristic variants will yield a zero clock skew solution. This is very useful, especially when the heuristics are of similar complexity. For example, we can solve the Primary1 benchmark using all twelve methods in under two minutes on a Sun-4/60 workstation.
Figs. 13 and 14 illustrate the output of variant GR+E+H on the Primary1 and Primary2 benchmarks, using the same placement solutions as in [18]; note that although edges are depicted as straight lines in these diagrams, they are actually routed rectilinearly. Table V compares the results of GR+E+H and the results of [18] which were provided by the authors [27]: GR+E+H completely eliminates pathlength skew while using 5%-7% less wirelength.
To confirm the correlation between the linear delay model and actual delay, we ran SPICE simulations on the Primary1 and Primary2 clock trees (using MOSIS 2.0-μm CMOS technology parameters and 0.3-pF sink loading capacitance); the simulated skews of our clock trees for Primary1 and Primary2 were 181 ps and 741 ps, respectively.* Notice that this clock skew was obtained simply by balancing CEP-leaf pathlengths; as discussed in Section VI, more sophisticated delay models can yield a better choice of balance points in the matching-based construction.

*Vias and parasitic difference between metal layers were not considered in our simulation because detailed layer assignment has not been determined at this stage of clock routing.
Empirical Data for General Cell Designs
We have tested ALG2 on two sets of test cases. One set of examples contains clock nets of sizes 4, 8, and 16 on 16 blocks, and the other set contains clock nets of sizes 4, 8, and 16 on 32 blocks. Block sizes and layouts were assigned randomly in the grid by creating a fixed number of non-overlapping blocks, with length, width, and lower-left coordinates all chosen from uniform distributions on the interval [0, 1000] (i.e., l1 = l2 = 1000).
For each net size (and block number), 100 instances were generated randomly, and we compared the skew and cost of the ALG2 routing trees with those produced by the 1-Steiner heuristic [20]. Results are shown in Tables VI and VII. The skew of our clock tree is very close to zero. In no case is it more than 2% of the skew of the Steiner tree routing. The increase in total wirelength of our routing tree varies from 24% to 77% when compared with the Steiner tree. The data in the tables is given in grid units.
As with the cell-based layout benchmarks, we ran SPICE simulations on a number of examples (again using MOSIS 2.0-μm CMOS technology and 0.3-pF gate loading capacitance). The actual skew of our clock tree is consistently much smaller than that of a Steiner tree. For a typical 16-pin clock net in a 16-block design, the skews of our clock tree and the Steiner tree are 18 and 69 ps, respectively.
For the routing tree produced by ALG2, we may have overlapping edges in a channel because matching paths at different levels may use the same channel. However, by Lemma 3, no channel segment will appear in more than [...]. Table VII shows the average edge density in channels, computed as the average of non-zero local column densities over all columns in all channels.
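The average edge density statistic can be illustrated with a short Python sketch. The representation used here, in which each channel is a list of per-column local densities, is an assumption made for the example rather than a description of our data structures.

```python
def average_edge_density(channels):
    """Average of the non-zero local column densities over all columns in all channels.

    `channels` is a list of channels; each channel is a list giving, for every
    column, the number of clock-tree edges crossing that column (assumed layout).
    """
    nonzero = [d for columns in channels for d in columns if d > 0]
    if not nonzero:
        return 0.0  # no clock wiring in any channel
    return sum(nonzero) / len(nonzero)


# Hypothetical example: three channels, only three columns carry clock wires.
example_channels = [[0, 1, 2, 0], [1, 0, 0], [0, 0]]
density = average_edge_density(example_channels)
```

Columns with zero density are excluded so that long empty channels do not dilute the measure of overlap where clock wires are actually present.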
VI. REMARKS AND EXTENSIONS
We recommend that the global clock routing of ALG1 or ALG2 be performed before other wiring, following standard practice. In this way, there are no wire-crossing conflicts since two layers of metal are used, one for horizontal wires, and the other for vertical wires. The exact routing of the clock tree topology may be determined in the detailed routing step.
For cell-based design, we can realize additional wirelength savings in our clock tree routing by varying the geometric embedding of individual wires in the layout. In the Manhattan metric, the "balance point" of a wire connecting two terminals is not unique but is rather a locus of many possible terminals (Fig. 15), with the extremes corresponding to the two L-shaped wire orientations. Our current implementation sets the balance point of a segment to be its "Euclidean" midpoint, but this is not necessarily an optimal choice. Using a graph-theoretic formulation, we can easily derive a polynomial-time method, based on general graph matching, for finding the optimal set of balance points within these loci. The wire embedding at each level of our algorithm may also benefit from lookahead of one or more levels, i.e., when we reach a situation where pathlength skew cannot be eliminated even via the utilization of an H-flip, we can go back one or two levels in the subtrees involved and try different H-flips during previous iterations on those subtrees. In our experience, this strategy easily allows complete elimination of clock skew at the current level, and requires only a constant amount of computation provided the lookahead depth (i.e., number of levels) is bounded by a constant. With respect to Fig. 15, note that because the routing layers have different electrical characteristics, the choice of balance points must be optimized both with respect to locations and the actual embeddings of the wires incident to the balance point. If the layer assignment is prescribed, the balance point computation is straightforward. Alternatively, deciding between various optional embeddings may be accomplished using one level of lookahead as in [32].
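To make the non-uniqueness of the Manhattan balance point concrete, the following Python sketch computes the midpoints of the two L-shaped embeddings of a wire between two terminals; these are the extremes of the locus discussed above. The helper names and the polyline representation are assumptions introduced for this illustration only.

```python
def manhattan(p, q):
    """Rectilinear (Manhattan) distance between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def point_along(path, dist):
    """Return the point at distance `dist` along a rectilinear polyline."""
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        seg = abs(x1 - x0) + abs(y1 - y0)
        if seg > 0 and dist <= seg:
            t = dist / seg
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        dist -= seg
    return path[-1]


def l_shape_balance_points(p, q):
    """Midpoints of the horizontal-first and vertical-first L-shaped wires from p to q."""
    half = manhattan(p, q) / 2
    horizontal_first = [p, (q[0], p[1]), q]
    vertical_first = [p, (p[0], q[1]), q]
    return point_along(horizontal_first, half), point_along(vertical_first, half)


# Hypothetical terminals; both returned points are equidistant from p and q.
p, q = (0, 0), (4, 2)
b_hv, b_vh = l_shape_balance_points(p, q)
```

For these terminals the Euclidean midpoint (2, 1) lies on the segment joining the two extreme balance points, and every point on that segment is at Manhattan distance 3 from both terminals.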
Another important extension lies in the selection of the CEP at each level. Instead of using the linear delay model to select a CEP, we may use a more accurate distributed RC model to select the CEP so that clock skew is reduced by as much as possible. This is a strictly local modification of our method and does not affect the execution of the rest of the algorithm (or any variant). Such an extension applies to both ALG1 and ALG2, and is particularly useful when varying capacitive loadings exist at the terminals of the clock net. Since our algorithm operates in a bottom-up fashion, and since we treat each level independently, our method is able to accommodate variable gate loading very naturally.†

†We note that Tsay [32] recently gave a clock routing algorithm which uses a bottom-up construction approach similar to the one described in this paper. Tsay's algorithm incorporates one level of look-ahead and the introduction of "extra" wire to achieve an exact zero-skew tree with respect to the Elmore delay model [13]. At each step, Tsay's method combines a pair of zero-skew trees to yield a new zero-skew tree of larger size. The linear-time "Deferred-Merge Embedding" (DME) algorithm of [4]-[6] generalizes look-ahead in maintaining all loci of CEP's that are compatible with a zero-skew tree construction. DME thus reduces the cost of an initial clock tree topology computed by any previous method, while maintaining exact zero clock skew. In regimes where the linear delay model applies, the DME method produces the optimal (i.e., minimum-cost) zero-skew clock tree with respect to the prescribed topology, and this tree will also enjoy optimal source-terminal delay [4], [5]. It is noteworthy that with respect to DME, our present matching-based approach yields topologies which lead to lower cost trees than such other initial topologies as those of [6], [18], [32].
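As an illustration of how an RC-based delay model might replace the pathlength midpoint when choosing a CEP, the Python sketch below places the tap point on a wire joining two subtrees so that the Elmore delays to both sides are equal. The model is an assumption made for this example: each wire piece of length d contributes resistance r_unit*d and capacitance c_unit*d, with half of the wire capacitance lumped at its far end, and each subtree is summarized by its internal delay and total capacitance. The closed form follows from equating the two delays under that model and is offered in the spirit of the zero-skew constructions cited above, not as a reproduction of any of them.

```python
def elmore_delay(wire_len, r_unit, c_unit, t_sub, c_sub):
    """Elmore delay through a wire of length `wire_len` into a subtree (assumed model)."""
    r = r_unit * wire_len
    c = c_unit * wire_len
    return r * (c / 2 + c_sub) + t_sub


def zero_skew_tap(t1, c1, t2, c2, length, r_unit, c_unit):
    """Fraction x of the wire, measured from subtree 1, at which the delays balance.

    Returns None when x falls outside [0, 1], i.e. when no point on a wire of
    this length balances the two sides. Assumes length > 0.
    """
    rl = r_unit * length
    x = ((t2 - t1) + rl * (c2 + c_unit * length / 2)) / (rl * (c_unit * length + c1 + c2))
    return x if 0.0 <= x <= 1.0 else None


# Hypothetical parameter values in arbitrary consistent units.
R_UNIT, C_UNIT, LENGTH = 0.1, 0.2, 10.0
x_sym = zero_skew_tap(5.0, 0.5, 5.0, 0.5, LENGTH, R_UNIT, C_UNIT)
x_asym = zero_skew_tap(2.0, 1.0, 3.0, 0.4, LENGTH, R_UNIT, C_UNIT)
d1 = elmore_delay(x_asym * LENGTH, R_UNIT, C_UNIT, 2.0, 1.0)
d2 = elmore_delay((1 - x_asym) * LENGTH, R_UNIT, C_UNIT, 3.0, 0.4)
```

With identical subtrees the tap falls at the midpoint, matching the pathlength-balanced choice; with unequal loads it shifts toward the slower or more heavily loaded side.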
Finally, we mention that the PBT problem is interesting from a theoretical standpoint: the tradeoff between pathlength balance and total edgelength appears important not only for clock skew minimization, but also for a number of applications in areas ranging from computational geometry to network design.
In summary, we have presented a bottom-up approach for constructing clock routing trees, for both cell-based and general cell designs. Skew minimization is achieved by constructing the clock tree iteratively through geometric or graph matchings, while carefully balancing the pathlengths from the root to all leaves at each level of the construction. We verified our algorithm on numerous random examples, on industry benchmark circuits, and by SPICE timing simulations; the results show near-zero average clock skew while using total wirelength that compares favorably with previous work.
