In the design of high-performance VLSI systems, minimization of clock skew is an increasingly important objective. Additionally, the wirelength of clock routing trees should be minimized in order to reduce system power requirements and deformation of the clock pulse at the synchronizing elements of the system. In this paper, we first present the Deferred-Merge Embedding (DME) algorithm, which embeds any given connection topology to create a clock tree with zero skew while minimizing total wirelength. The algorithm always yields exact zero skew trees with respect to the appropriate delay model. Experimental results show an 8% to 15% wirelength reduction over previous constructions in [17, 18]. The DME algorithm may be applied to either the Elmore or linear delay model, and yields optimal total wirelength for linear delay. DME is a very fast algorithm, running in time linear in the number of synchronizing elements. We also present a unified BB+DME algorithm, which constructs a clock tree topology using a top-down balanced bipartition (BB) approach, and then applies DME to that topology. Our experimental results indicate that both the topology generation and embedding components of our methodology are necessary for effective clock tree construction. The BB+DME method averages 15% wirelength savings over the previous method of [17], and also gives 10% average wirelength savings when compared to the method of [25]. The paper concludes with a number of extensions and directions for future research.
Introduction
In synchronous VLSI designs, circuit speed is increasingly limited by two factors: (i) delay on the longest path through combinational logic, and (ii) clock skew, which is the maximum difference in arrival times of the clocking signal at the synchronizing elements of the design. This is seen from the following well-known inequality governing the clock period of a clock signal net [2, 17]:
$$\text{clock period} \ge t_d + t_{skew} + t_{su} + t_{ds}$$
where $t_d$ is the delay on the longest path through combinational logic, $t_{skew}$ is the clock skew, $t_{su}$ is the set-up time of the synchronizing elements (assuming edge triggering), and $t_{ds}$ is the propagation delay within the synchronizing elements. The term $t_d$ can be further decomposed into $t_d = t_{d\_interconnect} + t_{d\_gates}$, where $t_{d\_interconnect}$ is the delay associated with the interconnect of the longest path through combinational logic, and $t_{d\_gates}$ is the delay through the combinational logic gates on this path. Increased switching speeds due to advances in VLSI fabrication technology will significantly decrease the terms $t_{su}$, $t_{ds}$, and $t_{d\_gates}$. Therefore, $t_{d\_interconnect}$ and $t_{skew}$ become the dominant factors in determining circuit performance: Bakoglu [2] has noted that $t_{skew}$ may account for over 10% of the system cycle time in high-performance systems. With this in mind, a number of researchers have recently studied the clock skew minimization problem.
Several results address formulations with inherently small problem size. For building block design styles, Ramanathan and Shin [21] have proposed a clock distribution scheme which applies when the blocks are hierarchically organized. The number of blocks at each level of the hierarchy is assumed to be small, since the algorithm exhaustively enumerates all possible clock routings and clock buffer optimizations. Burkis [5] and Boon et al. [4] have also proposed hierarchical clock tree synthesis approaches involving geometric clustering and buffer optimization at each level. More powerful clock tree resynthesis or reassignment methods were used by Fishburn [13] and Edahiro [11] to minimize the clock period while avoiding hazards or race conditions; Fishburn employed a mathematical programming formulation, while Edahiro employed a clustering-based heuristic augmented by techniques from computational geometry. All of these methods are essentially limited to small problem sizes, either by their algorithmic complexity or by their reliance on strong hierarchical clustering. In contrast, we are interested in clock tree synthesis for "flat" problem instances with many sinks (synchronizing elements), as will arise in large standard-cell, sea-of-gates, and multichip module designs.
Clock tree construction for designs with many clock sinks was first attacked by the H-tree method, which was used in regular systolic arrays by Bakoglu and other authors [1, 10, 14, 26]. The H-tree structure can significantly reduce clock skew [10, 26], but is applicable only when all of the sinks have identical loading capacitances and are placed in a symmetric array. A more robust clock tree construction for cell-based layouts is due to Jackson, Srinivasan and Kuh [17]: their "method of means and medians" (MMM) algorithm generates a topology by recursively partitioning the set of sinks into two equal-sized subsets, then connecting the center of mass of the entire set to the centers of mass of the two subsets. While the MMM solution will have reasonable skew on average, Kahng et al. [18] gave small examples for which the source-sink pathlengths in the MMM solution may vary by as much as half of the chip diameter. In some sense, this reflects an inherent weakness in the top-down approach: it can commit to an unfortunate topology early on in the construction. Kahng et al. [18, 9] have proposed a bottom-up matching approach to clock tree construction: in practice their method eliminates all source-sink pathlength skew, while using 5-7% less total wirelength than the MMM algorithm. However, as the method of [18, 9] focuses primarily on pathlength balancing, it addresses clock skew minimization only in the sense of the linear delay model. Tsay [25] uses ideas similar to both [17] and [18], and achieves exact zero skew trees with respect to the Elmore delay model [12, 22]. His algorithm was the first to produce trees with exact zero skew in all cases. In the same spirit as the method of [18], Tsay's method recursively combines pairs of zero skew trees at "tapping points", analogous to the "balance points" in [18], to yield larger zero skew trees.
The primary motivation behind our work is to minimize the total wirelength of clock routing trees while maintaining exact zero skew with respect to the appropriate delay model. Total wirelength is a critical parameter of the clock routing solution since excess interconnect not only increases layout area but also results in greater tree capacitance, thus requiring more power for distribution of the clock signal. However, both the top-down method of [17] and the bottom-up methods of [18, 9, 25] concentrate on the problem of computing a clock tree topology, and only incompletely address the associated problem of finding a minimum-cost embedding of the topology. These previous methods are actually quite inflexible in that they permanently embed each internal node of the tree as soon as it becomes defined [18], or else choose the embedding with at most one level of lookahead in the tree construction [17, 25].
In this paper, we first propose a new approach which achieves exact zero skew while significantly reducing the total wirelength of the clock tree. The basic idea of our Deferred-Merge Embedding (DME) algorithm is to defer the embedding of internal nodes in a given topology for as long as possible: (i) a bottom-up phase computes loci of feasible locations for the roots of recursively merged subtrees, and (ii) a top-down phase then resolves the exact embedding of these internal nodes of the clock tree. In practice, the DME algorithm begins with an initial clock tree computed by any previous method, then maintains exact zero clock skew while reducing the wirelength. In regimes where the linear delay model applies, our method produces the optimal (i.e., minimum wirelength) zero skew clock tree with respect to the prescribed topology, and this tree will also enjoy optimal source-sink delay. Experimental results in Section 4 below show that the DME approach is highly effective in both the Elmore and linear delay models. We achieve average savings in total clock tree wirelength of 15% over the MMM algorithm [17] and 8% over the method of Kahng et al. [18]. In all cases, our clock trees have exact zero skew according to the appropriate delay model, and our Elmore delay computations have been confirmed by SPICE simulations which show sub-picosecond skew on all benchmark examples.
Since the DME algorithm only optimizes a prescribed topology, it cannot achieve all possible improvement of the clock tree construction. Thus, to complement this successful embedding method, we also propose a new top-down heuristic for constructing an initial clock tree topology, based on the geometric concept of a balanced bipartition (BB). Applying our embedding to topologies generated in this way yields a unified BB+DME algorithm which gives very promising results: we achieve a 15% reduction in tree cost and [...] as compared with the MMM algorithm [17], and we achieve a 10% reduction in tree cost and a 22% reduction in Elmore delay as compared with the method of Tsay [25].$^1$ Again, all of our solutions have exact zero skew. Our methods are quite robust, and extend to prescribed skew formulations as well as more general optimizations of topologies for both clock routing and global routing. Furthermore, because our method implicitly maintains all possible minimum-cost embeddings of a topology, it may be used to reroute the clock net while preserving minimum wirelength, as may be necessary when channel density must be minimized.
The remainder of this paper is organized as follows. In Section 2, we formalize the minimum-cost zero skew clock routing problem and also establish the linear and Elmore delay models that are used in the subsequent discussion. Section 3 presents our main results. These include: (i) the Deferred-Merge Embedding (DME) algorithm for efficiently embedding a given topology; (ii) application of the DME algorithm to both the linear and Elmore delay regimes; and (iii) our unified BB+DME algorithm, which uses a top-down balanced bipartitioning (BB) strategy to derive a good tree topology to which the DME algorithm may be applied. Section 4 gives experimental results and comparisons with previous work, and Section 5 concludes with directions for future research.
Problem Formulation
The placement phase of physical layout determines positions for the synchronizing elements of a circuit, which we call the sinks of the clock net. A finite set of sink locations, denoted by $S = \{s_1, s_2, \ldots, s_n\} \subset \Re^2$, specifies an instance of the clock routing problem. A connection topology is defined to be a rooted binary tree, $G$, which has $n$ leaves corresponding to the set of sinks $S$. A clock tree $T(S)$ is an embedding of the connection topology in the Manhattan plane.$^2$ The embedding associates a placement in $\Re^2$ with each node $v \in G$; we will use $pl(T, v)$ or $pl(v)$ to represent this location. When no confusion arises, we may also denote $pl(T, v)$ simply by $v$. The root of the clock tree is the clock source, denoted by $s_0$. We direct all edges of the clock tree away from the source; a directed edge from $v$ to $w$ may be uniquely identified with $w$ and written as $e_w$. We say that $v$ is the parent of $w$, and $w$ is a child of $v$; the set of all children of $v$ is denoted by $children(v)$. The wirelength, or cost, of the edge $e_w$ is denoted by $|e_w|$, and must be greater than or equal to the Manhattan distance between its endpoints $pl(w)$ and $pl(v)$.$^3$ The cost of $T(S)$, denoted $cost(T(S))$, is the total wirelength of the edges in $T(S)$.
For a given clock tree $T(S)$, let $t_d(s_0, s_i)$ denote the signal propagation time, or delay, on the unique path from source $s_0$ to sink $s_i$; the collection of edges in this path is denoted by $path(s_0, s_i)$. The skew of $T(S)$ is the maximum value of $|t_d(s_0, s_i) - t_d(s_0, s_j)|$ over all sink pairs $s_i, s_j \in S$. If the skew of $T(S)$ is zero then it is called a zero skew clock tree (ZST). Given a set $S$ of sinks, the zero skew clock routing problem is to construct a ZST $T(S)$ of minimum cost. A variant of the zero skew clock routing problem asks for a minimum-cost ZST with a prescribed connection topology:
Zero Skew Clock Routing Problem ($S$, $G$): Given a set $S$ of sink locations, and given a connection topology $G$, construct a zero skew clock tree $T(S)$ with topology $G$ and having minimum cost.
The notion of a zero skew clock tree is well defined only in the context of a method for evaluating signal delays. The delay from the source to any sink depends on the wirelength of the source-sink path, the RC constants of the wire segments in the routing, and the underlying connection topology of the clock tree.$^4$ Using equations such as those of Rubinstein et al. [22], one can achieve tight upper and lower bounds on delay in a distributed RC tree model of the clock net. However, in practice it is appropriate to apply one of two simpler RC delay approximations, either the linear model or the Elmore model, both of which are easier to compute and optimize during clock tree design.
Delay Models

Linear Delay
In the linear delay model, the delay along $path(s_0, s_i)$ is proportional to the length of the path and is independent of the rest of the connection topology. Normalized by an appropriate constant factor, the linear delay between any two nodes $u$ and $w$ in a source-sink path is
$$t_{LD}(u, w) = \sum_{e_v \in path(u, w)} |e_v|.$$
While less accurate than the distributed RC tree delay formulas of Rubinstein et al. [22], the linear delay model has been effectively used in clock tree synthesis [18, 21]. In general, use of the linear approximation is reasonable with older ASIC technologies, which have larger mask geometries and slower packages. Tsay [25] notes that the linear delay model is also proper for emerging optical and wave interconnect technologies. In addition, we observe that linear delay applies to hybrid packaging technologies, which have relatively large interconnect geometries [24].
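As a small illustration of the linear delay model and of the skew definition from the problem formulation, the following sketch computes source-sink delays as sums of edge lengths. The tree encoding (a `children` dictionary and an `edge_len` dictionary keyed by the child node of each edge) and the function names are assumptions made for this example; they are not notation from the paper.

```python
from typing import Dict, List


def linear_delays(root: str,
                  children: Dict[str, List[str]],
                  edge_len: Dict[str, float]) -> Dict[str, float]:
    """Return t_LD(root, s) for every leaf s, i.e. the sum of |e_v| on the path."""
    delays: Dict[str, float] = {}
    stack = [(root, 0.0)]
    while stack:
        node, dist = stack.pop()
        kids = children.get(node, [])
        if not kids:                       # a leaf of the topology is a sink
            delays[node] = dist
        for w in kids:                     # edge e_w is identified with its child w
            stack.append((w, dist + edge_len[w]))
    return delays


def skew(delays: Dict[str, float]) -> float:
    """Maximum delay difference over all sink pairs."""
    return max(delays.values()) - min(delays.values())


# Hypothetical three-edge tree: source s0, one internal node v, two sinks.
children = {"s0": ["v"], "v": ["s1", "s2"]}
edge_len = {"v": 3.0, "s1": 2.0, "s2": 2.0}
d = linear_delays("s0", children, edge_len)
print(d, skew(d))
```

Because the delay to a sink depends only on its own path length, the tree above has zero skew under linear delay exactly when all source-sink pathlengths are equal.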
Elmore Delay
With smaller device dimensions and higher ASIC system speeds, a distributed RC tree model for signal delay in clock nets is often required to derive accurate timing information. Typically, we use the first-order moment of the impulse response, also known as the Elmore delay [6, 8, 25]. The Elmore delay model is developed as follows. Let $\alpha$ and $\beta$ respectively denote the resistance and capacitance per unit length of interconnect, so that the resistance $r_{e_v}$ and capacitance $c_{e_v}$ of edge $e_v$ are given by $\alpha |e_v|$ and $\beta |e_v|$, respectively. For each sink $s_i$ in the tree $T(S)$, there is a loading capacitance $c_{L_i}$ which is the input capacitance of the functional unit driven by $s_i$.
$^4$ The global routing phase of layout will typically consider the clock and power/ground nets for preferential assignment to dedicated routing layers. We assume that the interconnect delay parameters are the same on all metal routing layers, and we ignore via resistances. Thus, wirelength becomes a valid measure of the RC parameters of interconnections.

$^5$ As noted earlier, we will assume that $c_v = 0$ for each internal node in all of our examples and benchmarks.

We let $T_v$ denote the subtree of $T(S)$ rooted at $v$, and let $c_v$ denote the node capacitance of $v$.$^5$ The tree capacitance of $T_v$ is denoted by $C_v$ and equals the sum of capacitances in $T_v$. $C_v$ is calculated using the following recursive formula:
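A minimal sketch of such a recursion is given below. It assumes that $C_v$ is the node capacitance $c_v$ plus, for each child $w$ of $v$, the wire capacitance of $e_w$ and the tree capacitance $C_w$, which follows from the definition of $C_v$ as the sum of capacitances in $T_v$. The delay routine uses the usual first-order form in which each edge $e_w$ contributes $r_{e_w}(c_{e_w}/2 + C_w)$; this form, the names `alpha` and `beta` for the per-unit-length resistance and capacitance, and the dictionary encoding of the tree are assumptions of this example.

```python
from typing import Dict, List, Optional

Tree = Dict[str, List[str]]


def tree_capacitance(v: str, children: Tree, edge_len: Dict[str, float],
                     node_cap: Dict[str, float], beta: float,
                     C: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Fill C[v] = c_v + sum over children w of (beta * |e_w| + C[w])."""
    if C is None:
        C = {}
    total = node_cap.get(v, 0.0)             # c_v, taken as 0 when not listed
    for w in children.get(v, []):
        tree_capacitance(w, children, edge_len, node_cap, beta, C)
        total += beta * edge_len[w] + C[w]   # wire capacitance of e_w plus C_w
    C[v] = total
    return C


def elmore_delays(root: str, children: Tree, edge_len: Dict[str, float],
                  C: Dict[str, float], alpha: float, beta: float) -> Dict[str, float]:
    """Delay from root to each leaf, adding r_e * (c_e / 2 + C_w) per edge e_w."""
    out: Dict[str, float] = {}
    stack = [(root, 0.0)]
    while stack:
        v, t = stack.pop()
        kids = children.get(v, [])
        if not kids:
            out[v] = t
        for w in kids:
            r, c = alpha * edge_len[w], beta * edge_len[w]
            stack.append((w, t + r * (c / 2.0 + C[w])))
    return out


# Hypothetical instance: two sinks with loading capacitance 1, unit alpha and beta.
children = {"v": ["s1", "s2"]}
edge_len = {"s1": 2.0, "s2": 2.0}
node_cap = {"s1": 1.0, "s2": 1.0}            # sink node capacitance = loading capacitance
C = tree_capacitance("v", children, edge_len, node_cap, beta=1.0)
delays = elmore_delays("v", children, edge_len, C, alpha=1.0, beta=1.0)
print(C["v"], delays)
```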
Main Results
This section presents our new unified approach to constructing a ZST over a given set of sinks $S$. At a high level, we divide the construction of the ZST into: (i) generation of a connection topology, and (ii) embedding of that connection topology in the Manhattan plane. Our discussion begins with the Deferred-Merge Embedding (DME) algorithm, which computes a wire-efficient embedding of a given topology. Next, we describe the application of the DME algorithm to both the linear and Elmore delay models. We then present a new top-down balanced bipartition (BB) algorithm that creates a good connection topology, leading to the unified BB+DME algorithm.
The Deferred-Merge Embedding (DME) Algorithm
The Deferred-Merge Embedding (DME) algorithm embeds internal nodes of the topology $G$ via a two-phase process. A bottom-up phase constructs a tree of line segments which represent loci of possible placements of the internal nodes in the ZST. A top-down phase then resolves the exact locations of all internal nodes in $T$. In the discussion that follows, the distance between two points $p$ and $q$ is assumed to be the Manhattan distance $d(p, q)$, and the distance between two sets of points $P$ and $Q$, written $d(P, Q)$, is given by $\min\{d(p, q) \mid p \in P \text{ and } q \in Q\}$.
Bottom-Up Phase: The Tree of Merging Segments
For prescribed sink locations $S$ and connection topology $G$, we construct a tree of merging segments. The basic idea is as follows. Each node $v$ in $G$ is associated with a merging segment which represents a set of possible placements of $v$. The merging segment of a node depends on the merging segments of its two children, so the connection topology must be processed in a bottom-up order. In building the tree of merging segments, we also assign a length to each edge in $G$; this length is retained in the final embedding of $G$ as a ZST.
Let $a$ and $b$ be the children of node $v$ in $G$. We use $TS_a$ and $TS_b$ to denote the subtrees of merging segments rooted at $a$ and $b$, respectively. We are interested in placements of $v$ which allow $TS_a$ and $TS_b$ to be merged with minimum added wire while preserving zero skew. Define the merging cost between $TS_a$ and $TS_b$ to be $|e_a| + |e_b|$, where $|e_a|$ and $|e_b|$ denote the lengths to be assigned to edges $e_a$ and $e_b$. These lengths are chosen to minimize merging cost while balancing delays at $pl(v)$. Because delay is a monotone increasing function of wirelength, there is a unique optimal assignment of lengths to $e_a$ and $e_b$.$^6$

We now develop more precisely the construction of the tree of merging segments. A Manhattan arc is defined to be a line segment, possibly of zero length, with slope +1 or -1; in other words, a Manhattan arc is a line segment tilted at 45 degrees from the wiring directions. The collection of points within a fixed distance of a Manhattan arc is called a tilted rectangular region, or TRR, whose boundary is composed of Manhattan arcs (see Figure 1). The Manhattan arc at the center of the TRR is called its core. The radius of a TRR is the distance between its core and its boundary. The merging segment of node $v$, $ms(v)$, is defined recursively as follows: if $v$ is a sink $s_i$, then $ms(v) = \{s_i\}$. If $v$ is an internal node, then $ms(v)$ is the set of all placements $pl(v)$ which allow minimum merging cost, that is to say, all points within distance $|e_a|$ of $ms(a)$ and within distance $|e_b|$ of $ms(b)$. If $ms(a)$ and $ms(b)$ are both Manhattan arcs, then we obtain the merging segment $ms(v)$ by intersecting two TRRs, $trr_a$ with core $ms(a)$ and radius $|e_a|$, and $trr_b$ with core $ms(b)$ and radius $|e_b|$; i.e., $ms(v) = trr_a \cap trr_b$. The merging cost at $v$ has an obvious lower bound of $\mathcal{K} = d(ms(a), ms(b))$. If the merging cost is greater than $\mathcal{K}$ (i.e., more wirelength is needed to balance the delays), then one edge length will equal zero and the other will equal the merging cost. Figure 2 illustrates the algorithm for the case where the merging cost is equal to $\mathcal{K}$, and Figure 3 illustrates the algorithm for the case where the merging cost is greater than $\mathcal{K}$. An entire tree of merging segments is illustrated by Figure 4. The leaves of the tree of segments are all single points representing the sink locations $s_1, \ldots, s_8$, and the internal nodes are Manhattan arcs.

$^6$ The uniqueness is shown as follows. Suppose the minimum merging cost is $c$. Define a function $f(|e_a|)$ to be the path delay from $v$ to sinks in $TS_a$ for edge length $|e_a|$; similarly define $g(|e_b|)$ for the path delay from $v$ to sinks in $TS_b$. Define $g'(|e_a|) = g(c - |e_a|)$. A length assignment to $e_a$ must satisfy $f(|e_a|) = g'(|e_a|)$, or alternatively, $(f - g')(|e_a|) = 0$. If both $f$ and $g$ are monotone increasing functions, then $g'$ is monotone decreasing and $f - g'$ is monotone increasing. Thus $(f - g')(|e_a|) = 0$ will have at most one solution.
We prove that all merging segments are Manhattan arcs using induction and the following lemma. Proofs of all lemmas are given in the Appendix.

Lemma 1: The intersection of two TRRs, $R_1$ and $R_2$, is also a TRR and can be found in constant time. If $radius(R_1) + radius(R_2) = d(core(R_1), core(R_2))$, then the TRR $R_1 \cap R_2$ is also a Manhattan arc.
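One way to see why the intersection in Lemma 1 takes constant time is to rotate coordinates by 45 degrees. The sketch below assumes the change of variables $u = x + y$, $v = x - y$, under which Manhattan distance becomes $\max(|\Delta u|, |\Delta v|)$, a Manhattan arc becomes an axis-parallel segment, and a TRR becomes an axis-aligned box. The `TRR` class and the helper names are hypothetical and chosen only for this illustration; this is not the procedure given in the Appendix.

```python
from typing import NamedTuple, Optional, Tuple

Point = Tuple[float, float]


class TRR(NamedTuple):
    """A tilted rectangular region stored as a box in rotated coordinates."""
    u_lo: float
    u_hi: float
    v_lo: float
    v_hi: float


def rotate(p: Point) -> Point:
    return (p[0] + p[1], p[0] - p[1])


def unrotate(q: Point) -> Point:
    return ((q[0] + q[1]) / 2.0, (q[0] - q[1]) / 2.0)


def trr(core_a: Point, core_b: Point, radius: float) -> TRR:
    """TRR whose core is the Manhattan arc from core_a to core_b."""
    (ua, va), (ub, vb) = rotate(core_a), rotate(core_b)
    return TRR(min(ua, ub) - radius, max(ua, ub) + radius,
               min(va, vb) - radius, max(va, vb) + radius)


def intersect(r1: TRR, r2: TRR) -> Optional[TRR]:
    """Constant-time intersection; None if the two regions are disjoint."""
    box = TRR(max(r1.u_lo, r2.u_lo), min(r1.u_hi, r2.u_hi),
              max(r1.v_lo, r2.v_lo), min(r1.v_hi, r2.v_hi))
    if box.u_lo > box.u_hi or box.v_lo > box.v_hi:
        return None
    return box


# Two sinks at Manhattan distance 4, each grown by radius 2. The radii sum to
# the distance between the cores, so the intersection collapses to an arc.
ms = intersect(trr((0, 0), (0, 0), 2), trr((2, 2), (2, 2), 2))
ends = (unrotate((ms.u_lo, ms.v_lo)), unrotate((ms.u_lo, ms.v_hi)))
print(ms, ends)
```

In this example the box has zero width in $u$, so it is the Manhattan arc of slope -1 joining (0, 2) and (2, 0), which is the merging segment of the two sinks under linear delay.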
Top-Down Phase: Embedding of Nodes
Once the tree of segments has been constructed, the exact embeddings of internal nodes in the ZST are chosen in a top-down manner. For node $v$ in topology $G$, (i) if $v$ is the root node, then select any point in $ms(v)$ to be $pl(v)$;$^7$ or (ii) if $v$ is an internal node other than the root, choose $pl(v)$ to be any point in $ms(v)$ that is at distance $|e_v|$ or less from the placement of $v$'s parent $p$ (because the merging segment $ms(p)$ was constructed such that $d(ms(v), ms(p)) \le |e_v|$, there must exist some choice of $pl(v)$ satisfying this condition). In case (ii), the algorithm first creates a square TRR $trr_p$ with radius $|e_v|$ and core equal to $\{pl(p)\}$; then, $pl(v)$ can be any point from $ms(v) \cap trr_p$ (see Figure 6). For the tree of merging segments in Figure 4, the resulting placements are indicated by the points at which the segments are connected by dotted lines. Figure 7 describes the procedure Find Exact Placements, which uses the tree of merging segments to determine the final embedding of nodes in the ZST.
The time complexity of DME is analyzed as follows. Because each instruction in Find Exact Placements is executed at most once for each node in $G$, and the intersection of TRRs $ms(v)$ and $trr_p$ can be found in constant time by Lemma 1, Find Exact Placements runs in time linear in the size of $S$. Because procedure Build Tree of Segments also runs in linear time, DME as a whole is a linear-time algorithm.
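Case (ii) of the top-down phase can be sketched as follows. The example assumes the merging segment is stored as a box in rotated coordinates $u = x + y$, $v = x - y$, where Manhattan distance becomes the max-norm; clamping the parent's rotated position to that box then yields a point of $ms(v)$ nearest to $pl(p)$, which is one valid choice from $ms(v) \cap trr_p$. The function name `place_child` and the box encoding are hypothetical, and this is not the procedure of Figure 7.

```python
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]   # (u_lo, u_hi, v_lo, v_hi), rotated coords


def rotate(p: Point) -> Point:
    return (p[0] + p[1], p[0] - p[1])


def unrotate(q: Point) -> Point:
    return ((q[0] + q[1]) / 2.0, (q[0] - q[1]) / 2.0)


def manhattan(p: Point, q: Point) -> float:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def place_child(ms_v: Box, parent_pl: Point, edge_len: float) -> Point:
    """Pick pl(v) in ms(v) within Manhattan distance |e_v| of pl(p)."""
    u, v = rotate(parent_pl)
    u_lo, u_hi, v_lo, v_hi = ms_v
    # Clamping each rotated coordinate gives a nearest point of the box in the
    # max-norm, i.e. a nearest point of ms(v) to the parent in Manhattan distance.
    q = (min(max(u, u_lo), u_hi), min(max(v, v_lo), v_hi))
    pl_v = unrotate(q)
    if manhattan(pl_v, parent_pl) > edge_len + 1e-9:
        raise ValueError("ms(v) and trr_p do not intersect")
    return pl_v


# ms(v) is the arc from (0, 2) to (2, 0); the parent sits at (3, 1), |e_v| = 2.
print(place_child((2, 2, -2, 2), (3, 1), 2))
```

Any other point of the intersection would serve equally well; taking the nearest one is simply a convenient deterministic choice.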
3.2 Application of DME to Linear Delay
Calculating Edge Lengths
Calculating the edge lengths $|e_a|$ and $|e_b|$ is straightforward in the linear delay model. Let $a$ and $b$ be children of $v$ with merging segments $ms(a)$ and $ms(b)$, and let $t_{LD}(a)$ and $t_{LD}(b)$ be the delays from $a$ and $b$ to the sinks in their respective subtrees. Then, zero skew at $v$ requires that
$$t_{LD}(a) + |e_a| = t_{LD}(b) + |e_b|.$$
Again, let $\mathcal{K} = d(ms(a), ms(b))$. If $|t_{LD}(a) - t_{LD}(b)| \le \mathcal{K}$, then the merging cost is minimized with $|e_a| + |e_b| = \mathcal{K}$, i.e.,
$$|e_a| = \frac{\mathcal{K} + t_{LD}(b) - t_{LD}(a)}{2} \quad \text{and} \quad |e_b| = \mathcal{K} - |e_a|.$$
On the other hand, if $|t_{LD}(a) - t_{LD}(b)| > \mathcal{K}$, then the merging cost is minimized when one of the edge lengths is equal to zero. It is easy to see that if $t_{LD}(a) > t_{LD}(b)$, then $|e_a| = 0$ and $|e_b| = t_{LD}(a) - t_{LD}(b)$; similarly, if $t_{LD}(a) < t_{LD}(b)$ then $|e_b| = 0$ and $|e_a| = t_{LD}(b) - t_{LD}(a)$.
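The two cases above translate directly into a short routine. The function name `linear_merge_lengths` and its argument names are assumptions of this sketch: `k` stands for the distance $d(ms(a), ms(b))$, and `t_a`, `t_b` for the linear delays from $a$ and $b$ to the sinks in their subtrees.

```python
from typing import Tuple


def linear_merge_lengths(k: float, t_a: float, t_b: float) -> Tuple[float, float]:
    """Return (|e_a|, |e_b|) with t_a + |e_a| == t_b + |e_b| at minimum cost."""
    if abs(t_a - t_b) <= k:
        # The delays can be balanced using exactly k units of wire.
        e_a = (k + t_b - t_a) / 2.0
        return e_a, k - e_a
    if t_a > t_b:
        # Subtree a is slower: tap at a and lengthen the edge to b.
        return 0.0, t_a - t_b
    return t_b - t_a, 0.0


print(linear_merge_lengths(4.0, 0.0, 0.0))   # balanced subtrees
print(linear_merge_lengths(4.0, 3.0, 1.0))   # a is slower, but within k
print(linear_merge_lengths(2.0, 5.0, 0.0))   # detour needed on edge e_b
```

In the third call the merging cost is 5, which exceeds the lower bound of 2: the extra wire is the price of balancing the delays.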
Optimality of DME for Linear Delay
The following theorem states that the DME algorithm is optimal in the linear delay regime.
Theorem 1: Given a set of sink locations $S$ and a connection topology $G$, the DME algorithm produces a ZST $T$ with minimum cost over all ZSTs for $S$ having topology $G$.
The proof of Theorem 1 relies on Lemmas 2 and 3. Lemma 2 asserts that for any node $v$ in an optimal ZST, $pl(v)$ is in $ms(v)$ and must therefore satisfy the constraints imposed in the bottom-up phase of the algorithm. Lemma 3 implies that the placements of two sibling nodes correspond to a closest pair of points in their respective merging segments. Together, Lemmas 2 and 3 can be used to show that placements in an optimal ZST must satisfy the top-down phase of the algorithm. Let $t_{LD}(T, x)$ denote the delay in ZST $T$ between a point $x$ in $T$ and each sink which has $x$ on its source-sink path.

Lemma 2: Given a ZST $T$ with topology $G$, let $v$ be an internal node with children $a$ and $b$. Suppose the subtrees of $T$ rooted at $a$ and $b$ can be generated by the DME algorithm for some placement of $v$ on $ms(v)$, and also suppose that $q = pl(T, v) \notin ms(v)$. Then a new ZST $T'$ with the same topology can be constructed from $T$ by moving the placement of $v$ so that the following hold: (i) $q' = pl(T', v) \in ms(v)$; (ii) $cost(T') < cost(T)$; and (iii) $t_{LD}(T, q) = t_{LD}(T', q)$.

Lemma 2 is illustrated in Figure 8. The construction of $T'$ from $T$ reduces the tree cost by modifying the $q$-$a$ and $q$-$b$ connections so that they share wire on the segment from $q$ to $q'$.
Lemma 3: Suppose that $a$ and $b$ are two sibling nodes in ZST $T$ with parent $v$, and suppose that the subtrees of $T$ rooted at $a$ and $b$ can be generated using the DME algorithm. [...]

Proof of Theorem 1: The proof is by contradiction. The DME algorithm places only two constraints on the placement of a node $v$ in $G$: (i) $pl(v) \in ms(v)$ and (ii) $d(pl(v), pl(p)) \le L_v$, where $p$ is the parent of $v$ and $L_v$ is the edge length assigned by DME to $e_v$. Condition (i) arises by the construction in the top-down phase of DME, and condition (ii) is required by the bottom-up phase of DME. Suppose ZST $T$ has minimum cost for point set $S$ and topology $G$, but contains a node placement violating one of the two conditions. Let $v$ be a node with greatest depth in $T$ that violates either condition, and let $w$ be the sibling of $v$. Because $v$ has maximum depth, all of the descendants of $v$ and $w$ can be produced using DME. Consequently, because $T$ has minimum cost, Lemma 2 implies that $pl(T, v)$ must be in $ms(v)$ and $pl(T, w)$ must be in $ms(w)$. Thus, $v$ does not violate condition (i).
Consequently, $v$ must violate (ii), i.e., $d(pl(T, v), pl(T, p)) > L_v$. Let $L(T, e_v)$ denote the length of edge $e_v$ in $T$. Because the length of an edge must be at least the distance between its endpoints, $L(T, e_v) > L_v$. Suppose $d(pl(T, v), pl(T, w)) \le d(ms(v), ms(w))$. Then the subtrees of $T$ rooted at $v$ and $w$ can be generated by DME for some placement of $p$ on $ms(p)$, and by Lemma 2, $cost(T)$ can be improved by moving $p$ to its merging segment and setting $L(T', e_v) = L_v$ and $L(T', e_w) = L_w$. If $d(pl(T, v), pl(T, w)) \le |t_{LD}(v) - t_{LD}(w)|$, then $cost(T)$ can be reduced by moving $pl(p)$ to $pl(v)$ if $L_v = 0$, or to $pl(w)$ if $L_w = 0$. Hence, we must have $d(pl(T, v), pl(T, w)) > d(ms(v), ms(w))$, and $d(pl(T, v), pl(T, w)) > |t_{LD}(v) - t_{LD}(w)|$. Then by Lemma 3 $cost(T)$ can be decreased, contradicting the assumption that $T$ has minimum cost.
It can be proved that in the linear model, DME also minimizes the source-sink delay in a ZST, and that this delay is equal to one-half the diameter of the sink set $S$. A proof of this result is contained in [3].
The DME algorithm is also optimal for any topology in the variant of the ZST problem where the source location is pre-defined. Suppose that $ms(s_0)$ is the merging segment for the root node $s_0$ of topology $G$ and that $s'_0$ is the prescribed source location. The DME algorithm can be modified at the beginning of the procedure Find Exact Placements to connect $s'_0$ with the closest point in $ms(s_0)$. This point becomes $pl(s_0)$. Lemmas 2 and 3 can be used to prove the optimality of this method: they state that any tree rooted at a location $q \notin ms(s_0)$ will have minimum cost only if the two subtrees of $G$ directly below the root are merged at a point $q'$ [...]
To calculate the edge lengths needed to merge two trees of merging segments $TS_a$ and $TS_b$ with minimum merging cost in the Elmore model, we use the analysis of Tsay [25]. Let $TS_a$ and $TS_b$ respectively have capacitance $C_1$ and $C_2$ and delay $t_1 = t_{ED}(a)$ and $t_2 = t_{ED}(b)$, and let $pl(v)$ be a merging point with minimum merging cost. [...]

The above analysis shows that a zero skew merging point between two ZSTs can always be found. The merging cost depends on the distance between the two roots of the ZSTs, the delay of each ZST, and the tree capacitance of each ZST. Intuitively, to minimize the merging cost we should therefore choose topologies such that merged subtrees have minimum distance between their roots, along with similar capacitances and delays, so as to avoid the extra cost $\mathcal{K}' - \mathcal{K}$. This motivates our new BB algorithm, which uses the geometric notion of a balanced bipartition for computing a topology. Before describing this algorithm in Section 3.4 below, we observe that the DME algorithm is not optimal for all topologies in the Elmore delay approximation model.

Figure 10: ZST $T$, which would be constructed by the DME algorithm with sub-optimal cost for its topology. Note that the tree is not drawn to scale; lengths of horizontal and vertical segments are as indicated.
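The zero skew merge under Elmore delay can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the analysis of [25]: each new edge of length $x$ is taken to add delay $\alpha x(\beta x/2 + C)$, where $C$ is the tree capacitance below it, and the two sides are balanced by solving $t_1 + \alpha|e_a|(\beta|e_a|/2 + C_1) = t_2 + \alpha|e_b|(\beta|e_b|/2 + C_2)$ with $|e_a| + |e_b|$ equal to the distance between the two merging segments whenever that is feasible. The names `dist`, `alpha`, `beta`, and both function names are hypothetical.

```python
import math
from typing import Tuple


def detour_length(dt: float, C: float, alpha: float, beta: float) -> float:
    """Length x > 0 solving alpha * x * (beta * x / 2 + C) = dt for dt > 0."""
    return (math.sqrt((alpha * C) ** 2 + 2.0 * alpha * beta * dt) - alpha * C) / (alpha * beta)


def elmore_merge_lengths(dist: float, t1: float, t2: float, C1: float, C2: float,
                         alpha: float, beta: float) -> Tuple[float, float]:
    """Return (|e_a|, |e_b|) that equalize the Elmore delay seen from the merge point."""
    num = t2 - t1 + alpha * dist * (C2 + beta * dist / 2.0)
    den = alpha * dist * (beta * dist + C1 + C2)
    if 0.0 <= num <= den:
        if den == 0.0:                     # coincident roots with equal delays
            return 0.0, 0.0
        e_a = dist * num / den             # balance point on the direct connection
        return e_a, dist - e_a
    if num < 0.0:
        # Subtree a is too slow: tap at its root and elongate the wire to b.
        return 0.0, detour_length(t1 - t2, C2, alpha, beta)
    return detour_length(t2 - t1, C1, alpha, beta), 0.0


def delay_through(length: float, t: float, C: float, alpha: float, beta: float) -> float:
    """Delay from the merge point through an edge of the given length to the sinks."""
    return t + alpha * length * (beta * length / 2.0 + C)


e_a, e_b = elmore_merge_lengths(10.0, 1.0, 3.0, 2.0, 5.0, alpha=0.5, beta=0.2)
print(e_a, e_b)
print(delay_through(e_a, 1.0, 2.0, 0.5, 0.2), delay_through(e_b, 3.0, 5.0, 0.5, 0.2))
```

When the balance point falls outside the direct connection, one edge length is zero and the other exceeds `dist`, which is the situation in which extra wire is needed to preserve zero skew.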
Suboptimality of DME for Elmore Delay
Recall that in the linear delay regime, the DME algorithm produces an optimum (minimum wirelength) ZST for any given topology. Our experimental results in Section 4 clearly show the effectiveness of the DME algorithm in the Elmore delay model, and indeed we believe that in practice the algorithm gives solutions that are very close to optimum. However, the ZSTs $T$ in Figure 10 and $T'$ in Figure 11 demonstrate that, for some sink sets and topologies, DME will not be optimal for Elmore delay. $T$ and $T'$ connect terminal points $s_1, \ldots, s_6$ to source $s_0$. Both trees are assumed to extend to the right side of $s_0$, with their subtrees on the right of $s_0$ being mirror images of the subtrees to the left of $s_0$ (this ensures that the source will be at $s_0$ in the optimal tree). In this example, we set both the unit resistance and unit capacitance to one, and the loading capacitance $c_{L_s}$ of each sink node $s$ to zero.$^8$

The ZST $T'$ in Figure 11 was constructed so that if points $s_1$ and $s_2$ are merged at point $p'_1$, then vertical wires from points $s_3$ through $s_6$ will merge along the horizontal wire from $s_1$ to $s_0$ with exactly zero skew. If, however, $s_1$ and $s_2$ are merged on their merging segment as shown in the tree $T$ of Figure 10, the delay at $p'_1$ will increase, and jogs will be required in the edges $e_{s_3}$ through $e_{s_6}$. In this example, the four required jogs are each of length greater than 0.3. Thus, their sum is greater than 1, which was the amount of wire saved initially by merging $s_1$ and $s_2$ at $p_0$.

Figure 11: [...] Figure 10, but which violates the DME algorithm. In $T'$, the internal nodes placed at $p_0$ and $p_1$ in $T$ are placed at the same point, $p'_1$. The tree is not drawn to scale; lengths of horizontal and vertical segments are as indicated.

Table 1: [...]

$^8$ Because unit resistance and capacitance both equal one, and because loading capacitances at the leaves are zero, the tree capacitance of each node equals the amount of wire in its subtree.

Thus, we see in Table 1 that $cost(T')$ is less than $cost(T)$ by 0.44.
Topology Generation
It is easy to see that, as hinted by the examples of Figures 10 and 11, the choice of topology will affect the success of the DME embedding. We now present a new heuristic for generating connection topologies.$^9$ The heuristic works in top-down fashion, dividing the sink nodes recursively into two partitions with nearly equal total loading capacitance. We call this heuristic the Balanced Bipartition (BB) method. The BB method offers a more powerful top-down partitioning scheme than the previous approaches of Jackson et al. [17] and Tsay [25], which divide the sink set recursively, using only alternating horizontal and vertical cuts.
For our description of the BB method, we introduce the following notation. Denote the diameter of $S$ by $dia(S) = \max\{d(p, q) \mid p, q \in S\}$ and the number of sinks in $S$ by $|S|$. Since the cost of any routing tree of $S$ is greater than $dia(S)$ and less than $|S| \cdot dia(S)/2$, we consider $dia(S)$ to be a heuristic approximation of the cost of any ZST $T(S)$. Recall also that imbalanced loading capacitance may lead to excess edge length in the DME construction; we call a bipartition of a set of sinks $S$ into two subsets $S_1$ and $S_2$ a balanced bipartition if the difference between the total loading capacitances of the two subsets is at most $\max\{c_{L_i}\}$.$^{10}$ Intuitively, we would like to find a balanced bipartition which divides set $S$ with minimum partition cost, given by $dia(S_1) + dia(S_2)$. This is the idea behind the BB heuristic. In the Euclidean metric, the problem of constructing a balanced bipartition which minimizes the sum of diameters can be solved in O(n [...] $\{p.y - p.x\}$; [...] Figure 12(a). Note that the octagon set construction naturally captures those parameters of the sink set which are relevant to diameter computations in the Manhattan plane. Based on extensive experimental investigations, we have found that each of the sets $S_1$ and $S_2$ in a balanced bipartition of $S$ is likely to consist of consecutive elements in $Oct(S)$.

$^9$ No NP-completeness result has been obtained for our general minimum-cost zero skew clock tree formulation (i.e., where the topology has not been prescribed). However, [18, 9] showed that a closely related problem in the linear delay model, the "bounded-skew pathlength-balanced tree problem", is trivially NP-complete since it reduces to the minimum rectilinear Steiner tree problem when the allowed pathlength skew is infinite. Thus, heuristics for computing promising topologies are of interest.

$^{10}$ For the linear delay model, we use uniform loading capacitances in the input to the BB algorithm, because delay depends only on the edge lengths.
Based on this observation, a balanced bipartition heuristic is as follows.

3. For each sink $p \in S$, compute the weight of $p$, equal to $\min_{r \in REF_i} d(p, r) + \max_{r \in REF_i} d(p, r)$.
4. Sort the sinks in ascending order of weight, then add sinks according to this order to $S_1$ until the difference between the sum of capacitances in $S_1$ and one half the total capacitance is minimized.
5. The remaining sinks are placed in $S_2$, and the partition cost $\mathrm{dia}(S_1) + \mathrm{dia}(S_2)$ is obtained.

[...] Figure 12(c); etc. After all six reference sets have been evaluated, we find that the optimal reference set is $REF_2$ with cost 270. Figure 12(d) shows the output of the BB+DME algorithm on the instance of Figure 12(a) (footnote 11).

The time complexity of the BB algorithm is affected by characteristics of the sink set $S$. The number of times that the loop over steps 3-5 must be repeated is given by $|\mathrm{Oct}(S)|$, the number of reference sets.
In the worst case this value is $n$, but in practice it is usually bounded by a constant. Because BB is recursive, its complexity is also affected by the relative sizes of the bipartitions. In the worst case, when loading capacitances are very unbalanced, we can have $|S_1| = 1$ and $|S_2| = |S| - 1$.
Steps 3 and 4 dominate all others in the complexity of BB and are repeated for each reference set. The diameters in step 5 can be calculated in linear time in the Manhattan metric.
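Steps 3-5 for a single reference set can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' C implementation: the function names, the representation of sinks as coordinate pairs, and the rule that both subsets must be non-empty are all hypothetical choices made for the example. The diameter helper uses the standard rotation $u = x + y$, $v = x - y$, under which Manhattan distance becomes a maximum of coordinate differences, so the diameter is obtained in a single linear pass.

```python
from math import inf
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def manhattan(p: Point, q: Point) -> float:
    """Manhattan (rectilinear) distance d(p, q)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def manhattan_diameter(points: Sequence[Point]) -> float:
    """dia(S) in linear time: rotate to u = x + y, v = x - y and take
    the larger of the two coordinate ranges."""
    if len(points) < 2:
        return 0
    u = [x + y for x, y in points]
    v = [x - y for x, y in points]
    return max(max(u) - min(u), max(v) - min(v))


def bipartition_for_reference_set(
    sinks: Sequence[Point],
    caps: Sequence[float],
    ref: Sequence[Point],
) -> Tuple[List[Point], List[Point], float]:
    """Steps 3-5 for one reference set `ref` (needs at least two sinks)."""
    # Step 3: weight(p) = min_r d(p, r) + max_r d(p, r) over r in ref.
    weights = [
        min(manhattan(p, r) for r in ref) + max(manhattan(p, r) for r in ref)
        for p in sinks
    ]
    order = sorted(range(len(sinks)), key=weights.__getitem__)

    # Step 4: pick the prefix whose capacitance is closest to half the total.
    # Assumption: the prefix length k is restricted to 1..n-1 so that
    # neither subset is empty.
    half = sum(caps) / 2
    best_k, best_gap, running = 1, inf, 0
    for k, idx in enumerate(order[:-1], start=1):
        running += caps[idx]
        gap = abs(running - half)
        if gap < best_gap:
            best_k, best_gap = k, gap

    # Step 5: the rest goes to S2; cost is the sum of the two diameters.
    s1 = [sinks[i] for i in order[:best_k]]
    s2 = [sinks[i] for i in order[best_k:]]
    return s1, s2, manhattan_diameter(s1) + manhattan_diameter(s2)
```

The full heuristic would call this routine once per reference set and keep the bipartition of minimum cost; how the reference sets are derived from the octagon set is not reproduced here.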
Step 4 requires $O(n \log n)$ operations each time it is run, while step 3 requires $O(n\,|\mathrm{Oct}(S)|)$ time. If $|\mathrm{Oct}(S)| = n$, then the total time used in step 3 for a single bipartition can be reduced from $O(n^3)$ to $O(n^2 \log n)$ by using a priority queue such as a Fibonacci heap (footnote 12). In the very worst case, we can have $|\mathrm{Oct}(S)| = n$ and pathologically unbalanced loading capacitances; each bipartition will require $O(n^2 \log n)$ time and the total time complexity of BB will be $O(n^3 \log n)$. If $|\mathrm{Oct}(S)| = O(1)$ but loading capacitances are still unbalanced, the time complexity will be $O(n^2 \log n)$. The time complexity is reduced when we impose very reasonable constraints on the loading capacitances, e.g., that the largest and smallest capacitances can differ by at most a constant factor, or simply that the cardinalities of the partitions differ by at most a constant factor. If the loading capacitances are "balanced" [...]

Footnote 11: For the Elmore delay model, we observe that the DME algorithm is not always optimal for topologies generated by balanced bipartitioning. To see this, we modify the counter-example of Section 3.2 as follows. Let the loading capacitance of each sink be a small fixed positive value. Suppose that there are 16 sink nodes near point $s_6$, within a very small positive radius of each other. Similarly, suppose there are 8 sink nodes at point $s_5$, 4 at $s_4$, 2 at $s_3$, and 1 at both $s_1$ and $s_2$. Then the BB algorithm will generate the topology of Figure 10.

Footnote 12: The priority queue, however, will increase the worst-case space requirements from $O(n)$ to $O(n^2)$.
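The two worst-case bounds quoted for unbalanced capacitances can be checked with a short recurrence. The derivation below is a sketch: $T(n)$ denotes the running time of BB on $n$ sinks, and $c$ is an unspecified constant, both introduced here only for illustration. It assumes the most lopsided split, $|S_1| = 1$ and $|S_2| = n - 1$, at every level of the recursion.

```latex
% Case |Oct(S)| = n: each bipartition costs O(n^2 log n).
T(n) = T(n-1) + c\,n^{2}\log n
\;\Longrightarrow\;
T(n) \le c\,\log n \sum_{k=1}^{n} k^{2} = O(n^{3}\log n).

% Case |Oct(S)| = O(1): each bipartition is dominated by the sort in step 4.
T(n) = T(n-1) + c\,n\log n
\;\Longrightarrow\;
T(n) \le c\,\log n \sum_{k=1}^{n} k = O(n^{2}\log n).
```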
Experimental Results
The BB and DME algorithms were implemented on Sun SPARC workstations in the C/UNIX environment. The code can be obtained from the authors. We compared routing cost and source-sink delay of the BB+DME output with previous results of Jackson et al. Because the DME algorithm can be applied to any prescribed topology, we also applied it to topologies obtained in previous studies. In this way, we could separate the effects of DME from the effects of complementary heuristics for generation of clock tree topologies. We used two sets of benchmarks: (i) sink placements for the MCNC benchmarks Primary1 and Primary2 used in [17] and [18], and originally provided by the authors of [17] (Primary1 contains 269 sinks, and Primary2 contains 603 sinks); and (ii) sink placements for the five benchmark sets r1-r5 used in [25] (the sizes of these examples range from 267 to 3,101 sinks).
Table 2: Comparison of BB+DME with other algorithms in the linear delay model using MCNC benchmarks Primary1 and Primary2 and benchmarks r1 through r5 from Tsay.
Linear Delay Model
Our experimental results for linear delay are contained in Table 2. We compared BB+DME with the Method of Means and Medians (MMM) of Jackson et al. [17] and with the bottom-up, matching-based method of Kahng, Cong and Robins (KCR) [18]. In order to test the performance of the DME algorithm alone, we also ran DME on the topologies produced by the KCR algorithm. The combined BB+DME algorithm produced an average reduction in cost of 15% from the MMM results. We also obtained an 8% average cost reduction from the KCR algorithm. Note that in the linear model, DME also produces trees with optimal source-sink delay [3], and our experiments showed an average reduction of 19% from the KCR algorithm. The improvement in source-sink delay ranged from 9% for Primary1 to 23% for r3.
Table 3: Comparison of BB+DME with other algorithms in the Elmore delay model. Results for Tsay's algorithm were obtained from Dr. Ren-Song Tsay and were not available for the Primary1 and Primary2 benchmarks.
Elmore Delay Model
We tested the BB+DME algorithm for Elmore delay on the same benchmark sink sets. The results are contained in Table 3. Again, these results indicate a significant improvement by BB+DME over previous algorithms. The average reduction in wirelength was 14% over the MMM results, and 10% over the results of Tsay. It should be noted that DME alone resulted in an average improvement of only 2% over Tsay's algorithm. This can be attributed to the fact that Tsay's embedding algorithm allows deferral of the choice of placements for one level in the tree: the two endpoints of each merging segment are selected and carried to the next level, where the actual embedding is chosen to be the point which allows the minimum connection cost (footnote 13).
Our results also indicate a very significant reduction in source-sink delay in the Elmore model: the combination of KCR+DME reduced delay over the trees of Tsay by an average of 22%.
Footnote 13: A surprising outcome of our experiments was the strong performance of topologies generated by the KCR algorithm. The combination of KCR and DME actually outperformed BB+DME by an average of 2.5% on the seven benchmarks. We expected balanced topologies to be superior in the Elmore delay model, where the amount of load on each line affects delay, but our experimental results indicate that a bottom-up approach originally designed for the linear delay model can perform as well or better. However, we note that KCR uses such techniques as H-flipping and uncrossing of matching edges; the latter has exponential worst-case time complexity. Moreover, the minimum-diameter bipartitioning approach of BB is probably more useful when the distribution of sink locations is highly pathological.

To obtain a more complete picture of the BB+DME performance, we also tested the algorithm on sink sets with locations chosen randomly from a square grid, i.e., with coordinates $s_i.x,\ s_i.y \in [-2500, 2500]$. The size of the sink sets ranged from 8 to 64. In these experiments, we also compared our algorithm with minimum rectilinear Steiner trees (RSTs) constructed by the heuristic in [7]; the BB+DME tree cost was only 64% above the heuristic RST cost. Finally, we used the circuit simulator SPICE2G.6 [20] to evaluate clock skew in the ZSTs generated on the random sink sets. For both the MMM and BB+DME clock trees, SPICE decks were generated with the following specifications. The routing area was assumed to be 0.5 cm × 0.5 cm, and all the parameters were based on a 1.2 μm CMOS technology. An input clock frequency of 100 MHz and a superbuffer driven by the input clock source were assumed. The delays between the source and the sink nodes were measured at the output node of the inverter which drives the sink nodes. Table 4 shows the average maximum delays, minimum delays and clock skews for the sink sets of each size. The maximum delay of BB+DME was on average 3% less than that of MMM. The average skew of MMM was 9.2 picoseconds while that of BB+DME was only 0.5 picoseconds, a 93% reduction. Figure 13 shows the output of the BB+DME algorithm on an instance containing 64 sinks. The total routing length is 50445 μm and the source-sink delay is 0.91 ns. By contrast, the MMM algorithm yielded a tree with cost 59256 μm and delay 0.94 ns for this case.

Table 4: Mean delay time and clock skew for random sink sets (time unit = picosecond). The rightmost three columns display ratios between the results of BB+DME and MMM.
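For the 64-sink instance of Figure 13, the relative savings implied by the reported numbers can be worked out directly. The short Python sketch below does only that arithmetic; the helper name `percent_reduction` is hypothetical, and the four input values are the ones quoted in the text for BB+DME and MMM.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Relative saving of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Figure 13 instance: MMM versus BB+DME.
wire_saving = percent_reduction(59256, 50445)   # routing length in micrometers
delay_saving = percent_reduction(0.94, 0.91)    # source-sink delay in ns

print(f"wirelength: {wire_saving:.1f}%  delay: {delay_saving:.1f}%")
```

On this single instance the wirelength saving comes to about 14.9% and the delay saving to about 3.2%, in line with the averages reported for the benchmark sets.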
Conclusions and Directions for Future Work
Minimization of clock skew is critical to the design of high-performance VLSI systems. Recent research has yielded a number of heuristics which effectively eliminate skew according to either the Elmore or linear delay model. However, these previous methods concentrate on generation of the clock tree topology, and then embed the topology in the plane with little concern for the minimization of total wirelength.
Obviously, minimization of total wirelength will lead to reduction of wiring area, with the added effect of less blockage for subsequent routing phases of layout. We also note that clocking accounts for a large portion of system power requirements: wire minimization can significantly reduce the power needed to drive the clock signal, thus improving system feasibility and reliability. Finally, wirelength reduction will improve performance by lessening such effects as pulse narrowing, pulse deformation, etc. Given these considerations, our work gives a unified approach to clock tree construction which combines the topology-generating phase (BB) with the embedding phase (DME).

Figure 13: An example of a ZST produced by BB+DME for 64 randomly chosen sink nodes.
The balanced bipartition (BB) heuristic generates a connection topology by recursively dividing the set of sinks into two subsets with equal total loading capacitance while at the same time minimizing the sum of diameters of the two subsets. This balance condition is a novel aspect of the method, and is useful when delay depends on both pathlength and capacitance, as in the Elmore model. The partitioning strategy based on minimizing the sum of diameters improves upon previous top-down bisection strategies of Jackson et al. [17] and Tsay [25], which can only use horizontal or vertical cuts to partition the set of sinks.
The Deferred-Merge Embedding (DME) algorithm offers many improvements over previous embedding schemes. DME constructs a highly flexible tree of merging segments which allows a choice among minimum-cost zero skew clock trees. Given any connection topology over the set of sink locations, DME always produces a tree with exact zero skew, and may thus be applied to previously generated clock trees in order to improve both wirelength and delay. Experiments show that applying DME alone to the clock trees constructed by other algorithms results in wirelength reductions of 2% to 9%. The DME algorithm also extends to problem formulations where the clock source is prescribed. Finally, given the linear delay model, DME yields optimal total wirelength and optimal source-sink delay.
Our experimental results indicate that the BB+DME methodology yields routing solutions with exact zero skew (which we confirmed to be in the subpicosecond range using SPICE2G.6) and significantly reduced total wirelengths (8%-15% less than the best previous methods). Furthermore, the superiority of BB+DME over previous methods depends on the joint application of BB and DME. For instance, our improvement of approximately 8% over the matching-based method of Kahng et al. (KCR) [18] is directly attributable to the DME embedding, since DME applied to topologies generated by KCR yields clock tree cost very similar to that obtained using BB+DME. On the other hand, DME alone can achieve only 2% out of the 15% improvement of BB+DME over Tsay [25]. Thus 13% of the cost savings can be attributed to the BB topology.
There are many promising extensions to our current approach. The DME algorithm readily applies to problems of prescribed skew (i.e., "useful" skew [1]), where the arrival times of the clocking signal must differ by prescribed amounts. This is handled by setting initial delays at the sinks to non-zero values. The DME algorithm can also be used for problems with allowed skew 1 1 3 2 5 , where the signal must arrive at each sink within a prescribed segment of time.
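The effect of non-zero initial sink delays can be illustrated with the merging step under the linear delay model, where delay is taken to equal pathlength. The Python sketch below is a hedged illustration rather than the authors' code: the function name `merge_linear`, the unit delay per unit length, and the convention that any excess is absorbed by elongating one edge are assumptions made for the example.

```python
from typing import Tuple


def merge_linear(t_a: float, t_b: float, dist: float) -> Tuple[float, float, float]:
    """Balance two subtrees under linear delay (delay = pathlength).

    t_a, t_b : delays already accumulated below nodes a and b
               (for a sink, its prescribed initial delay)
    dist     : Manhattan distance between the two subtree roots
               (between their merging segments in DME)
    Returns (e_a, e_b, t_parent) with t_a + e_a == t_b + e_b == t_parent.
    """
    if abs(t_a - t_b) <= dist:
        # The balance point lies on a shortest path between the subtrees.
        e_a = (dist + t_b - t_a) / 2
        e_b = dist - e_a
    elif t_a > t_b:
        # Subtree a is too slow: connect at a and elongate the wire to b.
        e_a, e_b = 0.0, t_a - t_b
    else:
        e_a, e_b = t_b - t_a, 0.0
    return e_a, e_b, t_a + e_a


# Zero skew: both sinks start with delay 0 and get equal wire.
print(merge_linear(0, 0, 10))   # (5.0, 5.0, 5.0)
# Prescribed skew: sink b should receive the clock 4 units later than a,
# so a is given initial delay 4 and b initial delay 0.
print(merge_linear(4, 0, 10))   # (3.0, 7.0, 7.0)
```

In the second call the physical pathlengths are 3 and 7, so the signal reaches b four units after a, while the bookkeeping delays (initial delay plus pathlength) remain equal at the merge point.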
Finally, the general issue of topology generation remains an important area for further investigation. A promising approach is to run DME concurrently with matching-based and other bottom-up topology-generating heuristics. In general, the construction of optimal topologies appears to be very difficult (perhaps NP-hard). However, we expect future investigations in this area to have fruitful applications, for both clock tree construction and the broader area of high-performance routing.
Remarks and Acknowledgements
Through independent research, the two groups of authors came up with essentially identical approaches to constructing zero skew clock routing trees with minimum wirelength for a given tree topology. The major differences between the two treatments are: (i) Chao, Hsu and Ho apply DME to the Elmore delay model, while Boese and Kahng establish the theoretical results for DME with respect to both the linear and Elmore delay models; and (ii) Chao, Hsu and Ho proposed the top-down balanced bipartition technique to generate an initial clock tree topology, while Boese and Kahng assume arbitrary existing tree topologies, e.g., those derived from the KCR method [18, 9]. The work of Chao, Hsu and Ho [8] appeared at the 29th ACM/IEEE Design Automation Conference; the work of Boese and Kahng [3] appeared at the 5th IEEE International Conference on ASIC. The authors are grateful to Dr. Ren-Song Tsay for providing benchmark data and for his communications which made this collaboration possible.

[...] Since rotating each TRR by 45 degrees requires constant time, determining the intersection of the two TRRs $R_1 \cap R_2$ also requires only constant time.
If $\mathrm{radius}(R_1) + \mathrm{radius}(R_2) = d(\mathrm{core}(R_1), \mathrm{core}(R_2))$, then decreasing the radius of either $R_1$ or $R_2$ must cause their intersection to become empty; otherwise, we could form a path between $\mathrm{core}(R_1)$ and $\mathrm{core}(R_2)$ with length less than $d(\mathrm{core}(R_1), \mathrm{core}(R_2))$. Consequently, $R_1 \cap R_2$ must have zero width and be a line segment or a single point. Since $R_1 \cap R_2$ is also a TRR, it must be a Manhattan arc.

Define a straight-line path between two points $x$ and $y$ to be any minimum-length path between them using only vertical and horizontal lines. If $x$ and $y$ are not on the same horizontal or vertical line, then there will be an infinite number of straight-line paths between them. Define the projection area $PA(x, Q)$ from a point $x$ through a set of points $Q$ as the set of all points $p$ for which there exists a straight-line path from $x$ to $p$ that passes through $Q$; that is, $Q$ must be between $p$ and $x$. Figure 15 contains an example of the projection area from a point $x$ through a Manhattan arc $Q$.
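The constant-time TRR intersection argument can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not the authors' implementation: the names `Rect`, `rotate`, `unrotate`, `trr` and `intersect` are hypothetical, and a TRR is represented by its bounding intervals in the rotated coordinates $u = x + y$, $v = x - y$, in which a Manhattan arc is axis-parallel and a TRR is an ordinary rectangle.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Rect:
    """A TRR stored as an axis-parallel rectangle in rotated coordinates."""
    u_lo: float
    u_hi: float
    v_lo: float
    v_hi: float


def rotate(p: Point) -> Point:
    """45-degree rotation (with scaling): (x, y) -> (x + y, x - y)."""
    x, y = p
    return (x + y, x - y)


def unrotate(q: Point) -> Point:
    """Inverse of rotate: (u, v) -> ((u + v) / 2, (u - v) / 2)."""
    u, v = q
    return ((u + v) / 2, (u - v) / 2)


def trr(core_a: Point, core_b: Point, radius: float) -> Rect:
    """TRR whose core is the Manhattan arc from core_a to core_b
    (the endpoints may coincide, giving a single-point core)."""
    (ua, va), (ub, vb) = rotate(core_a), rotate(core_b)
    return Rect(min(ua, ub) - radius, max(ua, ub) + radius,
                min(va, vb) - radius, max(va, vb) + radius)


def intersect(r1: Rect, r2: Rect) -> Optional[Rect]:
    """Intersection of two TRRs in O(1); None if they are disjoint."""
    u_lo, u_hi = max(r1.u_lo, r2.u_lo), min(r1.u_hi, r2.u_hi)
    v_lo, v_hi = max(r1.v_lo, r2.v_lo), min(r1.v_hi, r2.v_hi)
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return Rect(u_lo, u_hi, v_lo, v_hi)


# Cores (0, 0) and (2, 2) are at distance 4; radii 2 + 2 = 4, so the
# intersection degenerates to a zero-width rectangle, i.e. a Manhattan arc.
arc = intersect(trr((0, 0), (0, 0), 2), trr((2, 2), (2, 2), 2))
print(unrotate((arc.u_lo, arc.v_lo)), unrotate((arc.u_hi, arc.v_hi)))
```

In this example the two endpoints recovered in the original coordinates are (0, 2) and (2, 0), the arc of points equidistant from both cores.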
The next lemma about projection areas will be used to prove Lemmas 2 and 3. It states that the union of two projection areas from points $p$ and $q$, respectively, through a merging segment $ms$ between them, is the entire plane.

Lemma 2: Given a ZST $T$ with topology $G$, let $v$ be an internal node with children $a$ and $b$. Suppose the subtrees of $T$ rooted at $a$ and $b$ can be generated by the DME algorithm for some placement of $v$ on $ms(v)$, and also suppose that $q = pl(T, v) \notin ms(v)$. Then a new ZST $T'$ with the same topology can be constructed from $T$ by moving the placement of $v$ so that the following hold: (i) $q' = pl(T', v) \in ms(v)$; (ii) $\mathrm{cost}(T') \le \mathrm{cost}(T)$; and (iii) $t_{LD}(T, q) = t_{LD}(T', q)$.
Proof: Consider Figure 8 of Section 3.2.2. Let $a$ and $b$ be the placements in $T$ of $v$'s children. By Lemma 4, there exists a point $q'$ on $ms(v)$ such that there is a straight-line path, either from $a$ to $q$ or from $b$ to $q$, that passes through $q'$. Without loss of generality, assume that this path is from $b$ to $q$. Because $b\,q'\,q$ is a straight-line path, segment $bq$ in $T$ can be replaced by segments $bq'$ and $q'q$ in $T'$ without changing the delay between $b$ and $q$, and leaving the delay at point $q$ unchanged. Moreover, the construction of $ms(v)$ ensures that zero skew is maintained by setting the edge $e_a$ equal to the segment $aq'$ and $pl(T', v) = q'$. Define $\mathrm{length}(T, xy)$ to be the edge length between points $x$ and $y$ in ZST $T$. Because the delay at $q$ remains unchanged in $T'$ and the $a$-$q$ and $b$-$q$ connections share wire between $q'$ and $q$ in $T'$, we must have $\mathrm{cost}(T') = \mathrm{cost}(T) - \mathrm{length}(T', q'q)$.

Lemma 3: Suppose that $a$ and $b$ are two sibling nodes in ZST $T$ with parent $v$, and suppose that the subtrees of $T$ rooted at $a$ and $b$ can be generated using the DME algorithm. If $d(a, b) > d(ms(a), ms(b))$ and $d(a, b) > |t_{LD}(a) - t_{LD}(b)|$, then a new ZST $T'$ can be constructed from the same topology, with $\mathrm{cost}(T') < \mathrm{cost}(T)$ and with $t_{LD}(T, q) = t_{LD}(T', q)$ for $q = pl(T, v)$.
Proof: See Figure 9 in Section 3.2.2. To prove the lemma, we will first construct a ZST $T_{new}$ with source at $q = pl(T, v)$, and then replace the subtree of $T$ rooted at $v$ with part of $T_{new}$ to create $T'$. Using Theorem 2 of [3], we show that the connections $a$-$q$ and $b$-$q$ share wire on a partial edge $e_{q'}$ in $T'$, whereas they do not share wire in $T$. Because $T'$ is also constructed so that the lengths of the $a$-$q$ and $b$-$q$ connections are the same as in $T$, tree $T'$ will have lower cost than $T$.
Let $G_v$ be the subtree of topology $G$ rooted at $v$, and let $S_v$ be the set of sinks in $G_v$. Suppose that sink $s_i$ is the sink in $S_v$ furthest from $q$. Create a new sink $z$ that is located at a point directly opposite of $q$ from $s_i$; i.e., $d(q, s_i) = d(q, z)$ and $d(s_i, z) = 2\,d(q, s_i)$. Consider a new set of sinks: $S_{new} = S_v \cup \{z\}$.

We create a topology $G_{new}$ for $S_{new}$ that merges $G_v$ and $z$ at its root, $s_{new0}$. We then run DME on $S_{new}$ using topology $G_{new}$ to create ZST $T_{new}$. By Theorem 2 of [3], $T_{new}$ will have minimum feasible delay at each sink, equal to one-half the diameter of $S_{new}$, specifically $d(q, s_i)$. By the Fact used in the proof of Theorem 2 of [3], $ms(s_{new0})$ is the set of all points within distance $d(q, s_i)$ of every sink in $S_{new}$.

Therefore, $q \in ms(s_{new0})$ and $T_{new}$ can be constructed so that $q = pl(T_{new}, s_{new0})$. Let $a' = pl(T_{new}, a)$, $b' = pl(T_{new}, b)$, and $q' = pl(T_{new}, v)$. We now construct ZST $T'$ for $S$ by cutting off the subtree of $T$ rooted at $q$ and replacing it with $T_{new}$ minus the edge between $q$ and $z$. Since $t_{LD}(T', q) = d(q, s_i)$, it must be that $t_{LD}(T', q) \le t_{LD}(T, q)$. If the strict inequality holds, we add extra wire between $q$ and $q'$ to enforce equality, and thereby retain zero skew.
For convenience, we use $e_{a'}$ and $e_{b'}$ to represent the embeddings of edges $e_a$ and $e_b$ in $T'$. We also use $e_{q'}$ to denote the partial edge between $q'$ and $q$ in $T'$. Because the subtrees of $T$ rooted at $a$ and $b$ were constructed according to DME, we have $t_{LD}(T, a) = t_{LD}(T', a')$ and $t_{LD}(T, b) = t_{LD}(T', b')$. Thus, because $t_{LD}(T', q) = t_{LD}(T, q)$, it must be that

$$|e_a| = |e_{a'}| + |e_{q'}| \quad \text{and} \quad |e_b| = |e_{b'}| + |e_{q'}|.$$
