In a complete physical synthesis flow, optimization transforms that can improve timing on critical paths already well optimized by a series of powerful transforms (timing-driven placement, buffering, and gate sizing) are invaluable. Finding such a transform is challenging, and making it efficient is harder still. This work explores innovative cloning (gate duplication) techniques to improve timing closure in a physical synthesis environment.
INTRODUCTION
Physical synthesis is a complex process that combines physical design with synthesis to perform design closure. As described in [7], physical synthesis typically consists of several stages including placement, legalization, critical path optimization, etc. Among these stages, the critical path optimization stage is particularly important. It takes a design that is legally placed and initially optimized for timing, and restructures critical paths by applying a multitude of different transforms, such as gate sizing, Vth tuning, and buffering. It is often not difficult to optimize timing on some nets early in a physical synthesis flow. However, it is usually more challenging to improve timing on nets that have already been optimized by a series of powerful transforms. This makes critical path optimization particularly challenging.
ISPD '10, March 14-17, 2010, San Francisco, California, USA. Copyright 2010 ACM 978-1-60558-920-6/10/03.
We believe that fixing all types of timing problems will require some new transforms to be developed, because each transform may fix certain problematic structures. In this paper, we design several highly efficient cloning techniques, also known as cell replication, to improve delays along critical paths. Cloning is not a new synthesis optimization. For example, Brglez [8] and Hwang et al. [10] use cloning as a mechanism to reduce net cut during partitioning, and [11, 13] study cloned gate placement in the FPGA domain. Since cloning helps in reducing the total capacitance loading of a high-fanout net, many works [16, 14, 12] focus on technology independent delay optimization. Under the load-dependent gate delay model with zero-wire delay, the authors of [16] prove that the cloning problem is NP-complete. Under the same delay model, the authors of [17] propose a right-to-left cloning scheme to improve the timing of a technology-mapped circuit. Due to the computational complexity of the problem, heuristics are often proposed to speed up the technique. However, all of these works neglect key features of the problem: the placement of the duplicated gate and interconnect delay. Thus, these models can be used in the logic synthesis stage of design but will be less accurate during the core stages of a physical synthesis flow.
For today's deep sub-micron technologies, previous cloning algorithms, which ignore wire delay, buffering and placement, are largely ineffective in the context of critical path optimization. This is explained in part by interconnect scaling, which has necessitated that buffers be inserted at shorter intervals to overcome wire resistance. Only very short wires will not require buffers [1] . Consequently, when one wants to apply cloning to improve path delay, buffers that have been inserted previously limit the scope of cloning for timing improvement. To make cloning effective, one must account for buffers by considering only non-buffer sinks, and re-buffering the resulting subcircuit.
To the best of our knowledge, the only work which handles both cloning and buffer insertion in the placement stage is BufDup [15]. Unfortunately, it considers cloning and buffer insertion separately. In addition, BufDup uses a timing-oblivious, simple k-means based clustering algorithm to partition the fanout gates. It does contain a timing-driven post-processing step, but this step can only be used to balance the capacitance loading of the two partitions and is not designed to improve timing. In contrast to [15], our cloning is based on a linear-delay model [2, 3], with the knowledge that buffered interconnect delay is linearly proportional to its length. This model handles simultaneous buffering and cloning in an abstract and unified way. Adopting such a delay model also helps to reduce the complexity of the gate cloning problem. This work shows that cloning under a buffer-aware linear-delay model can be solved optimally in polynomial time, so under this model the problem is no longer NP-complete.
Several recent works address simultaneous timing-driven gate placement and buffering. RUMBLE [9] uses a linear-delay buffering model and linear programming techniques to solve the timing-driven latch and gate placement problem under practical constraints. Pyramid [18] uses computational geometry techniques to efficiently solve the placement of a single gate with a similar delay model. Note that the timing-driven gate placement problem is subsumed by the timing-driven gate cloning problem, since a fixed sink partitioning reduces the cloning problem to a gate placement problem. Thus, the cloning problem is complicated by the need to find sink partitions and gate placements simultaneously.
An example of simultaneous cloning and buffering is shown in Figure 1. The arrival times of F1 and F2 are 0, and the required arrival times of S1 and S2 are 5. Consider the instance in Figure 1(a), where we consider cloning gate P. There are two sinks S1 and S2 with slacks +1 and -1. The delays from fan-ins F1 and F2 to P are 1 and 3 respectively, as are the delays from P to S1 and S2, including the delay of buffers and wires along the path. If we clone P to P' while leaving the original buffer trees intact, we may get the result shown in Figure 1(b), in which P' is placed very close to P, and the slack only improves to -0.5. Here the new location of P' is restricted by the buffers that must drive both P and P'. However, if one restructures the buffering solution to eliminate this constraint, one can obtain the far superior solution in Figure 1(c), which increases both slacks to +1 and obtains the physically shortest possible paths from F1 and F2 to S1 and S2. This example suggests that one must consider buffering and cloning together to effectively reduce delay.
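The slack values quoted for Figure 1(a) follow from ordinary static timing arithmetic. The following is a minimal Python sketch of that arithmetic, assuming a zero gate delay for P; the helper name `gate_slacks` is ours and is not part of the paper.

```python
def gate_slacks(fanin_at, fanin_delay, sink_rat, sink_delay):
    """Return (arrival time at P, list of sink slacks).

    Assumption: the gate delay of P is zero, so the arrival time at P is the
    latest fan-in arrival, and each sink slack is RAT - path delay - AT(P).
    """
    at_p = max(a + d for a, d in zip(fanin_at, fanin_delay))
    slacks = [r - d - at_p for r, d in zip(sink_rat, sink_delay)]
    return at_p, slacks


# Figure 1(a): AT(F1) = AT(F2) = 0, RAT(S1) = RAT(S2) = 5,
# fan-in delays 1 and 3, fan-out delays 1 and 3.
at_p, slacks = gate_slacks([0, 0], [1, 3], [5, 5], [1, 3])
print(at_p, slacks)
```

The later-arriving fan-in F2 sets the arrival time at P, which is why the sink with the longer fan-out path ends up with negative slack.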
Timing-driven buffering alone can be computationally expensive when used excessively [4]. It is also difficult to use it to derive any theoretical guidance for simultaneous cloning and buffering. To be most accurate, one should explore all possible partitionings of sinks for each net, find gate placements (e.g., with the technique in [9]), re-buffer with dynamic programming, and legalize the design. The whole process is highly expensive when applied to modern designs with hundreds of thousands of nets. It may also waste the majority of its runtime, because in many cases the new solution is worse than the old solution and will therefore be retracted.
Unlike the above approach, we use abstract timing models and build a theoretical guide on top of them. In our approach, the effect of buffering is modeled by a linear-delay model, similar to [9, 18] . Our algorithms guarantee optimality under this delay model, and can also be used as a filter to identify a group of critical gates that may benefit from cloning. Even if our solution does not fix all timing problems, one can still apply more accurate gate placement techniques (such as [9] ) based on our sink partitioning and re-buffer on a small group of nets. In that way, success rate and the total turnaround-time will be improved.
The main contributions of this work are summarized as follows.
• We propose several polynomial-time optimal algorithms for simultaneous timing-driven cloning and buffering under a linear-delay model. Our algorithms will "see through" buffer trees in the original circuit.
• When the original gate is a movable object, an O(m)-time algorithm to compute the optimal cloning that maximizes worst slack is proposed, where m is the number of fanouts.
• When the original gate is a fixed object, an O(m log m)-time algorithm to compute the optimal cloning is proposed.
Note that we assume the load-based cloning techniques have been used in the logic synthesis or early design stage, and we will not focus on the problem of reducing capacitive load. Also, buffering should have processed all high-fanout nets before the cloning we propose. The techniques in this paper are designed primarily for gates driving substantial interconnect delay (medium to long nets).
PRELIMINARIES
We outline our problem formulation as follows.
Linear Buffered-Path Delay Model
Timing driven buffering has become indispensable in timing closure [1, 6, 3] . To capture the effect of buffering in the cloning problem without applying dynamic programming based buffer insertion, it is best to use a buffer-aware interconnect delay model. Like previous work [2, 3, 9] , a linear-delay model is used in this paper. In a linear-delay model, the delay along an optimally buffered interconnect of length l is given by delay(l) = τ · l, where τ is a technology dependent constant. In general, τ depends on the buffer library size and the input slew. Therefore, for each technology, we perform optimal buffering on a long 2-pin net with the saturated input slew (which is the slew after several stages of optimal buffering). Subsequently, τ = delay(wire)/length(wire). For a multi-pin net, we break it into a set of 2-pin driver-to-sink edges and use the linear-delay model for each 2-pin edge. This simplification is generally valid since non-critical paths can be decoupled from the critical paths. With this model, the computationally expensive Steiner tree construction for multi-pin nets can also be avoided.
Empirical results in [3] indicate that linear-delay model is accurate with error within 0.5% when at least one buffer is inserted along the net. Results from [9] also show a 97% correlation between the linear-delay model and an industrial timing analysis tool.
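The linear buffered-path delay model can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the numeric value of τ used below are assumptions for demonstration, not values from the paper.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def calibrate_tau(buffered_delay, wire_length):
    """tau = delay(wire) / length(wire), measured on one long, optimally
    buffered 2-pin net (the measurement itself is outside this sketch)."""
    return buffered_delay / wire_length


def edge_delay(driver, sink, tau):
    """Delay of one optimally buffered 2-pin edge: tau * Manhattan length."""
    return tau * manhattan(driver, sink)


def net_edge_delays(driver, sinks, tau):
    """A multi-pin net is broken into independent driver-to-sink edges."""
    return [edge_delay(driver, s, tau) for s in sinks]


tau = calibrate_tau(buffered_delay=5.0, wire_length=10.0)  # hypothetical numbers
delays = net_edge_delays((0, 0), [(3, 4), (1, 1)], tau)
print(tau, delays)
```

Because each driver-to-sink edge is evaluated independently, no Steiner tree is needed to estimate the delay to any sink.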
Problem Formulation
The subcircuit for the cloning problem is a directed graph G = (V, E), where V = P ∪ F ∪ S, and E = (F × P ) ∪ (P × S). Vertex P is the target gate to be duplicated, F is the set of fan-in gates that drive P with size n, and S is the set of fan-out gates that P drives with size m.
Every gate g ∈ V is a logic gate performing certain logic functions, such as AND, OR, XOR but not buffers or inverters, and is associated with physical coordinates (X(g), Y(g)). If there are any buffers/inverters in the circuit that are fan-ins or fan-outs of P, we will look through them to find the first non-repeater logic gate. Each fan-out gate Si ∈ S is associated with a required arrival time RAT(Si) at its input pin, and each fan-in gate Fi ∈ F is associated with an arrival time AT(Fi) at its output pin.
The location of each gate in S and F cannot be changed in our problem formulation, and we refer to them as "fixed" (not movable) gates. Note that these gates may be allowed to move in other transforms (e.g., legalization after cloning), but their locations are constrained for cloning itself to simplify the analysis. There are also cases where they are "fixed" by designers who want to keep certain gates in specified locations, or where, in the late stages of the design flow, one prefers minimal perturbation to the design for stability. Gate P may be movable or fixed.
After cloning, we create a duplicated gate for P, denoted by P'. Finding a location for P' is one objective of this work. The graph after cloning is G' = (V', E'), where V' = V ∪ P' and E' = (F × P) ∪ (F × P') ∪ (P × SP) ∪ (P' × SP'). SP is the set of fan-out gates that P drives, and SP' is the set of fan-out gates that P' drives. In G', each fan-in gate Fi is also connected to the duplicated gate P', but the fan-out gates S are divided into two disjoint sets SP and SP'. We refer to the division of S into SP and SP' as a sink partitioning, and to SP and SP' as sink partitions. All other notations in G are valid for G'.
For each edge e = (g1, g2) in G and G', the Manhattan length of edge e is dis(e) = |X(g1) − X(g2)| + |Y(g1) − Y(g2)|, where g1 ∈ F ∪ P ∪ P', and g2 ∈ P ∪ P' ∪ S. Recall that all multi-pin nets will be broken into 2-pin nets with a linear-delay model. For each edge e, the edge delay is D(g1, g2) = τ · dis(e). Each edge is also referred to as a "net", where g1 is the driver and g2 is the sink.
For gates P and P', we denote their gate delays by D(P) and D(P'), respectively. In this paper, we treat these gate delays as constants. This is fairly accurate since we maintain that buffering must be performed with cloning, and after that, the load of P and P' will remain almost the same. Gate sizing can be performed before or after cloning if the original driver is too weak or too strong, which will further control the error of this constant gate delay model.
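For a gate g in P ∪ P', the required arrival time at the output pin of g is
$$RAT(g) = \min_{S_i \in S_g} \bigl( RAT(S_i) - D(g, S_i) \bigr),$$
where S_g is the set of its fan-out gates. The arrival time at the output pin of g is
$$AT(g) = \max_{F_i \in F} \bigl( AT(F_i) + D(F_i, g) \bigr) + D(g),$$
and the slack of g is Q(g) = RAT(g) − AT(g).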
Without loss of generality, we set gate delays D(P) and D(P') to zero in the following discussion to simplify the analysis. All algorithms are still valid as long as gate delays are constants.
It is easy to see that the slacks of P and P' determine the slacks of the subcircuits G and G'. For subcircuit G, we have Q(G) = Q(P), and for G', we have Q(G') = min{Q(P), Q(P')}. For each edge (net) e = (g1, g2) ∈ E ∪ E', we also define the slack of e as
$$Q(e) = RAT(g_2) - AT(g_1) - D(g_1, g_2). \qquad (1)$$
Cloning Problem: Given a graph G = (V, E), where P is the target gate, RAT for all fan-outs Si, AT for all fan-ins Fi, and a linear-delay constant τ, create a cloned gate P' for P, which yields a new graph G', and find SP, SP' and the locations of P' and P (if P is movable) such that Q(G') is maximized. In contrast to most previous work ([16, 12]), which only identifies the partitions SP and SP', our algorithms provide not only a partitioning of fan-outs but also the placement of P and P'. In addition, if the solution is worse than the original subcircuit, no cloning will be performed.
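To make the objective concrete, the sketch below evaluates the slack of the original subcircuit and of a cloned subcircuit for a given sink partitioning and given gate locations, with gate delays set to zero as in the text. The function names, the coordinates, and the value of τ are illustrative assumptions; the sketch only evaluates a candidate solution and does not search for one.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def gate_slack(pos, fanins, sinks, tau):
    """Slack of a gate placed at pos (zero gate delay assumed).

    fanins: list of ((x, y), AT); sinks: list of ((x, y), RAT).
    """
    at = max(a + tau * manhattan(p, pos) for p, a in fanins)
    rat = min(r - tau * manhattan(pos, p) for p, r in sinks)
    return rat - at


def cloned_slack(pos_p, pos_clone, fanins, sinks_p, sinks_clone, tau):
    """Slack of the cloned subcircuit: the worse of the two gate slacks."""
    return min(gate_slack(pos_p, fanins, sinks_p, tau),
               gate_slack(pos_clone, fanins, sinks_clone, tau))


# Hypothetical instance: two fan-ins, two sinks far apart, tau = 1.
fanins = [((0, 0), 0.0), ((4, 4), 0.0)]
s1, s2 = ((0, 8), 10.0), ((8, 0), 10.0)
q_single = gate_slack((2, 2), fanins, [s1, s2], 1.0)
q_cloned = cloned_slack((0, 4), (4, 0), fanins, [s1], [s2], 1.0)
print(q_single, q_cloned)
```

In this made-up instance a single gate in the middle leaves both sinks with negative slack, while one gate per sink recovers positive slack on both.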
FAST TIMING-DRIVEN GATE CLONING
In this section, we present our algorithms for the cases where P is movable and P is fixed. We start with several new concepts.
Best Region and Best Arrival Arc Segment
Recall that the set of fan-in gates F is connected to both the original gate P and the duplicate gate P' after cloning. The set of fan-out gates S is split into two disjoint sets (partitions) SP and SP' such that S = SP ∪ SP'.
Let us treat the whole circuit image as a 2D plane H. For each fan-in gate Fi in F, the arrival time at any point v in H is AT(Fi) + D(Fi, v). Therefore, if we place a gate at v with the fan-in set F, according to the static timing analysis rule, we have
$$AT(v) = \max_{F_i \in F} \bigl( AT(F_i) + D(F_i, v) \bigr).$$
Clearly, AT(v) is a 2D function whose variable is the location of v. Define the set of points minimizing AT(v) on the plane H as
$$K(F) = \{\, v \in H : AT(v) = \min_{u \in H} AT(u) \,\}.$$
So K(F) is the set of points which have minimum arrival time for all fan-ins. In the following, we will show that K(F) is either a single point or a line segment with a slope of 45° or −45°. Refer to Fig. 2 and 3 for examples of K(F).
An example of arrival time arc
If there is only a single fan-in, it is obvious that K(F) is the point of this fan-in itself, with AT = AT(F). If there are two fan-ins F1 and F2, then there are 3 cases:
$$K(F) = \begin{cases} \text{a Manhattan Arc} & \text{if } |AT(F_1) - AT(F_2)| < D(F_1, F_2), \\ v_{F_1} & \text{if } AT(F_1) - AT(F_2) \ge D(F_1, F_2), \\ v_{F_2} & \text{if } AT(F_2) - AT(F_1) \ge D(F_1, F_2). \end{cases}$$
Here v_{Fi} refers to the location of fan-in gate Fi. In the first case, where the difference between the arrival times at F1 and F2 is smaller than D(F1, F2), K(F) is a Manhattan Arc, which is a segment with slope 45° or −45° in the bounding box of F1 and F2. This slope will always be 45° or −45° as long as the technology dependent coefficient τ is a constant. Note that when F1 and F2 are either horizontally or vertically aligned, K(F) is a point, which is a degenerate case of a Manhattan Arc. For the other two cases, where one of the arrival times dominates the other one, K(F) is at the location of one of the fan-in gates.
Denote K(Fi) as the set of points minimizing AT (v) for fanins F1, . . . , Fi. If we have more than two fan-ins, we will first form K(F2) for F1 and F2, and then merge K(F2) with F3 to get K(F3). K(F3) will be another Manhattan Arc or a single point, depending on the relationship among AT (K(F2)), AT (F3), and dis(K(F2), F3) which is the shortest Manhattan distance between F3 and K(F2). Repeating this procedure for all fan-ins, we can find the final K(F ). This bottom-up merging process is very similar to the Deferred-Merge Embedding (DME) algorithm in clock tree construction [19] though the formula there is to get a zero skew arc. With a similar procedure to the one shown in [19] , it is not hard to prove that K(F ) is always a Manhattan Arc or a single point, and our merging process guarantees the optimality that K(F ) will have minimum arrival time for all fan-ins. The detailed proof is omitted here due to space limitations.
In the rest of the paper, we refer to K(F) as the arrival time arc (a point can be considered a degenerate case of an arc), and denote the arrival time on this arc by AT(K(F)). An example of K(F) for two fan-ins is shown in Fig. 2.
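For the two-fan-in case, the arrival time arc can be computed in closed form. The sketch below is one way to do this under the linear-delay model, assuming τ > 0; the function name `arrival_arc` and the example coordinates are ours. It covers only the two-fan-in case and does not implement the general bottom-up merge.

```python
def arrival_arc(f1, at1, f2, at2, tau):
    """Return (end_a, end_b, AT on arc) of K(F) for two fan-ins.

    f1, f2 are (x, y) locations; a single point is returned as end_a == end_b.
    Assumes tau > 0.
    """
    (x1, y1), (x2, y2) = f1, f2
    dx, dy = abs(x2 - x1), abs(y2 - y1)
    d = tau * (dx + dy)                      # D(F1, F2)
    if at1 - at2 >= d:                       # F1 dominates: K(F) is F1 itself
        return f1, f1, at1
    if at2 - at1 >= d:                       # F2 dominates: K(F) is F2 itself
        return f2, f2, at2
    # Balanced case: points inside the bounding box at Manhattan distance d1
    # from F1, where both fan-ins produce the same arrival time.
    d1 = (d + at2 - at1) / (2 * tau)
    sx = 1 if x2 >= x1 else -1
    sy = 1 if y2 >= y1 else -1
    a_lo, a_hi = max(0.0, d1 - dy), min(dx, d1)   # range of the x offset
    end_a = (x1 + sx * a_lo, y1 + sy * (d1 - a_lo))
    end_b = (x1 + sx * a_hi, y1 + sy * (d1 - a_hi))
    return end_a, end_b, at1 + tau * d1


arc_a, arc_b, at_arc = arrival_arc((0, 0), 0.0, (4, 2), 0.0, 1.0)
print(arc_a, arc_b, at_arc)
```

When the two fan-ins are horizontally or vertically aligned, `a_lo` and `a_hi` coincide and the arc collapses to a point, matching the degenerate case described in the text.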
Similarly, we can find K(S), the set of points maximizing RAT(v) on the plane H, for the set of fan-outs S. We refer to K(S) as the required arrival time arc and denote the required arrival time on this arc by RAT(K(S)). Refer to Fig. 3 for an example of K(S). With a procedure similar to that in [19], it is also easy to prove that the computation of K(F) and K(S) takes O(m) time, assuming m and n are of the same order. Also, given any order of fan-ins and fan-outs, each merging step that extends K(Fi−1) to K(Fi) (and likewise for the fan-outs) takes O(1) time.
The next Lemma states that for any point in the plane, its arrival time (required arrival time) can also be represented by the arrival time at the K(F ) (K(S)) and the shortest Manhattan distance between the point and K(F ) (K(S)). The proof is straightforward from the merging process and the fact that computation of arrival time (or required arrival time) is a max (min) operation.
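LEMMA 2. For any point v in the plane H,
$$AT(v) = AT(K(F)) + \tau \cdot dis(v, K(F)), \qquad RAT(v) = RAT(K(S)) - \tau \cdot dis(v, K(S)).$$
Now we introduce the concept of Best Region. Define Z(F, S) as the region formed by K(F) and K(S),
$$Z(F, S) = \{\, v \in H : dis(v, K(F)) + dis(v, K(S)) = dis(K(F), K(S)) \,\}.$$
It is easy to show that any point v outside region Z has dis(v, K(F)) + dis(v, K(S)) > dis(K(F), K(S)), and that no point exists in H with dis(v, K(F)) + dis(v, K(S)) < dis(K(F), K(S)).
Also, all points in region Z have the same slack
$$Q(Z(F, S)) = RAT(K(S)) - AT(K(F)) - \tau \cdot dis(K(F), K(S)).$$
The following theorem states the property of region Z(F, S).
THEOREM 1. Gate P achieves its maximum slack when it is placed in region Z(F, S).
PROOF. Suppose P is located at a point v outside of region Z with a bigger slack. Then based on Lemma 2 we have
$$Q(v) = RAT(K(S)) - AT(K(F)) - \tau \cdot \bigl( dis(K(F), v) + dis(K(S), v) \bigr) < RAT(K(S)) - AT(K(F)) - \tau \cdot dis(K(F), K(S)) = Q(Z(F, S)),$$
which contradicts the assumption.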
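The arrival-time half of Lemma 2 can be checked numerically on a small instance. The sketch below assumes an arrival time arc given by its two endpoints and compares the value predicted from the arc with a direct evaluation over the fan-ins; the helper names and the two-fan-in instance are illustrative assumptions.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dist_to_arc(v, p, q):
    """Shortest Manhattan distance from point v to the segment p-q.

    The distance is convex and piecewise linear along the segment, so the
    minimum is at an endpoint or where the segment crosses x = v.x or y = v.y.
    """
    ux, uy = q[0] - p[0], q[1] - p[1]
    candidates = [0.0, 1.0]
    if ux:
        candidates.append((v[0] - p[0]) / ux)
    if uy:
        candidates.append((v[1] - p[1]) / uy)
    best = float("inf")
    for a in candidates:
        a = min(1.0, max(0.0, a))            # clamp onto the segment
        point = (p[0] + a * ux, p[1] + a * uy)
        best = min(best, manhattan(point, v))
    return best


def arrival_time_from_arc(v, arc, at_arc, tau):
    """AT(v) predicted from K(F): AT(K(F)) + tau * dis(v, K(F))."""
    return at_arc + tau * dist_to_arc(v, arc[0], arc[1])


def arrival_time_direct(v, fanins, tau):
    """AT(v) evaluated directly as the latest fan-in arrival at v."""
    return max(a + tau * manhattan(f, v) for f, a in fanins)


# Hypothetical instance: fan-ins at (0, 0) and (4, 2) with AT = 0 and tau = 1
# give the arc from (1, 2) to (3, 0) with AT(K(F)) = 3.
fanins = [((0, 0), 0.0), ((4, 2), 0.0)]
arc, at_arc = ((1.0, 2.0), (3.0, 0.0)), 3.0
for v in [(5, 5), (0, 0), (2, 1), (2, 5)]:
    print(v, arrival_time_from_arc(v, arc, at_arc, 1.0),
          arrival_time_direct(v, fanins, 1.0))
```

The practical benefit is that, once the arc is known, evaluating the arrival time at a candidate location no longer requires a pass over all fan-ins.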
We refer to the region Z as the Best Region since it gives the region with the best slack. We also refer to the above procedure to find Z as FindBestRegion. The complexity of FindBestRegion is O(m) since the only cost is to compute K(F) and K(S).

THEOREM 2. FindBestRegion finds Best Region Z in O(m) time for a net with m fan-outs.
Now we introduce the concept of Best Arrival Time Arc. We define the Best Arrival Time Arc B(F, S) as the intersection of K(F) and Z(F, S). B(F, S) is part of K(F), while the detailed shape is decided by K(F) and K(S). In the examples illustrated in Fig. 3, B(F, S) is K(F) in Fig. 3(a), a single point in Fig. 3(b), and a partial segment of K(F) in Fig. 3(c). From Theorem 1, we know that every point on B(F, S) still achieves the maximum slack. Define the slack on B(F, S) as Q(B(F, S)), and we have Q(B(F, S)) = Q(Z(F, S)).
In the next section, the concept of B(F, S) is used to design our algorithm.
When the Original Gate is Movable
In this section, we present the algorithm for the case where the original gate P is movable. The main idea is to limit the solution search space to K(F), and then find the Best Arrival Time Arcs B(F, SP) and B(F, SP') efficiently by dividing the plane into 6 regions (Fig. 4) and using the unique properties of the fan-out slack in each region to find the best locations of P and P'.
When P is movable, we are free to place both P and P'. From Section 3.1, given a partitioning SP and SP', we can simply place P and P' on the best arrival time arcs B(F, SP) and B(F, SP') to achieve the optimal solution. The goal is to find the best partitioning SP and SP', which gives the best slack among all possible partitionings. However, without knowledge of the partitioning, B(F, SP) and B(F, SP') are not known to us.
An important observation is that both arcs need to be on K(F), which is known. Therefore, rather than trying all partitionings, we limit our solution search space for both P and P' to K(F), which enables efficient computation as well. This is the key to driving the partitioning and computing the best arrival time arcs. By limiting P and P' to K(F), we have AT(P) = AT(P') = AT(K(F)).
LEMMA 3. If arrival time arc K(F ) is a single point, no cloning is needed.
PROOF. If K(F) is a single point, then B(F, SP) = B(F, SP') is a single point. One can place P at B(F, SP) and achieve the maximum worst slack without cloning.
As stated in Section 1, the assumption here is that the capacitive load of the gate is reasonable and no capacitance-based cloning is needed. Now we discuss the case when K(F) is a Manhattan Arc. Since both P and P' are movable, we use P as the example in the following discussion. Without loss of generality, we assume K(F) is a 45° line segment (the analysis for the −45° case is similar), as shown in Fig. 4. For any fan-out gate Si in each region, we then analyze the relation between the slack of the edge (net) e = (P, Si) and the location of P on K(F). Note that Q(e) is purely determined by dis(P, Si), since AT(P) = AT(K(F)) for every location of P on K(F) and RAT(Si) is fixed.
Fig. 5 shows the typical curves of edge slack vs. location of P on K(F) for each region. The horizontal coordinate is the distance along the line segment from point i to point j. For example, if a fan-out is located in region H1, then we get the maximum slack for this net when P is located at i, and the minimum slack when P is located at j. If a fan-out is located in H2, the slack stays the same while P moves from i to a certain point on K(F), and begins to decrease as P moves on towards j. If we overlay all slack curves in the set SP (SP'), a minimum slack curve can be generated by taking the minimum slack among all slack curves at each point on K(F). The segment with maximum slack on this new curve gives the best slack we can achieve for this set of fan-outs. This segment, which we call the Best Slack Segment, is either a level segment or a single point. We now have two cases, depending on whether there are fan-outs in region H6.
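The slack curves and the minimum slack curve described above can be explored numerically. The sketch below parametrizes the location of P by a fraction t along K(F) from i to j and locates the best slack with a ternary search, which is valid because each edge slack is concave in t and so is their minimum. This is a numerical illustration only, not the exact geometric procedure of the paper; the function names and the instance are assumptions.

```python
def edge_slack(t, arc, at_arc, sink, rat, tau):
    """Slack of net (P, Si) with P at fraction t in [0, 1] along K(F) from i to j."""
    (ix, iy), (jx, jy) = arc
    px, py = ix + t * (jx - ix), iy + t * (jy - iy)
    return rat - at_arc - tau * (abs(px - sink[0]) + abs(py - sink[1]))


def best_slack_on_arc(arc, at_arc, sinks, tau, iters=100):
    """Maximize the minimum slack curve over t by ternary search.

    sinks: list of ((x, y), RAT). Returns (t, worst slack at t).
    """
    def worst(t):
        return min(edge_slack(t, arc, at_arc, s, r, tau) for s, r in sinks)

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if worst(m1) < worst(m2):
            lo = m1          # a maximizer lies in [m1, hi]
        else:
            hi = m2          # a maximizer lies in [lo, m2]
    t = (lo + hi) / 2
    return t, worst(t)


# Hypothetical instance: K(F) runs from i = (0, 4) to j = (4, 0), AT(K(F)) = 4.
arc = ((0.0, 4.0), (4.0, 0.0))
sinks = [((0.0, 8.0), 10.0), ((8.0, 0.0), 10.0)]
t_best, q_best = best_slack_on_arc(arc, 4.0, sinks, 1.0)
print(t_best, q_best)
```

In this instance the two sinks pull P toward opposite ends of the arc, so the best single location is the midpoint and the worst slack stays negative.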
When there are no fan-outs in region H6, from Figure 5, by putting all fan-out gates from H1 and H2 in one set (say SP), and all fan-out gates from H3 and H4 in another set (say SP'), the Best Slack Segment for each set is maximized, since this avoids potential intersections (e.g., between fan-outs from H1 and H3). In that case, B(F, SP) is a line segment on K(F) starting from i, and B(F, SP') is a line segment on K(F) starting from j. Fan-out gates from H5 can be put in either set and do not affect the results, since for every point in region H5, the distances to all locations on K(F) are the same. This partitioning is one of the best partitionings and achieves the best slack.
LEMMA 4. If P is movable and no fan-out gates are located in region H6, an optimal cloning solution can be found in O(m) time.
PROOF. If we put all fan-out gates from H1 and H2 in SP, all fan-out gates from H3 and H4 in SP', and all fan-out gates from H5 in either set, we have an optimal partitioning. One of the optimal placement solutions places P at i and P' at j. The time complexity is O(m), which is the time of computing K(F). The case when the slope of K(F) is −45° can be proved similarly.
From Lemma 4, it follows that
LEMMA 5. If P is movable, and no fan-out gates are located in the region H6 ∪ H1 ∪ H2 (or H6 ∪ H3 ∪ H4), no clone is needed and the optimal slack can be achieved by placing P at j (or i).
Now we present the general algorithm for the case when there are fan-outs in region H6.
A slack curve as shown in Figure 5 for any region H1, H2, H3, H4, H5 and H6 can be regarded as a trapezoid-like curve (referred to as a trapezoid for notational convenience henceforth) or a degenerate case (e.g., a line segment) of a trapezoid. Consider a graph containing slack curves corresponding to all fan-out gates. In the following, a side of a trapezoid will be called a line segment. The slope of any such line segment is 0, τ, or −τ. A zero-slope line segment in a trapezoid is called a level segment. In the degenerate case where the slack curve is a single line segment, the level segment is defined as the end point with maximum slack.
In all trapezoids, we first find the rightmost τ -slope line segment and the leftmost −τ -slope line segment. For example, left (resp. right) side of the dotted t1 (resp. t2) in Figure 7 (a) shows the rightmost τ -slope (resp. leftmost −τ -slope line segment). The line segment of a trapezoid is rightmost (resp. leftmost) if no line segment of the slope is to the right (resp. left) of the line segment. The leftmost and rightmost line segments can be found in linear time. First note that any point in a slack curve for fanout Si refers to the net slack Q(P, Si) when placing P along i, j as defined in Figure 4 . Given a single slack curve t1, the best slack it can achieve is the slack corresponding to the level segment. To achieve it, one can place the gate anywhere on that level segment.
Case 1: When the rightmost τ-slope line segment and the leftmost −τ-slope line segment do not intersect, as shown in Fig. 7(b), the lowest level segment of all trapezoids, which is the Best Slack Segment, determines the maximum worst slack and no gate duplication is needed. Note that in this case, pure line segments in regions H1, H2, H3 and H4 are considered as well, since they are degenerate cases of trapezoids. One can just place P anywhere on that level segment and achieve the best slack.
Case 2: When the rightmost τ-slope line segment and the leftmost −τ-slope line segment intersect, first find the trapezoids that these two line segments belong to. Without loss of generality, the identified trapezoids are denoted t1 and t2, respectively, in Fig. 7(a). We compute the intersections of all other trapezoids with t1 and t2, and put them into the sets SP and SP' formed by t1 and t2, respectively. All other trapezoids can be divided into three groups.
Group A: For any trapezoid not intersecting any of t1 and t2, called zero-intersecting trapezoid, we arbitrarily assign it to a set. The zero-intersecting trapezoids will not impact the worst slack. Note that if all trapezoids other than t1 and t2 are zero-intersecting trapezoids, the lowest level segment in t1 and t2 are Best Slack Segment of each set.
Group B: For any trapezoid intersecting only one of t1 and t2, called one-intersecting trapezoid, we can always assign it to the opposite set (formed by the line segment not intersecting with it). For example, the trapezoid t3 in Fig. 7(a) only intersects t2 and it is assigned to SP formed by t1. The one-intersecting trapezoids will not impact the worst slack as long as they are assigned appropriately. Note that if all trapezoids other than t1 and t2 are one-intersecting trapezoids, the lowest level segment in t1 and t2 are Best Slack Segment of each set.
Group C: For any trapezoid intersecting both t1 and t2, called a two-intersecting trapezoid, we have two intersecting points. A two-intersecting trapezoid will be assigned to the set containing the higher intersecting point. For example, both t4 and t5 are assigned to the SP formed by t1. One then needs to find the two-intersecting trapezoid with the lowest level segment, such as t4 in Fig. 7(a). Subsequently, the lowest level segment in t1, t2 and t4 determines the Best Slack Segment. In Fig. 7(a), the Best Slack Segment for P' is with t2. This means that P can be anywhere between a and b, and P' can be anywhere between c and d. For the partitioning of the set of fan-out gates S, P will connect to SP, which contains all the trapezoids assigned to SP determined by t1, and P' will connect to SP', which contains all the trapezoids assigned to SP' determined by t2. Note that the lowest level segment of a two-intersecting trapezoid can be lower than the intersection of t1 and t2; see, e.g., t5 in Fig. 7(c). However, this does not affect our algorithm. It just means that one cannot improve the slack by gate duplication, since the worst slack is determined by the level segment of t5.
The algorithm is optimal since the above two cases cover all possible situations and in each situation, it is easy to see that the optimal solution is computed. In the algorithm, one needs to first compute the rightmost τ-slope and the leftmost −τ-slope line segment. If they do not intersect, the slack is determined by the lowest level segment. Otherwise, for each of the remaining m − 2 trapezoids, compute its intersections with t1 and t2. Assign the trapezoids to partitions according to their groups. For a two-intersecting trapezoid, one also needs to record its higher intersection point. Next, find the trapezoid with the lowest higher intersecting point (e.g., t4 in Fig. 7(a)), which takes linear time. One can then immediately find the maximum possible worst slack the circuit can achieve by comparing it with the level segments of t1 and t2. Clearly, the above algorithm runs in linear time.
with the above trapezoids and assign them to SP and SP' accordingly. Compute P and P' as above.
13: Return the location of P and P', SP and SP'; End.
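When implementing the linear-time procedure, it can help to have a slow reference to compare against on small instances. The sketch below is such a reference and is explicitly not the paper's algorithm: it enumerates every sink partitioning (exponential in m) and, for each side, finds the best location on K(F) numerically by ternary search. All names and the instance are illustrative assumptions.

```python
def worst_slack(t, arc, at_arc, sinks, tau):
    """Minimum edge slack over sinks with the gate at fraction t along K(F)."""
    (ix, iy), (jx, jy) = arc
    px, py = ix + t * (jx - ix), iy + t * (jy - iy)
    return min(r - at_arc - tau * (abs(px - x) + abs(py - y))
               for (x, y), r in sinks)


def best_on_arc(arc, at_arc, sinks, tau, iters=100):
    """Ternary search for the location on K(F) maximizing the worst slack."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if worst_slack(m1, arc, at_arc, sinks, tau) < worst_slack(m2, arc, at_arc, sinks, tau):
            lo = m1
        else:
            hi = m2
    t = (lo + hi) / 2
    return t, worst_slack(t, arc, at_arc, sinks, tau)


def brute_force_clone(arc, at_arc, sinks, tau):
    """Try every sink partitioning; keep the one with the best worst slack.

    Starts from the no-cloning solution, so cloning is reported only if it
    strictly improves the slack.
    """
    n = len(sinks)
    t_all, q_all = best_on_arc(arc, at_arc, sinks, tau)
    best = {"slack": q_all, "sp": list(range(n)), "sp_clone": [],
            "t_p": t_all, "t_clone": None}
    for mask in range(1, 2 ** n - 1):        # both sides non-empty
        sp = [i for i in range(n) if (mask >> i) & 1]
        sc = [i for i in range(n) if not (mask >> i) & 1]
        t_p, q_p = best_on_arc(arc, at_arc, [sinks[i] for i in sp], tau)
        t_c, q_c = best_on_arc(arc, at_arc, [sinks[i] for i in sc], tau)
        if min(q_p, q_c) > best["slack"]:
            best = {"slack": min(q_p, q_c), "sp": sp, "sp_clone": sc,
                    "t_p": t_p, "t_clone": t_c}
    return best


arc = ((0.0, 4.0), (4.0, 0.0))
sinks = [((0.0, 8.0), 10.0), ((8.0, 0.0), 10.0)]
result = brute_force_clone(arc, 4.0, sinks, 1.0)
print(result)
```

Such a checker is only practical for a handful of fan-outs, but that is enough to cross-check a trapezoid-based implementation on randomly generated small cases.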
When the Original Gate is Fixed
When the original gate P is fixed, the above algorithms do not work since we cannot expect P to be placed on the arrival time arc K(F). Let us assume all fan-outs in S are sorted in non-increasing order of RAT(Si) − D(P, Si).
LEMMA 6. There are at most m unique Q(P ) values if P is fixed.
PROOF. Since P is fixed, AT(P) and D(P, Si) are constant. Then for all possible partitionings, Q(P) can only be one of the values RAT(S1) − D(P, S1) − AT(P), RAT(S2) − D(P, S2) − AT(P), . . ., RAT(Sm) − D(P, Sm) − AT(P).
The above Lemma states that if fan-out Si is in SP, then we can put all fan-outs Sj with j < i into SP, and Q(P) does not change. With Lemma 6, we can start by putting S1 in SP and all other gates in SP', and compute Q(P) and Q(P'), the smaller of which is the worst slack. If Q(P') ≥ Q(P), we can stop since this is the best slack we can get. If not, we can put S1 and S2 in SP, which will decrease Q(P) but may increase Q(P'). Again, if Q(P') ≥ Q(P), this will be the best slack we can get, since further steps will only decrease Q(P) and give a smaller worst slack for the whole subcircuit. The pseudo-code of the algorithm is shown as follows.
Compute Q(P) and Q(P')
break.
7: Compare the solution with Qori and return the location of P', SP and SP'; End.
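The sweep over sorted fan-outs can be sketched as follows. This is a hedged illustration rather than the paper's pseudo-code: `clone_slack` is a hypothetical callback that returns the best achievable Q(P') for a given set of fan-outs (for example via the Best Region computation), and the sketch conservatively keeps the best prefix seen so far instead of returning the first prefix at which Q(P') ≥ Q(P).

```python
def sweep_fixed_gate(keys, at_p, clone_slack):
    """Sweep prefixes of the sorted fan-outs for a fixed original gate P.

    keys[i]     : RAT(Si) - D(P, Si) for fan-out i
    at_p        : AT(P), constant because P is fixed
    clone_slack : hypothetical callback, list of fan-out indices -> best Q(P')
    Returns (worst slack, indices driven by P, indices driven by the clone).
    """
    order = sorted(range(len(keys)), key=lambda i: keys[i], reverse=True)
    # No-cloning baseline (Qori): P drives every fan-out.
    best = (min(keys) - at_p, list(order), [])
    for k in range(1, len(order)):
        sp, sp_clone = order[:k], order[k:]
        q_p = keys[order[k - 1]] - at_p      # worst fan-out kept on P
        q_clone = clone_slack(sp_clone)
        worst = min(q_p, q_clone)
        if worst > best[0]:
            best = (worst, sp, sp_clone)
        if q_clone >= q_p:                   # Q(P) only decreases from here on
            break
    return best


# Toy data: the slack each fan-out would see if driven by the clone.
toy_clone_slack = [0.0, 3.0, 3.0]
keys = [4.0, 1.0, -2.0]
result = sweep_fixed_gate(keys, 0.0,
                          lambda idx: min(toy_clone_slack[i] for i in idx))
print(result)
```

Sorting dominates the cost of the sweep itself; the overall running time also depends on how cheaply the callback can update Q(P') as fan-outs move from one set to the other.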
One note is that one can disconnect all sinks and just let P drive all fan-outs and move it, which is similar to RUMBLE [9] , and we can compare the solution with the above results and find the best one. 
EXPERIMENTAL RESULTS
To show the effectiveness of cloning, especially compared to other optimizations, we first create 100 random testcases in the 45nm node (which means the logic gates and buffers are taken from a 45nm library). We randomly create subcircuits with different numbers of fan-ins and fan-outs and place them in a region with the bounding box size ranging from 1 mm to 15 mm. The number of fan-ins ranges from 2 to 4, and the number of fan-outs ranges from 2 to 8. We choose 16 buffers and inverters for the buffer insertion.
We implemented 4 different optimizations including cloning as follows, to show the benefit of our techniques. They are
• Buffering: Timing-driven buffer insertion [6]. This data is treated as the baseline, and the data of all other optimizations is compared to it.
• RUMBLE: Moving the original gate and rebuffering, as described in [9].
• Clone1: Our cloning algorithm when the original gate is fixed.
• Clone2: Our cloning algorithm when both the original and duplicated gates can be moved.
To be fair, for RUMBLE, Clone1, and Clone2, we always run buffer insertion first, before the optimization. The results are also compared to the buffer insertion results (that is, "Buffering" is the baseline). This guarantees that any improvement we see from our techniques is due to cloning rather than to pure buffering on the original net. In addition, we use the RUMBLE algorithm inside our cloning algorithms to determine the best gate location once a partitioning is fixed. For each partition, we run RUMBLE to find the gate location and slack, and then choose the best solution among all partitions derived from our algorithm. Note that this is done only for comparison purposes; one can still apply our algorithm first to find the best partitioning and then apply the RUMBLE algorithm only once.
All algorithms, including buffering and RUMBLE, are implemented in C++ and tested on an AMD Opteron computer with a 2.8 GHz CPU and adequate memory. For cloning, we perform the full optimization steps, including ripping up the buffer trees of the subcircuits, duplicating and placing the gates, rebuffering, and legalization. For RUMBLE, we also rip up the buffer trees and place the original gate at the new location. We use an industrial static timing analysis engine for the timing analysis. For rebuffering, we implement the buffering algorithm in [6] to get the best timing-area tradeoff, and the buffer tree is constructed to be placement-congestion aware.
To clearly illustrate the impact of each optimization, we first choose one subcircuit and show the circuit layout after each optimization in Fig. 8(b) to Fig. 8(e), where Fig. 8(a) shows the original circuit without buffering. The Manhattan distance between S1 and S2 is 13 mm. The timing information after each optimization algorithm is shown in Table 1. It clearly shows the benefit of the buffering, RUMBLE, Clone1 and Clone2 approaches. Clone2 gives the best results in terms of worst slack and FOM. Clone1 is still better than RUMBLE and matches the slack of Clone2 at S1 (-1.606 ns), but cannot do as well at S2. RUMBLE achieves a better worst slack than pure buffering by placing the original gate in the middle; however, according to Table 1 it does so by trading slack at S2 (-2.206 ns to -2.403 ns) for slack at S1 (-2.855 ns to -2.410 ns). Note that the slacks at S1 and S2 are not exactly equal for RUMBLE and Clone2. This is due to slew effects, the buffering topology chosen by the placement-congestion-aware buffer-tree algorithm, which considers the placement density, and the order of buffer insertion across all nets, which results in asymmetric timing constraints.
Table 1. Timing information after each optimization.

Optimization             Slack at S1 (ns)   Slack at S2 (ns)
Buffering (Fig. 8(b))    -2.855             -2.206
RUMBLE (Fig. 8(c))       -2.410             -2.403
Clone1 (Fig. 8(d))       -1.606             -2.076
Clone2 (Fig. 8(e))       -1.606             -1.590
For the rest of the circuits, due to space limitations, we list detailed information only for the 10 subcircuits with the best improvement due to cloning. The results are shown in Table 2. For all experiments, we present the worst slack (WSLK) improvement over "Buffering", the Figure of Merit (FOM, the sum of the slacks of all negative-slack paths) improvement over "Buffering", and the final area and wirelength, where the "Buffering" experiment serves as the baseline. The area includes the original fan-in gates, the fan-out gates, the cloned gate and the buffering area. We also list summary results for all 100 subcircuits in Table 2, obtained by averaging all metrics. The runtime for all testcases is less than 5 seconds, including all static timing analysis, buffer insertion, linear programming inside RUMBLE, I/O processing and model build-up time.
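As a quick illustration of the two timing metrics, the sketch below computes worst slack and FOM for the four rows of Table 1. It assumes that S1 and S2 are the only timing endpoints of that subcircuit and that FOM is the plain sum of negative endpoint slacks; the function names are made up for this example.

```python
from typing import Dict, Sequence, Tuple

# Slacks (ns) at S1 and S2, copied from Table 1.
TABLE1: Dict[str, Tuple[float, float]] = {
    "Buffering": (-2.855, -2.206),
    "RUMBLE": (-2.410, -2.403),
    "Clone1": (-1.606, -2.076),
    "Clone2": (-1.606, -1.590),
}


def worst_slack(slacks: Sequence[float]) -> float:
    """Smallest (most negative) endpoint slack."""
    return min(slacks)


def fom(slacks: Sequence[float]) -> float:
    """Figure of Merit: sum of the negative endpoint slacks only."""
    return sum(s for s in slacks if s < 0)


def summarize(table: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """Map each optimization to its (worst slack, FOM) pair."""
    return {name: (worst_slack(s), fom(s)) for name, s in table.items()}


summary = summarize(TABLE1)
best_by_wslk = max(summary, key=lambda name: summary[name][0])
best_by_fom = max(summary, key=lambda name: summary[name][1])
```

Under these assumptions, Clone2 has both the largest worst slack (-1.606 ns) and the largest FOM (about -3.196 ns), which agrees with the ranking stated for Table 1.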
The table clearly shows the same trend as shown in Fig. 8 . In terms of worst slack, RUMBLE is better than buffering, and
