Abstract-In this paper, we study the construction of a Steiner routing tree for a given net with the objective of minimizing the delay of the routing tree. Previous work adopts the Elmore delay model to compute delay. However, with the advancement of IC technology, a more accurate delay model is required. Therefore, we use the two-pole delay model to compute the cost function of a Steiner tree. Moreover, we propose a new algorithm to construct the Steiner tree. Our algorithm takes into consideration the net topology, the total wire length, and the longest path from the source to a sink. Experimental results show that our algorithm is effective and efficient compared with [8].
I. INTRODUCTION
With the development of deep sub-micron technology, interconnection delay has become a very important factor in performance-driven design. Accordingly, an effective and efficient method to accurately estimate the interconnect cost of a given net is essential.
The interconnect cost of a net was first modeled as the cost of a Steiner tree of the net in [1], where the cost can be defined to estimate either the area or the delay of the net. When the goal is to minimize the delay of a net, as in many studies [2, 3, 4, 5, 6, 7, 8, 9], the cost function can be defined to minimize the total wire length, the radius (the longest source-to-sink path length), or the delay of the net under some delay model.
In [2, 3, 4, 5, 6, 7], it was found that the delay of a net is related not only to the maximum interconnection path length but also to the total wire length. Hence, a Steiner tree is constructed to minimize the total wire length under a prescribed radius bound. However, these approaches do not accurately estimate the delay of a net because they consider only path length in the computation of delay. In effect, these approaches use the lumped RC delay model. The Elmore delay model [10] is more accurate than the lumped RC delay model. An analysis of Elmore delay in RC trees shows that constructing a Steiner tree with minimum delay should consider the longest source-sink path, the total wire length (depending on the net size), and the interconnect topology. Accordingly, Steiner tree construction using this delay model can reflect the real delay more accurately. Research in this regard includes [8, 9], where the Elmore delay model is used as the cost function to construct a Steiner tree. However, in [8], only a greedy method was proposed, which may miss the optimal solution when the net size is large. In [9], a branch-and-bound method was adopted to find the optimal solution. The disadvantage of this approach is that its running time is exponential in the net size. Therefore, we propose a Steiner tree construction method that also takes the longest source-sink path length, the total wire length, and the net topology into consideration. Our algorithm produces results close to the optimal solution [9] in much less time.
With the advancement of IC technology, the effect of inductance on a wire should not be ignored in calculating the delay of a sink. The two-pole delay model [11, 12, 13, 14] and the three-pole delay model [15], which take inductance into consideration, have been proposed to calculate delay. These delay models are more accurate than the Elmore delay model. For example, Elmore delay is just a special case of the two-pole delay model in which inductance is not considered. Therefore, in this paper, we use the two-pole delay model to compute the cost of a Steiner tree.
The remainder of this paper is organized as follows. Section II formulates the problem of constructing a minimum-delay Steiner tree using the two-pole delay model. Section III presents our tree construction algorithm. In Section IV, some heuristic methods are presented to improve the routing results when certain special cases occur. Section V gives benchmarking results in which the delay of the routing tree, the total wire length, and the running time are compared. Finally, Section VI concludes this work.
II. PRELIMINARIES
In this section, we will give our problem definition and review two delay models.
A. Problem definition
Given a network $G = (V, E)$ ($|V| = N$) on [...] where $N_e$ denotes the number of wire segments on the path between $n_0$ and $n_i$, $C_{T_x}$ the tree capacitance of $T_x$, where $T_x$ denotes the subtree of $T$ rooted at $x$, and $R$ the resistance of the wire segment, including $r_d$ (the output driver resistance at the source).
The analysis of Elmore delay tells us that $C_0$ is proportional to the total wire length, and that $r_e$ and $c_e$ are proportional to the length of the edge $e$. Therefore, under the Elmore delay model, the delay can be decreased by reducing the total wire length and the length of the source-to-sink path. Moreover, the topology of the Steiner tree affects the value of the Elmore delay.
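The dependence of Elmore delay on path length, total wire length, and topology can be illustrated with a short sketch. The code below uses the commonly cited formulation of Elmore delay on an RC tree (driver resistance times total capacitance, plus, for each edge on the source-to-sink path, the edge resistance times half the edge capacitance plus the downstream capacitance). This formulation, the function name `elmore_delays`, and its parameters are assumptions made for illustration, since the defining equation is not recoverable from the text above.

```python
from collections import defaultdict

def elmore_delays(parent, r, c, sink_cap, r_d, root=0):
    """Return {node: Elmore delay} for an RC tree (illustrative sketch).

    parent[v]   : parent of node v (the root has no entry)
    r[v], c[v]  : resistance / capacitance of the edge (parent[v], v)
    sink_cap[v] : loading capacitance at node v (treated as 0 if absent)
    r_d         : output driver resistance at the source
    """
    children = defaultdict(list)
    for v, p in parent.items():
        children[p].append(v)

    # Preorder traversal: every node appears after its parent.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children[u])

    # Capacitance of the subtree rooted at each node (edge above it excluded).
    subtree_cap = {}
    for u in reversed(order):  # descendants are processed before ancestors
        subtree_cap[u] = sink_cap.get(u, 0.0) + sum(
            c[w] + subtree_cap[w] for w in children[u]
        )

    # The driver sees the whole tree; each edge adds r * (c / 2 + downstream cap).
    delay = {root: r_d * subtree_cap[root]}
    for u in order:
        for w in children[u]:
            delay[w] = delay[u] + r[w] * (c[w] / 2.0 + subtree_cap[w])
    return delay
```

Because `r[v]` and `c[v]` grow with edge length, lengthening any edge raises the driver term for every sink, while lengthening an edge on a sink's own path raises that sink's delay further.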
C. Two-pole delay model
Although the Elmore delay model has a compact definition and can be computed quickly, it does not capture all factors that contribute to delay. In particular, the Elmore delay model ignores the effect of inductance on the wire. The two-pole distributed RLC delay model is a moment-based method that takes the inductive effect into consideration. Its accuracy and computational cost lie between those of SPICE and the Elmore delay model.
Here, we omit the derivation of the two-pole delay model and directly adopt its final result. For the two-pole delay model, the delay $t(n_i)$ at sink $n_i$ is defined as follows:

[Equation not recoverable from the source.]

where $N_e$ denotes the number of wire segments on the path between $n_0$ and $n_i$; $R_j$, $L_j$, and $C'_j$ denote the resistance, inductance, and capacitance of wire segment $j$; and each wire segment is modeled as a lumped RLC segment. In this equation, the total capacitance $C'_j$ at any node $j$ is given by

$$C'_j = \begin{cases} C_j & \text{if there is no off-path subtree at node } j, \\ C_j + C_{T(j)} & \text{if node } j \text{ has an off-path subtree } T(j), \end{cases}$$
where $C_j$ is the capacitance at the node and $C_{T(j)}$ is the off-path subtree capacitance at node $j$. The unique path from the source $n_0$ to sink $n_i$ is referred to as the main path, and the edges not on the main path are referred to as off-path edges. From [10], we know that the condition for real poles is $(4M_2 - 3M_1^2) > 0$, the condition for complex poles is $(4M_2 - 3M_1^2) < 0$, and $(4M_2 - 3M_1^2) = 0$ corresponds to double poles. Comparing the Elmore delay model with the two-pole delay model, we find that the Elmore delay equals $b_1$ of the two-pole delay model. We also know that under the condition of real poles, the Elmore delay is close to the two-pole delay. Moreover, the two-pole delay is 1.95 times the Elmore delay for double poles. However, for complex poles, $b_2$ is the dominating factor and the Elmore delay is very different from the two-pole delay.
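The three pole conditions stated above can be expressed as a small classification routine. This is only a sketch: it assumes the discriminant is $4M_2 - 3M_1^2$ as read from the text, and the function name `classify_poles` and the relative tolerance used to detect the double-pole case are hypothetical choices, not part of the original model.

```python
def classify_poles(m1, m2, rel_tol=1e-9):
    """Classify the pole type from the moments M1 and M2 (illustrative sketch).

    Uses the sign of 4*M2 - 3*M1**2, as stated in the text:
    positive -> real poles, negative -> complex poles, zero -> double poles.
    """
    disc = 4.0 * m2 - 3.0 * m1 * m1
    # Compare against the magnitude of the terms so the test is scale-independent.
    scale = max(abs(4.0 * m2), 3.0 * m1 * m1)
    if scale == 0.0 or abs(disc) <= rel_tol * scale:
        return "double"
    return "real" if disc > 0 else "complex"
```

According to the discussion above, an Elmore-based estimate is a reasonable proxy only in the "real" case; in the "complex" case the two models diverge.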
From the equation of the two-pole delay model, we can see that $b_2$, as in the Elmore delay model, is determined by the path length of the sink, the total wire length, and the topology of the tree. Therefore, we use these characteristics of the Elmore and two-pole delay models to develop our algorithm, and the equation of the two-pole delay model to compute the cost function.
III. ALGORITHM
In the construction of a Steiner tree, intermediate junctions called Steiner nodes may be introduced to connect the sinks of a net. If all points on the routing plane are allowed to be Steiner points, the number of candidate Steiner nodes is infinite. However, Hanan proved in [16] that there exists a Steiner tree with minimum total wire length in which all Steiner nodes are chosen from the intersection points (called the Hanan grid) of the vertical and horizontal lines drawn through the source and sinks of the net. Hence, in the following discussion, we consider constructing a minimum-cost Steiner tree whose Steiner nodes are confined to the Hanan grid.
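As an illustration of the Hanan grid, the following sketch enumerates the grid points for a set of terminals given as integer coordinates. The function names `hanan_grid` and `steiner_candidates` are hypothetical and are used only for this example.

```python
def hanan_grid(terminals):
    """Return all intersections of the horizontal and vertical lines
    drawn through the given terminals (source and sinks)."""
    xs = sorted({x for x, _ in terminals})
    ys = sorted({y for _, y in terminals})
    return [(x, y) for x in xs for y in ys]

def steiner_candidates(terminals):
    """Return the Hanan grid points that are not terminals themselves."""
    terminal_set = set(terminals)
    return [p for p in hanan_grid(terminals) if p not in terminal_set]
```

For a net with $N$ terminals the grid has at most $N^2$ points, so the candidate set is finite.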
In a Steiner tree, there are at most four wire segments connecting to the source: the wire segments to the right of, to the left of, above, and below the source. Each sink is connected to the source through one of these four segments, as shown in Figure 1. Therefore, all sinks can be partitioned into four sets according to the wire segment through which they are connected to the source.
Let $V_0$, $V_1$, $V_2$, and $V_3$ be subsets of $V$, where $V_0$, $V_1$, $V_2$, and $V_3$ contain the sinks connected to the source through the right, top, left, and bottom wire segments, respectively, $V_0 \cup V_1 \cup V_2 \cup V_3 = V$, and $V_i \cap V_j = \{n_0\}$ for all distinct subsets $i$ and $j$. Each of the four subsets can independently connect all its nodes and form a subtree, called a p-tree of $T$. Therefore, a Steiner tree consists of four p-trees with respect to the four wire segments of the source. It is obvious that the delay of the whole Steiner tree, $Delay_T(V)$, is the maximum delay of the p-trees constructed for each $V_i$. Let the maximum delay of the nodes in $V_i$ be $delay(V_i)$. The following observation is true.
Figure 1: Four wire segments connecting to the source

From Observation 1, we know that the delay of the whole Steiner tree can be reduced if the delay of the subtree with the maximum delay is reduced while the delays of the other three subtrees are increased to no more than the maximum delay. Then, we have the following observation.

Based on Observation 1, our algorithm begins by partitioning all sinks into four subsets $V_0$, $V_1$, $V_2$, and $V_3$. Each sink is assigned to a subset according to its shortest distance to the x-axis and y-axis of the source. This partitioning is only a heuristic. The reasoning behind it is that connecting a sink toward its nearer axis may result in minimal total wire length because of the longest overlap between two paths. In the later steps, this partitioning is incrementally improved. Moreover, from Observation 1, we know that the delay of the whole tree is determined by the maximum delay of the four p-trees. However, a p-tree needs to be constructed for each subset before we can compute the delay of that subset. The PFA algorithm [4] is used here to construct the p-tree for the nodes in each subset $V_i$. PFA is adopted because it constructs a Steiner tree with the shortest path from the source to all sinks in $V_i$. Then, the delay of each of the four p-trees is calculated using the two-pole delay model. Finally, the delay of the whole tree, which is used as the cost function of this partition, is computed as the maximum delay of the four p-trees.
From Observation 2, we know that the delay of the whole tree can be decreased by reducing the delay of the p-tree with the maximum delay among the four p-trees. Let the subset $V_m$ have the maximum delay. From the analysis of the delay model in the previous section, we know that three factors affect the delay of a sink: the longest path length of the sink, the topology of the subset containing the sink, and the total wire length of the whole tree. Among them, the shortest possible longest-path length for $V_m$ is already achieved because the PFA algorithm is used to construct the p-tree. As for the other two factors, we use two methods to reduce the delay of $V_m$, by reducing the size of subset $V_m$ and the total wire length. First, we can move nodes in $V_m$ to another subset so that the topology of the p-tree for $V_m$ is changed and the total capacitance computed for $V_m$ is reduced. Second, if the topology of $V_m$ is not changed, we can reduce the total wire length by moving nodes among the subsets $V_i$ for $i \neq m$. The first and second methods are developed into the local and global searches of our algorithm, respectively. Before the local and global searches of each iteration begin, the subtree with the maximum delay is determined. Then, in the local search, we select one node to move from $V_m$ to another subset. Because we want to keep the path from the source to each sink shortest, each sink can appear in at most two subsets. For example, if a node $k$ is located in the first quadrant with the source as the origin, node $k$ can be connected only to the right or top wire segment of the source. That is, node $k$ can only be in the subset $V_0$ or $V_1$. We say that $V_0$ and $V_1$ are the active sets of $k$.
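The initial partitioning step can be sketched as follows. The sketch encodes one possible reading of the heuristic: a sink that is closer to the x-axis of the source is routed through the horizontal (right or left) segment, and a sink that is closer to the y-axis is routed through the vertical (top or bottom) segment. This reading, the tie-breaking rule, and the function name `initial_partition` are assumptions. The PFA tree construction and the delay evaluation are not shown.

```python
def initial_partition(source, sinks):
    """Partition sinks into four subsets keyed 0..3 (illustrative sketch).

    0: right segment, 1: top segment, 2: left segment, 3: bottom segment.
    Sinks are assumed to be distinct from the source.
    """
    sx, sy = source
    subsets = {0: [], 1: [], 2: [], 3: []}
    for (x, y) in sinks:
        dx, dy = x - sx, y - sy
        # |dy| is the distance to the source's x-axis, |dx| to its y-axis.
        if abs(dy) <= abs(dx):
            # Closer to the x-axis (ties go here by assumption): horizontal segment.
            subsets[0 if dx > 0 else 2].append((x, y))
        else:
            # Closer to the y-axis: vertical segment.
            subsets[1 if dy > 0 else 3].append((x, y))
    return subsets
```

The delay of the whole tree for a given partition would then be the maximum of the four p-tree delays, each computed after constructing the corresponding p-tree.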
If the node k is located on the axis of the source, there is only one active set for k.
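The active-set rule can be written directly from the quadrant of a sink relative to the source. In this sketch the subsets are indexed 0 to 3 for the right, top, left, and bottom segments, following the definitions of $V_0$ to $V_3$; the function name `active_sets` is hypothetical.

```python
def active_sets(source, sink):
    """Return the indices of the subsets a sink may belong to while
    keeping its source-to-sink path shortest (illustrative sketch).

    0: right, 1: top, 2: left, 3: bottom. A sink on an axis of the
    source has exactly one active set; any other sink has two.
    """
    dx = sink[0] - source[0]
    dy = sink[1] - source[1]
    sets = []
    if dx > 0:
        sets.append(0)
    if dy > 0:
        sets.append(1)
    if dx < 0:
        sets.append(2)
    if dy < 0:
        sets.append(3)
    return sets
```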
For each node $k$ in $V_m$, if there exists another active set $V_j$, $k$ is moved to $V_j$ and the delay is calculated for all $V_i$ using the two-pole delay model. Then, we choose the node in $V_m$ whose move to another subset decreases the delay of $V_m$ the most, under the condition that the move does not make the delay of any other subset larger than that of $V_m$.
In such a way, the delay of the whole tree can be reduced. We repeat the local search until no node in $V_m$ can be moved to reduce the delay of $V_m$. Note that sometimes, when computing the cost of moving one node to its other active set, we must compute the cost of moving a group of nodes. We use an example to explain this case. In Figure 2, let $n_2$ be selected to be moved from $V_0$ to $V_3$. The wire segment connecting $n_2$ and the y-axis will intersect the wire segment connecting $n_1$ and the x-axis. To avoid the intersection, nodes $n_2$ and $n_1$ are moved as a group if node $n_2$ is selected. Hence, the cost of moving $n_2$ is the cost of moving both $n_2$ and $n_1$.
In the global search, we move nodes in all subsets $V_i$. In each iteration, for each node in a subset $V_i$, if it has another active set, we calculate the delay of the whole tree for that move and use it as the cost function. Then, we choose the move that yields the minimum delay among all moves. We repeat the iterations until no node can be moved. After the iterations, a sequence of moves has been generated. Then, the principle of the K-L algorithm is applied: only the leading part of the sequence, up to the move that results in the maximum delay reduction, is actually performed.
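The selection rule of the local search can be sketched as below. The sketch assumes that the delays of all four subsets after each tentative move have already been computed; the function name `select_local_move` and the data layout are hypothetical, and grouped moves are represented simply as one candidate entry.

```python
def select_local_move(current, m, candidates):
    """Pick the move out of subset m that lowers its delay the most
    (illustrative sketch).

    current    : list of the four subset delays before any move
    m          : index of the subset with the maximum delay
    candidates : {move_label: list of the four subset delays after the move}
    Returns the label of the chosen move, or None if no move helps.
    """
    best_move, best_delay = None, current[m]
    for move, after in candidates.items():
        # Reject moves that make some other subset slower than subset m.
        feasible = all(after[i] <= after[m] for i in range(4) if i != m)
        if feasible and after[m] < best_delay:
            best_move, best_delay = move, after[m]
    return best_move
```

When the function returns `None`, the local search for the current $V_m$ terminates.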
The local search and global search are iterated until no improvement can be obtained. The pseudo-code of our algorithm is presented in Figure 3. In each step of the algorithm, an alternative method can be used to trade accuracy for efficiency. For example, in step 7, we can adopt the Elmore delay model instead of the two-pole delay model, which accelerates the calculation of delay at some loss of accuracy. In step 9, the global search can be performed by exhaustive search. Now, an example is shown in Figure 4. First, all nodes are partitioned into four subsets according to their shortest distances to the x-axis and y-axis of the source, as shown in Figure 4(b). Then the p-trees are constructed and the delay of each p-tree, which is used as its cost, is calculated. The costs of $V_0$, $V_1$, $V_2$, and $V_3$ are 2.25, 0, 0, and 5.15, respectively. Since $V_3$ has the maximum cost, $V_m$ is set to $V_3$. Now, for each node in $V_3$, we calculate the cost if the node is moved to its other active set. Let the node to be moved be $n_1$. If $n_1$ is moved to its other active set $V_2$, the cost of $V_3$ is reduced to 4.29 while the costs of the other subsets, $V_0$, $V_1$, and $V_2$, become 2.22, 0, and 3.13. Since the costs of the other three subsets are not larger than that of $V_3$, moving $n_1$ to $V_2$ is a candidate for selection. Similarly, the costs of moving $n_2$, $n_4$, $n_6$, $n_7$, and $n_8$ to their active sets are calculated; they are 4.08, 3.40, 4.82, 5.41, and 6.14. However, because moving $n_4$ and $n_6$ would make the delays of $V_0$ and $V_2$ larger than the delay of $V_m$, $n_4$ and $n_6$ are excluded from the candidates for moving. Also note that if $n_2$ is moved to $V_2$, it will block $n_8$ from connecting to $V_3$. Hence, when calculating the cost of moving $n_2$, both $n_2$ and $n_8$ are moved to $V_2$. Therefore, in the first iteration of the local search, since the cost of moving $n_2$ is the minimum, $n_2$ and $n_8$ are selected to be moved to $V_2$. In the second iteration, we repeat the local search and calculate the costs for the remaining nodes in $V_3$: $n_1$, $n_4$, $n_6$, and $n_7$. Since no improvement can be obtained, the local search ends. The result of the local search is shown in Figure 4(c).
In the global search, we calculate the costs of all sinks that have another active set to move to. The computed costs of $n_1$, $n_2$, $n_3$, $n_4$, $n_5$, $n_6$, $n_7$, and $n_8$ are 7.43, 6.14, 6.21, 5.96, 4.03, 4.48, 4.20, and 5.15, respectively. Note that when $n_5$ is moved to $V_3$, nodes $n_3$, $n_4$, and $n_7$ are also moved as a group because moving $n_5$ to $V_3$ will block $n_3$, $n_4$, and $n_7$ from connecting to the x-axis. The node with the minimum cost is $n_5$. Hence, $n_5$ and $n_3$ are moved to $V_3$, and $n_4$ and $n_7$ are left in $V_3$. The global search continues by calculating the costs of the remaining nodes, $n_1$, $n_2$, $n_6$, and $n_8$, and obtains the costs 4.80, 4.45, 6.66, and 6.61. Since $n_2$ has the minimum cost, it is selected as the next node to move. The iteration repeats. Finally, we obtain a sequence of moves and their costs: $(\{n_3, n_4, n_5, n_7\}, n_2, n_1, n_6, n_8) = (4.03, 4.45, 4.86, 3.57, 6.96)$. Based on the idea of the K-L algorithm, $n_1$, $n_2$, $n_3$, $n_5$, and $n_6$ are selected to be moved. The result is shown in Figure 4(d). We iterate the local search and global search until no improvement can be obtained.
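The K-L style selection in this example amounts to keeping the prefix of the move sequence that ends at the lowest cost. A minimal sketch follows; the function name `best_prefix` and the `initial_cost` parameter (the tree delay before the global search starts) are hypothetical names introduced for illustration.

```python
def best_prefix(moves, costs, initial_cost):
    """Return the leading moves to perform and the resulting cost
    (illustrative sketch of the K-L selection principle).

    moves[k] is the k-th move (a tuple of node labels moved together) and
    costs[k] is the whole-tree delay after performing moves[0..k].
    If no prefix improves on initial_cost, no move is performed.
    """
    best_len, best_cost = 0, initial_cost
    for k, cost in enumerate(costs, start=1):
        if cost < best_cost:
            best_len, best_cost = k, cost
    return moves[:best_len], best_cost

# Move sequence and costs from the example above.
moves = [("n3", "n4", "n5", "n7"), ("n2",), ("n1",), ("n6",), ("n8",)]
costs = [4.03, 4.45, 4.86, 3.57, 6.96]
```

With these values the minimum cost 3.57 is reached after the fourth move, so the first four moves are kept and the move of $n_8$ is discarded, which agrees with the selection described in the text.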
IV. IMPROVEMENT
Algorithms based on Hanan's theorem cannot always find the optimal solution [9, 17], although their solutions are within one percent of optimal for all cases, as addressed in [9]. Therefore, a Steiner tree constructed by PFA based on Hanan's theorem may not be the optimal solution. Moreover, PFA requires that the tree connect all sinks to the source by shortest paths, which may result in a less satisfactory solution. Therefore, we propose heuristic algorithms that post-process solutions to gain further improvement. Figure 5 shows an example in which the delay of the tree in Figure 5(b) is less than that in Figure 5(a). In Figure 5(b), because the paths from the sinks $n_6$, $n_7$, and $n_8$ to the source are not shortest paths, it is impossible for PFA to construct this Steiner tree. Instead, PFA will generate the tree shown in Figure 5(a). To solve this problem, we propose a heuristic to improve the solution.
A. Non-shortest path problem
The heuristic begins with a traversal of the tree. During the traversal, an edge is added to form a cycle, and then some other edge in the cycle is selected and deleted to break the cycle. If the cost of the added edge is less than that of the deleted edge and the maximum delay is not increased, this insertion-deletion is performed. For example, in Figure 5(a), we find that adding the edge between $n_4$ and $n_5$ results in a cycle and that deleting the edge between $n_0$ and $n_7$ in the cycle reduces the delay, so this insertion-deletion is performed. The final result is shown in Figure 5(b).
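A simplified form of the insertion-deletion step can be sketched as follows. This sketch makes a restricting assumption that is not in the text: the deleted edge is always the edge from the re-attached node to its current parent, whereas the heuristic described above may delete any edge of the cycle. The names `try_edge_swap` and `max_delay` (a caller-supplied function that evaluates the maximum sink delay of a tree) are hypothetical.

```python
def manhattan(a, b):
    """Rectilinear distance between two points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def is_descendant(parent, node, ancestor):
    """True if `ancestor` lies on the path from `node` up to the root."""
    while node in parent:
        node = parent[node]
        if node == ancestor:
            return True
    return False

def try_edge_swap(parent, pos, v, new_parent, max_delay):
    """Add edge (new_parent, v) and delete edge (parent[v], v) if that helps.

    parent    : {node: parent node}; the root has no entry and v is not the root
    pos       : {node: (x, y)}
    max_delay : function mapping a parent map to the maximum sink delay
    Returns the new parent map if the swap is accepted, else the original.
    """
    old_parent = parent[v]
    # Attaching v below itself or one of its descendants would disconnect the tree.
    if new_parent == v or is_descendant(parent, new_parent, v):
        return parent
    added = manhattan(pos[new_parent], pos[v])
    removed = manhattan(pos[old_parent], pos[v])
    trial = dict(parent)
    trial[v] = new_parent
    # Accept only if the new edge is cheaper and the maximum delay does not grow.
    if added < removed and max_delay(trial) <= max_delay(parent):
        return trial
    return parent
```

Any delay model can be supplied as `max_delay`, so the same routine could be driven by either the Elmore or the two-pole evaluation.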
B. Connecting to empty-subset problem
When nodes are distributed non-uniformly, some subsets may be empty (or contain very few nodes), and thus the delay of the corresponding subtree is very small. Take the net shown in Figure 6 as an example. In this example, $V_2$ and $V_3$ are empty. In this case, we do not follow the active set rule. Instead, we move nodes in the set with the maximum delay to the empty sets. The reasoning behind this heuristic is that, although moving a sink to a subset that is not one of its active sets results in a long wire connection, the delay of the whole tree is still likely to be reduced because the empty set has a very small delay and thus a lot of room for its delay to increase. If $n_2$ is selected to be moved to $V_3$, the wire segment connecting $n_2$ and the y-axis will intersect the wire segment connecting $n_3$ and the source. Therefore, $n_2$ is not selected. Now, we move node $n_3$ to $V_3$. Because this move does not cause any intersection of wires and the delay of the tree is reduced, $n_3$ is moved to $V_3$. The result is shown in Figure 6(b).
V. EXPERIMENTAL RESULTS
Our algorithms are implemented in the C language and run on a Sun-Ultra Enterprise 150. Alg1 is implemented following the algorithm presented in Section III. Alg2 is the same as Alg1 except that the Elmore delay model, rather than the two-pole delay model, is used to compute the delay cost function of a Steiner tree. We compare our algorithms with the SERT algorithm [8], which produces solutions whose delay is within 5% of the optimal solutions based on the Elmore delay model [9].
In our experiments, examples are randomly generated. For a fixed number of nodes in a given net, 20 cases are generated and the average results of the 20 cases for the three algorithms are compared. Four different technologies are used, as shown in Table 1, where IC2 and MCM are from MOSIS and AT&T, respectively [8], and Real and Comp are from [12]. Table 2 shows the delay comparisons, where delay is computed using the two-pole delay model. In this table, the column index is the number of nodes in a net and each entry gives the delay ratio computed with the delay of SERT as the denominator. We see from the table that the delay produced by Alg1 is 7%-25% less than that produced by SERT. Comparing the results of Alg1 and Alg2 shows that minimizing the Elmore delay of a Steiner tree is not equivalent to minimizing its two-pole delay.
Table 3 also shows delay comparisons. However, in this experiment, delay is computed using the Elmore delay model. The comparison of Alg2 and SERT demonstrates the effectiveness of our algorithm. In this table, Alg2 produces better results than the SERT algorithm when the net size becomes large and advanced technology is used.
The total wire length is compared in Table 4. It shows that the total wire lengths of Alg1 and Alg2 are reduced by about 2.9% and 0.6% on average, respectively, compared with that of SERT. Finally, running time is compared. In this experiment, 20 examples each with 10, 20, 30, 40, and 50 nodes in a net are generated and the IC2 technology is used. The running times of the 20 examples for each case are averaged. The result is plotted in Figure 7. When the number of nodes in a net is less than 20, the running times of the three algorithms are about the same. However, when the number of nodes is increased to 40, the running times of Alg1 and Alg2 exceed that of SERT. Comparing the running times of Alg1 and Alg2, we can also note that computing the two-pole delay takes much more time than computing the Elmore delay.
VI. CONCLUSIONS
In this paper, we have proposed an algorithm to construct a Steiner tree with the objective of minimizing delay using the two-pole or Elmore delay model. Using the PFA algorithm together with local and global search, our algorithm can find a Steiner routing tree that takes the net topology, the total wire length, and the longest path from the source to a sink into consideration. Benchmarking results show that our algorithm is efficient and effective.
