Abstract-Conventionally, the topology of signal net routing is almost always restricted to Steiner trees, either unbuffered or buffered. However, introducing redundant paths into the topology (which leads to a non-tree) may significantly improve timing performance as well as tolerance to open faults and variations. These advantages are particularly appealing for timing-critical net routing in nanoscale VLSI designs, where interconnect delay is a performance bottleneck and variation effects are increasingly pronounced. We propose Steiner network construction heuristics which can generate either tree or non-tree topologies with different slack-wirelength tradeoffs, and which handle both long path and short path constraints. We also propose heuristics for simultaneous Steiner network construction and buffering, which may provide further improvement in slack and resistance to variations. Furthermore, incremental non-tree delay update techniques are developed to facilitate fast Steiner network evaluation. Extensive experiments in different scenarios show that our heuristics usually improve timing slack by hundreds of picoseconds compared to traditional approaches. When process variations are considered, our heuristics can significantly improve timing yield because of nominal slack improvement and delay variability reduction.
I. INTRODUCTION

SIGNAL net routing is an important part of VLSI circuit design since it directly affects interconnect delay, which is a well-known performance bottleneck. In practice, people almost always use Steiner trees [1], either buffered or unbuffered, for signal net routing because they are cost-effective and their delay is relatively easy to compute. However, introducing redundant paths into the topology (which leads to a non-tree) has some remarkable advantages compared to trees. A non-tree topology can often obtain significantly better timing performance than trees [2], [3]. In addition, the redundant paths in a non-tree network provide certain tolerance to open faults and, therefore, can improve manufacturing yield and reliability [4]. Moreover, a non-tree topology can sometimes reduce delay variations [4]. Even though non-tree delay computation is more expensive than that of trees, the increasingly strong need for improving interconnect performance and fault/variation tolerance may force us to adopt non-tree topologies for timing-critical nets. On the other hand, the computation overhead can usually be alleviated by advances in computation techniques and facilities. Previous transitions from power tree to power grid and from clock tree to clock mesh indicate that design needs often eventually outweigh computation overhead if the overhead is not prohibitively large.
Perhaps the first non-tree routing work is [2], which greedily adds extra wires to given trees to minimize source-sink delay. The later work of [3] inserts a link between the source and the sink having the maximum delay. The recent work of [4] focuses on another aspect of non-tree routing: reliability and manufacturing yield. It adds extra edges to an existing tree to increase the percentage of two-connected wires, which implies tolerance to open faults. The previous works [2], [3] on timing-driven non-tree routing have two main weaknesses. First, since they add wires to existing trees, the performance of the resulting non-trees depends on the initial trees, and starting with an arbitrary tree does not guarantee that it can facilitate a good non-tree solution. Second, they [2], [3] optimize only delay without considering timing constraints. In reality, maximizing slack or minimizing wire cost subject to timing constraints is a more common and useful problem formulation [5].
The timing constraints in previous works [5] are almost always upper bounds on sink delays. In fact, there are also delay lower bounds due to the short path (hold time) constraints in synchronous circuits. Some gate sizing works [6] consider both delay upper bounds and lower bounds at the same time. To the best of our knowledge, no signal net routing work has yet considered such double-sided timing constraints. This is perhaps because a delay lower bound can be easily satisfied by padding extra delay. Delay padding can be implemented by wire detour, adding dummy capacitors, or inserting redundant buffers. The former two approaches may increase the delay along the long path. The latter approach of redundant buffers may intensify the leakage power problem. Thus, the short path constraints need to be handled in a more careful manner.
We propose Steiner network construction heuristics which consider delay upper bounds and lower bounds simultaneously for timing-critical nets. We will show that a link insertion can sometimes simultaneously reduce long path delay and increase short path delay. The first heuristic is a greedy link insertion in an existing tree, which is similar to [2] but with the solution search trimmed for the double-sided timing constraints. The second one is a dynamic programming based constructive algorithm which can generate a set of solutions with different slack-wirelength tradeoffs and can reach either tree or non-tree topology. The third one extends the previous heuristic to handle simultaneous Steiner network construction and buffer insertion, which may provide further improvement in slack and resistance to variations. Incremental non-tree delay update techniques are also developed for improving the efficiency of our algorithms. Extensive experimental results show that our Steiner network construction, either buffered or unbuffered, usually improves slack by hundreds of picoseconds compared to the traditional tree results. Moreover, our constructive method almost always outperforms the greedy approach like [2]. The non-tree approach may bring some wirelength and runtime overhead, but the impact on overall chip design is very limited considering that it is applied to only a small number of timing-critical nets. When process variations are considered, Monte Carlo simulation results show that our methods can improve timing yield greatly because of both nominal slack improvement and delay variability (standard deviation) reduction.
The rest of this paper is organized as follows. Section II formulates the Steiner network construction problem. Section III presents the fast incremental non-tree delay update procedure. Section IV describes the proposed Steiner network construction algorithms without buffers. Section V describes the simultaneous Steiner network construction and buffering algorithm. Section VI presents the experimental results with analysis. A summary of work is given in Section VII.
II. PRELIMINARY
We will show that link insertion in an existing tree or non-tree may simultaneously reduce long path delay and increase short path delay under a certain condition. Refer to Fig. 1 for an RC network, which can be either a tree or a non-tree. The Elmore delay from the source to a node $i$ in an RC network is given by $t_i = \sum_k R_{ik} C_k$, where $C_k$ is the ground capacitance at node $k$ and $R_{ik}$ is the transfer resistance, which equals the voltage at node $i$ when 1 A current is injected into node $k$ and all the other node capacitances are set to zero [3]. Consider inserting a link between two nodes $u$ and $v$ in the RC network. Let the link resistance be $R_l$ and the link capacitance be $C_l$. This link insertion is equivalent to adding capacitance $C_l/2$ at nodes $u$ and $v$, respectively, and inserting resistance $R_l$ between $u$ and $v$.
Let denote the delay increase at due to adding link capacitor . Then and , where is the path resistance from the source to node and is their shared path resistance. After the link insertion, the delay to and are changed from and to and according to the following equations [7] :
where and . In general, and are equal to the Elmore delays at and , respectively, when node capacitance , and the other node capacitances are set to zero [3] .
The previous equations show that the link capacitance always increases signal delay while the link resistance attempts to average the delay between and . It is straightforward to derive the following condition on the simultaneous improvement for both long path and short path delay.
Lemma: If a link with resistance and capacitance is inserted between a node on a long path and a node on a short path in a Steiner network, the necessary and sufficient condition of simultaneously reducing delay to node and increasing delay to node is . Proof: First note that both and are positive. To reduce long path delay, i.e., , we need , which is . To increase the short path delay, we need . This is true since and
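A small numerical sketch can make the lemma concrete. The Python below assumes a hypothetical three-node RC tree (the element values and the node names `A`, `U`, `V` are illustrative and not taken from the paper), models the link with an assumed pi-model that splits the link capacitance equally between its two end nodes, and computes first-order delays exactly by solving G t = c with fractions. It also checks one rank-one form of the link update, t'_i = T_i - r_i (T_u - T_v) / (R_l + r_u - r_v), where T is the delay vector after adding only the link capacitance and r is the delay vector obtained with capacitance +1 at u, -1 at v, and zero elsewhere. This notation is an assumption made for the sketch.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination."""
    n = len(b)
    M = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def conductance(n, resistors):
    """Conductance matrix for nodes 0..n-1; None denotes the source (reference)."""
    G = [[F(0)] * n for _ in range(n)]
    for i, j, r in resistors:
        g = 1 / F(r)
        for a, b in ((i, j), (j, i)):
            if a is not None:
                G[a][a] += g
                if b is not None:
                    G[a][b] -= g
    return G

# Hypothetical tree: source -1-> A, A -4-> U (long path), A -1-> V (short path).
A, U, V = 0, 1, 2
tree = [(None, A, 1), (A, U, 4), (A, V, 1)]
caps = [F(0), F(1), F(1)]
G = conductance(3, tree)
t_tree = solve(G, caps)                      # delays before link insertion

# Link between U and V: resistance R_l, capacitance C_l split equally (assumed).
R_l, C_l = F(1), F(1, 2)
caps_link = list(caps)
caps_link[U] += C_l / 2
caps_link[V] += C_l / 2

# Rank-one update that reuses solves with the original G only.
T = solve(G, caps_link)                      # effect of link capacitance alone
unit = [F(0)] * 3
unit[U], unit[V] = F(1), F(-1)
r = solve(G, unit)                           # r_i = R_iu - R_iv
k = (T[U] - T[V]) / (R_l + r[U] - r[V])
t_link = [Ti - ri * k for Ti, ri in zip(T, r)]

# Reference: rebuild the full non-tree and solve directly.
t_direct = solve(conductance(3, tree + [(U, V, R_l)]), caps_link)
```

In this toy instance the delay at `U` drops while the delay at `V` rises, which is the simultaneous improvement the lemma describes. With a larger link capacitance the first effect can disappear, since the capacitance alone always adds delay.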
When double-sided timing constraints are considered, each sink has a delay upper bound and a delay lower bound. The delay upper bound is the same as the required arrival time (RAT) in traditional methods. We define the late slack of a sink as its delay upper bound minus its delay. Similarly, the early slack of a sink is defined as its delay minus its delay lower bound. The slack of a sink is the smaller of its late slack and its early slack. The late slack, early slack, and slack of a network (or subnetwork) are the minimum late slack, early slack, and slack among all sinks in the network, respectively. For a network (or subnetwork), the sink having the minimum late (early) slack is called the late (early) critical sink. Here is our problem formulation.
Timing-Driven Steiner Network Construction: Given a source node and a set of sink nodes, with each sink having a load capacitance, a lower delay bound, and an upper delay bound, construct a rectilinear Steiner network spanning the source and the sinks such that the slack of the network is maximized.
The previous problem is solved in Section IV. We also consider the problem of simultaneous buffer insertion and Steiner network construction in this paper. That is, the network construction process also involves buffer insertion. Such an algorithm is given in Section V.
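The slack definitions of this section can be sketched in a few lines of Python. The sketch assumes that late slack is the upper bound minus the delay, early slack is the delay minus the lower bound, and the slack of a sink is the smaller of the two. The function names and the sample numbers (in picoseconds) are hypothetical.

```python
def sink_slacks(delay, lower, upper):
    """Return (late slack, early slack, slack) of one sink."""
    late = upper - delay
    early = delay - lower
    return late, early, min(late, early)

def network_slacks(sinks):
    """sinks: list of (delay, lower_bound, upper_bound) tuples.

    Returns the network-level slacks and the indices of the critical sinks.
    """
    per_sink = [sink_slacks(*s) for s in sinks]
    idx = range(len(sinks))
    late_critical = min(idx, key=lambda i: per_sink[i][0])
    early_critical = min(idx, key=lambda i: per_sink[i][1])
    return {
        "late": per_sink[late_critical][0],
        "early": per_sink[early_critical][1],
        "slack": min(p[2] for p in per_sink),
        "late_critical": late_critical,
        "early_critical": early_critical,
    }

# Hypothetical three-sink net: sink 0 violates its upper bound,
# sink 1 violates its lower bound.
example = network_slacks([(600, 100, 550), (120, 150, 700), (300, 50, 400)])
```

Maximizing the network slack therefore means raising the worst of all late and early slacks at once, which is why an edit that helps only the long path may not improve the objective.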
III. INCREMENTAL NON-TREE DELAY UPDATE
One hurdle of using non-tree topology is the complexity of computing its delay. The first-order delay model for an RC network is , where is the node delay vector, is the conductance matrix, and is the node capacitance vector. Frequently, performing matrix inverse operations in an optimization procedure may consume a lot of runtime. In this section, we propose incremental non-tree delay update techniques. If the node capacitors are treated as current sources, the first-order delay computation is equivalent to dc analysis in a linear circuit. Therefore, we introduce these techniques in the language of dc analysis for the convenience of presentation. We start with a simple example of linear dc circuit in Fig. 2 (a) without loss of generality. Using the standard modified nodal analysis (MNA) analysis, a set of KCL/KVL equations can be derived for the circuit as (3) where (4) is the conductance matrix, is the dc solution, is the input, and links the inputs to the circuit. The number of circuit unknowns is denoted as which is 5 in this case. The current of the voltage source can be expressed as , where . In the following subsections, we use the notation to describe a linear dc circuit. Given a linear dc circuit and the LU factor of the conductance matrix, we will show how to efficiently update the dc solution if an incremental change is made to the circuit. Since the update after link insertion has been discussed in [8] , we focus on the update for the other incremental changes.
A. Reroot
Reroot means the root (voltage input) of a subnetwork is changed from one node to another. We illustrate this operation by moving the voltage source from node 1 [in Fig. 2(a)] to node 2 [in Fig. 2(b)]. In Fig. 2(b), a current source is added to node 1 to reflect the node capacitance there. To solve the new dc circuit, an LU factor will be needed for the updated matrix. The updated conductance matrix can be obtained from via a rank-4 update , where and are and matrices, respectively, given by Using the well-known matrix inversion formula [9], the inverse of can be obtained from that of as (5) where has a dimension of 4 × 4; therefore, it is inexpensive to invert. Notice that the inverses of and are never formed explicitly. Instead, only the LU factor of is reused to solve any linear system defined by efficiently.
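The low-rank idea behind this update can be illustrated with a short Python sketch of the matrix inversion (Woodbury) formula. The sketch assumes the update is written as G + U V^T with U and V stored as lists of k column vectors. The matrices below are arbitrary illustrative values, not the rank-4 reroot matrices of the paper, and the callable `solve_G` is a hypothetical stand-in for forward and backward substitution with a stored LU factor of the original matrix.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination."""
    n = len(b)
    M = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def dot(p, q):
    return sum(a * c for a, c in zip(p, q))

def woodbury_solve(solve_G, U, V, b):
    """Solve (G + U V^T) x = b using only solves with the original G."""
    k = len(U)
    y = solve_G(b)                              # G^{-1} b
    Z = [solve_G(u) for u in U]                 # columns of G^{-1} U
    # Small k-by-k system: S = I + V^T G^{-1} U
    S = [[(1 if i == j else 0) + dot(V[i], Z[j]) for j in range(k)]
         for i in range(k)]
    w = solve(S, [dot(v, y) for v in V])
    return [yi - sum(Z[j][i] * w[j] for j in range(k))
            for i, yi in enumerate(y)]

# Illustrative data (assumed): a 3x3 matrix and a rank-2 update.
G = [[F(3), F(-1), F(0)], [F(-1), F(3), F(-1)], [F(0), F(-1), F(2)]]
U = [[1, 0, 0], [0, 0, 1]]
V = [[2, 0, 0], [0, -1, 1]]
b = [F(1), F(2), F(3)]

x = woodbury_solve(lambda rhs: solve(G, rhs), U, V, b)

# Reference: form the updated matrix explicitly and solve it directly.
G_tilde = [[G[i][j] + sum(U[m][i] * V[m][j] for m in range(len(U)))
            for j in range(3)] for i in range(3)]
x_direct = solve(G_tilde, b)
```

Only the small k-by-k system is solved from scratch; every other step is a solve with the original matrix, which is what makes reusing an existing factorization attractive.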
B. Adding Wire
In this case, a wire (modeled as a conductance and a current source corresponding to wire capacitance) is inserted into the th internal node of the circuit. To solve the updated circuit, we need to consider the modified conductance matrix (6) where is almost identical to except that the location increases by , and is a column vector and has a 1 at the th location and zeros otherwise. Since can be obtained from via low-rank update:
, an LU factor of can be implicitly obtained by using the matrix inversion lemma as in (5) . Now consider an arbitrary linear system solution with respect to and a right-hand side . The set of circuit equations are partitioned into (7) Substituting the upper part of (7) into the lower part gives (8) where is a scalar, therefore, its inversion is trivial. Once is obtained, can be solved using the upper part of (7). 
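The partitioned solve described here can be sketched as follows. The Python below assumes the new wire has conductance `g`, attaches to existing node `k`, and ends in one new node, so that the enlarged system has the block form [[Ghat, -g e_k], [-g e_k^T, g]] with Ghat equal to the old matrix plus `g` at position (k, k). The function and variable names are hypothetical. The callable `solve_Ghat` stands in for a solve with Ghat, which the text obtains implicitly through the matrix inversion lemma and which is done directly here for brevity.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination."""
    n = len(b)
    M = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def add_wire_solve(solve_Ghat, k, g, b1, b2):
    """Solve the enlarged system after attaching a wire at node k.

    b1 is the right-hand side for the existing nodes, b2 for the new node.
    """
    n = len(b1)
    e_k = [F(int(i == k)) for i in range(n)]
    p = solve_Ghat(b1)                  # Ghat^{-1} b1
    q = solve_Ghat(e_k)                 # Ghat^{-1} e_k
    schur = g - g * g * q[k]            # scalar Schur complement
    x2 = (b2 + g * p[k]) / schur        # unknown at the new node
    x1 = [pi + g * x2 * qi for pi, qi in zip(p, q)]
    return x1, x2

# Hypothetical chain: source -1-> node 0 -2-> node 1, unit load at each node.
G = [[F(3, 2), F(-1, 2)], [F(-1, 2), F(1, 2)]]
k, g, c_wire = 1, F(1, 3), F(1)        # new wire: R = 3, C = 1 at node 1
G_hat = [row[:] for row in G]
G_hat[k][k] += g
b1 = [F(1), F(1) + c_wire / 2]         # half of the wire capacitance at node 1
b2 = c_wire / 2                        # other half at the new node

x1, x2 = add_wire_solve(lambda rhs: solve(G_hat, rhs), k, g, b1, b2)

# Reference: assemble the full 3x3 system and solve it directly.
full = [G_hat[0] + [F(0)], G_hat[1] + [-g], [F(0), -g, g]]
x_direct = solve(full, b1 + [b2])
```

The result matches the hand-computed Elmore delays of the three-node chain, and the only new inversion is that of a scalar.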
C. Merging Two Networks
Now consider the delay update after merging two subnetworks. As shown in Fig. 3 , a network described by system matrices is being merged into another network at its th internal node. For the first network, it is assumed that the system matrices are constructed by treating the merging point as an input voltage port. Therefore, can be used to select the port current when the voltage of the th internal node of the second network is provided as (9) If the merging point for the first network varies, a reroot procedure (see Section III-A) can be used to implicitly construct an LU factor of the matrix. After the merging, the dc system equation for the network becomes (10) where is a column vector with a proper length and has a 1 at the th location and zeros otherwise. The previous equation gives the voltage at the th node as . Substituting (9) into the previous relationship leads to (11) From (11), the rest of circuit responses in both networks can be subsequently solved.
IV. UNBUFFERED STEINER NETWORK
A. Greedy Link Insertion
In this section, we introduce a greedy heuristic which inserts links in an existing Steiner network such that the slack is maximized considering the double-sided timing constraints. This heuristic can be applied either for obtaining a non-tree network from a tree or as a subroutine in a constructive Steiner network algorithm.
For the given network , which can be either tree or non-tree, we first identify its early critical sink and late critical sink (both defined in Section II). Next, we find the shortest path which connects the two critical sinks (using only existing edges). For each node , we tentatively insert a link between and each edge with the shortest connection. If node is at coordinate , and the two ending nodes of are at and , respectively, the link is inserted between node and location , where and . For each link insertion result, we evaluate the slack of the network. The link which gives the maximum slack improvement is finally chosen to be inserted. After one link insertion is completed, we repeat this procedure till there is no slack improvement. In the case when short path constraints are neglected, the role of the early critical sink is played by the sink with the maximum late slack.
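The greedy procedure above can be sketched in Python under two stated assumptions. First, the "shortest connection" from a node to an axis-aligned edge is taken to be the point obtained by clamping the node coordinates to the span of the edge, which is one natural reading of the connection-point formula. Second, the network representation is left abstract: `candidate_links`, `insert_link`, and `slack` are hypothetical callables supplied by the caller, not functions from the paper.

```python
def clamp(value, a, b):
    """Clamp value into the closed interval spanned by a and b."""
    lo, hi = min(a, b), max(a, b)
    return max(lo, min(value, hi))

def connection_point(node, end_a, end_b):
    """Point on a horizontal or vertical edge closest to node (Manhattan)."""
    return (clamp(node[0], end_a[0], end_b[0]),
            clamp(node[1], end_a[1], end_b[1]))

def link_length(node, end_a, end_b):
    """Manhattan length of the tentative link from node to the edge."""
    px, py = connection_point(node, end_a, end_b)
    return abs(node[0] - px) + abs(node[1] - py)

def greedy_links(network, candidate_links, insert_link, slack):
    """Insert the best link repeatedly until the slack stops improving."""
    best = slack(network)
    while True:
        trials = [insert_link(network, link)
                  for link in candidate_links(network)]
        if not trials:
            return network
        top = max(trials, key=slack)
        if slack(top) <= best:
            return network
        network, best = top, slack(top)

# Toy stand-in for a network: a tuple of chosen "links" with a made-up slack.
result = greedy_links(
    (),
    lambda net: [x for x in (1, 2, 3) if x not in net],
    lambda net, x: net + (x,),
    lambda net: -(sum(net) - 4) ** 2,
)
```

In a real implementation each call to `slack` would trigger a delay evaluation of the modified network, which is where the incremental updates of Section III would be used.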
B. Discussion on Topology
The effect of link insertion depends on the initial tree topology. Previously, there were many discussions on the area-radius tradeoff among tree topologies [1]. The area refers to the total wirelength and the radius is the maximum source-sink path length in a tree. The two extreme cases of this tradeoff are: 1) chain-like topology [see Fig. 4(a)], which has small area and large radius, and is usually derived from minimum spanning tree algorithms; 2) star-like topology [see Fig. 4(b)], which has relatively large area and small radius, and can be obtained from the shortest path tree or rectilinear Steiner arborescence (RSA) algorithms [1].
The major weakness of a tree with chain-like topology is that the delay of some sink may suffer from the long path length. For example, if in Fig. 4(a) is the late critical sink with tight delay upper bound, the long detour may cause large delay constraint violation. If we include non-tree topology into consideration, we may reach different conclusions. If a link (dashed line) is inserted in the chain-like topology as in Fig. 4(a) , the long detour problem is eliminated and the small wirelength is still enjoyed. However, if the late critical sink is instead, perhaps the star-like topology in Fig. 4(b) is still better. Thus, it is not clear which tree topology can facilitate a good non-tree solution in general. A main effort in our constructive algorithm is to probe different topologies so that the chance of capturing good non-tree solutions can be increased.
C. Constructive Steiner Network Heuristic
If we treat a network as a tree plus links, the problem of network construction can be accordingly decomposed into finding a proper tree topology and link insertions. We combine these two concerns into a dynamic programming based heuristic. This heuristic is a bottom-up merging procedure where multiple candidate solutions are generated to probe good topologies and link insertions. At the beginning, a set of subnetworks are initialized with the sink nodes. In each iteration, a pair of subnetworks are selected to be merged. Different merging solutions are generated. For each new subnetwork resulting from a merging, another candidate solution is generated by inserting a link in it. These candidate solutions are propagated toward the source.
Solution Characterization and Pruning: A candidate solution is a subnetwork rooted at node . It can be characterized by the total load capacitance , delay lower bound and delay upper bound at . It is easy to derive that the delay upper bound is the same as the late slack of . Similarly, the delay lower bound is equal to the negative of the early slack. If there is another candidate solution at node and it has exactly the same sink set as , the two solutions can be compared for pruning. If , and , solution is inferior. During the solution propagation, inferior solutions are pruned out. The procedure for checking whether solutions are inferior is called inferiority check.
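The inferiority check can be sketched as a three-way dominance test. The Python below assumes that a smaller load capacitance, a smaller delay lower bound (that is, a larger early slack), and a larger delay upper bound (a larger late slack) are each preferable, and that both solutions share the same root and sink set. The `Solution` type and the sample values are hypothetical.

```python
from typing import NamedTuple

class Solution(NamedTuple):
    cap: float     # total load capacitance seen at the root
    lower: float   # delay lower bound at the root (negative of early slack)
    upper: float   # delay upper bound at the root (late slack)

def is_inferior(a, b):
    """True if a is no better than b in every component and differs from b."""
    return (a != b and a.cap >= b.cap
            and a.lower >= b.lower and a.upper <= b.upper)

def prune(solutions):
    """Drop duplicates and every solution dominated by another one."""
    unique = list(dict.fromkeys(solutions))
    return [s for s in unique
            if not any(is_inferior(s, other) for other in unique)]

candidates = [
    Solution(10, 5, 100),
    Solution(12, 6, 90),    # worse than the first in all three components
    Solution(8, 7, 95),     # lighter load, so it survives
    Solution(10, 5, 100),   # exact duplicate
]
kept = prune(candidates)
```

This quadratic scan is enough for small solution sets; larger sets would usually be kept sorted so that dominated entries can be found faster.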
Merging Selection: We propose two merging selection criteria for two different scenarios: 1) long path constraints and short path constraints are almost equally tight and 2) long path constraints dominate.
For the first scenario, we use a merging scheme similar to prescribed skew clock tree routing [10] . In fact, when the delay upper bound of each sink is equal to its delay lower bound, i.e., the delay constraints degenerate to a single value target, this problem is equivalent to prescribed skew clock routing. In prescribed skew clock routing, the subtree with the maximum delay target is merged first to reduce the chance of wire detour [10] . Since we have delay upper and lower bound instead of a single delay target, we use the average as the criterion. The is the anticipated wire delay from the root of the subnetwork to the source node. This is to encourage subnetworks with roots far away from the source to be merged early. In each iteration, we first choose the subnetwork with the maximum , and then merge it with its nearest neighboring subnetwork.
The second scenario is more like traditional signal routing [1]. Therefore, we adopt a merging criterion similar to that of RSA [11]. That is, we choose a pair of subnetworks whose merging root is farthest from the source among all pairs. If we consider merging subnetworks rooted at and , then the merging root is at , where is the location of the source node. Then, the pair with the maximum value is selected for a merging. Our method is different from the well-known RSA algorithm [11], which restricts all sinks to one quadrant if the source is at (0,0). Our merging selection can handle cases in which sinks are distributed in multiple quadrants.
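One way to realize a multi-quadrant merging root is sketched below in Python. This is an assumed generalization rather than the exact formula of the paper: along each axis, if both roots lie on the same side of the source, the merging root takes the coordinate closer to the source; otherwise it takes the source coordinate. Pairs are then ranked by the Manhattan distance from the merging root to the source. All names and sample coordinates are hypothetical.

```python
from itertools import combinations

def merging_root(u, v, src=(0, 0)):
    """Merging root of two subnetwork roots u and v relative to the source."""
    def axis(a, b, s):
        da, db = a - s, b - s
        if da * db <= 0:          # opposite sides of the source (or on its axis)
            return s
        return s + (min(da, db) if da > 0 else max(da, db))
    return (axis(u[0], v[0], src[0]), axis(u[1], v[1], src[1]))

def select_pair(roots, src=(0, 0)):
    """Indices of the pair whose merging root is farthest from the source."""
    def score(pair):
        m = merging_root(roots[pair[0]], roots[pair[1]], src)
        return abs(m[0] - src[0]) + abs(m[1] - src[1])
    return max(combinations(range(len(roots)), 2), key=score)

roots = [(4, 6), (7, 3), (-2, 4), (-3, 9)]
chosen = select_pair(roots)
```

With all roots in the first quadrant this reduces to the componentwise minimum used in RSA-style merging.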
Merging: After a pair of subnetworks are selected, we consider two types of mergings between them. One is the root-root merging as in Fig. 5(a), where subnetwork and are merged at node . The other is the shortest merging where two nodes from the two subnetworks with the minimum distance are connected directly. After the merging, the node closest to the source is selected as the root for the merged network. For example, in Fig. 5(b), the merging between and is obtained by connecting and where reroot occurs. Then, is chosen as the root. The root-root merging is very similar to the RSA [11] heuristic, which leads to star-like topology. The shortest merging is more likely to result in chain-like topology. By having these two different types of mergings, various topologies can be generated to compete for the best slack solution.
Nondisjoint Merging: If two subtrees contain some common sink nodes, i.e., they have nondisjoint sink sets, the merging between them results in a non-tree even without link insertion. In previous work on dynamic programming-based buffered Steiner tree [12] , such merging is forbidden since a tree topology needs to be retained. In general Steiner network construction, whether or not to allow the nondisjoint merging becomes less clear. According to our experience, allowing nondisjoint mergings increases runtime dramatically with insignificant gain on timing slack.
Link Insertion: For the two subnetworks obtained from mergings, we insert a link in each of them. The link insertion procedure is almost the same as that described in Section IV except that only one link is inserted for each subnetwork. There are two reasons for this difference. First, the link insertion here is an intermediate step and trying multiple links in one step is computationally expensive. Second, the size of a subnetwork is relatively small and, therefore, the number of necessary links is normally small.
Solutions at the Source: At the source, there are a set of solutions with different capacitance and slack tradeoff. One can choose either the maximum slack solution or the minimum capacitance solution without negative slack.
We say that a solution is incomplete if it has more than one subnetwork. Otherwise, it is complete. The pseudo-code for the Steiner network construction algorithm is given in Fig. 6.
V. BUFFERED STEINER NETWORK
In this section, we describe the constructive method for simultaneous Steiner network construction and buffer insertion. When dealing with a net driven by multiple drivers, two issues that do not exist in link insertion for an unbuffered Steiner network have to be considered. One is the risk of short circuit between different drivers. The other is the fast estimation of signal delays in multidriver nets.
A. Short Circuit Avoidance
In a buffered non-tree, it is likely that multiple buffers drive the same subnet. Then, there is a risk of short circuit from one buffer to another, especially when the signal arrival times to the two buffers are quite different. 1 It is noticed in [13] that the short circuit can be avoided if the arrival time difference is smaller than the signal propagation delay between the two buffers. To be safer, we can require that the upper bound of the arrival time difference is smaller than the lower bound of the delay between the two buffers.
Denote the two buffers by and . The lower bound of signal propagation delay from to can be obtained through the method of [14] as follows: (12) where indicates two end nodes of an edge and is the edge resistance and is the total capacitance downstream of node . Similarly, we can obtain the lower bound of signal propagation delay from to . Denote by the upper bound of the difference between signal arrival time to and considering variations. The criterion for avoiding short circuit between and is [13] 
where is a constant used for added safety margin. Refer to [13] for further details.
B. Multi-Driver Delay Estimation
Although signal delay in a multidriver net can be computed by SPICE or model order reduction methods such as AWE [15] , an Elmore-like method is necessary for fast delay estimation during optimizations. Such a method is very similar to the one proposed in [13] . For completeness, we include it in this paper.
The main idea is to transform a multidriver net to an equivalent single driver net. For the convenience of presentation, the method in [13] is described using a dual-driver case. If there are two drivers for a net as in Fig. 7(a), the term of signal delay is not well defined as the signal departure time from the two drivers may be different. As such, we need to find the signal arrival time to a node in the net given the signal departure time and from node 1 and node 2 in Fig. 7(a). Without loss of generality, assume that , i.e., and . We are to insert a virtual resistance between the signal source and node 1 such that the signal delay across the virtual resistance is and the signal departure time from is . Setting the signal departure time at and both to after this insertion, we can merge with into a single source as shown in Fig. 7(b). It remains to find the value of the virtual resistance such that the delay across it is equal to . Since the net in Fig. 7(b) has a single driver, the signal delay at node 1 is well defined by letting the signal departure time . Clearly, is a function and [13]. Following [13], for the current non-tree, we can find a single node, which is called joint node, such that the non-tree can be transformed into a tree by tearing the joint node into two separated nodes and . For Fig. 7(b), node tearing separates the non-tree into subtree , which is driven through and , and subtree which is driven through . If node and node , the delay at node 1 is , where is the total downstream capacitance at node 1 in subtree . After node and node are merged back, the delay at node 1 satisfies [7] (14)
1 See [13] for more details and illustrations.
Fig. 7. The dual driver net in (a) can be converted to the single driver net in (b) when signal departure time t at node 1 is no less than the signal departure time t at node 2 [13].
where and are delays at node and before the merging. The values of and are equal to the Elmore delay at and , respectively, when node capacitance and the other node capacitance are zero [7] .
Following [13] , can be decomposed as , where is the delay from node 1 to before merging. Similarly, can be decomposed as , where is the total path resistance from node 1 to node . The value of is equal to the total path resistance from node 2 to node before the merging. The value of can be then derived from (14) as (15) 
C. Simultaneous Steiner Network Construction and Buffering
The simultaneous buffering and Steiner network construction algorithm goes in the same way as the unbuffered case described in Section IV-C except that at each merging point, we consider the possibility of inserting a buffer there. That is, at each merging point, both buffered solutions and unbuffered solutions are placed into the solution set. To update the delay/slack of the buffered Steiner subnetworks, the method described in Section V-B is employed. In addition, to avoid short circuit problem, a buffer can be inserted only in "feasible" positions, i.e., those merging points such that the criterion in (13) is satisfied. We find that allowing buffer insertion at each feasible position may result in too many solutions and make the algorithm very inefficient. Speedup techniques are necessary. As such, we set up a threshold and a buffer is inserted when the number of noninferior solutions is less than . A buffer is inserted also when a random number rand(0,1) is smaller than a value set by user, where rand(0,1) denotes a random number generator with output ranging from 0 to 1. As indicated by our experiments, this strategy improves the efficiency of the algorithm with slight degradation in solution quality. Refer to Fig. 8 for the pseudo-code of the simultaneous algorithm.
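The speedup rule for buffer insertion can be sketched as a simple predicate. The Python below assumes the two triggers are combined with a logical OR and that the short-circuit feasibility of the merging point is checked first. The parameter names `threshold` and `probability` and the function name are hypothetical, since the original symbols are not available.

```python
import random

def should_insert_buffer(feasible, num_noninferior, threshold,
                         probability, rng):
    """Decide whether to generate buffered candidates at a merging point.

    feasible:        True if the short-circuit criterion holds at this point
    num_noninferior: current number of noninferior candidate solutions
    threshold:       insert whenever the solution count is below this value
    probability:     otherwise insert with this user-set probability
    rng:             random.Random instance (seeded for reproducibility)
    """
    if not feasible:
        return False
    return num_noninferior < threshold or rng.random() < probability

rng = random.Random(0)
few_solutions = should_insert_buffer(True, 3, 5, 0.0, rng)
many_solutions = should_insert_buffer(True, 10, 5, 0.0, rng)
always_random = should_insert_buffer(True, 10, 5, 1.0, rng)
infeasible = should_insert_buffer(False, 3, 5, 1.0, rng)
```

Raising the threshold or the probability explores more buffered candidates at the cost of runtime, which is the efficiency versus quality tradeoff noted above.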
VI. EXPERIMENTAL RESULTS
All algorithms are implemented in C++ and the experiments are performed on a PC with a 3.2-GHz processor and 1 GB of memory. We generated different testcases with the number of sinks ranging from 5 to 25, which are typical sizes for signal nets in reality. Without loss of generality, we let the source be at coordinate (0,0). In some cases, all of the sinks are in one quadrant, while other cases have sinks distributed in up to four quadrants. For example, in the data tables, the notation "15 s, 2 Q" means there are 15 sinks and they are distributed in two quadrants. The 70-nm technology parameters reported in [16] are used. We compare the following methods.
• AHHK: This is a Steiner tree heuristic [1] which can achieve different area-radius tradeoffs by varying a parameter $c$. When the value of $c$ is shifted from 0 to 1, the resulting tree gradually changes from chain-like to star-like topology [1]. Although it is not directly timing driven, we can achieve very good timing performance by trying different values of $c$ and choosing the result with the best slack. We tested AHHK trees with $c = 0, 0.5, 1$ in the experiments.
• AHHK+detour: If there is short path violation, the edge incident to the early critical sink is elongated to increase the delay till the early slack is close to the late slack, so that the overall slack is maximized.
• AHHK+link: Inserting links greedily as described in Section IV-A. This method is similar to [2] .
• Steiner Network: The dynamic programming based unbuffered Steiner network construction proposed in Section IV-C.
• Buffered tree: The Steiner network construction method proposed in Section V-C where "link insertion" is disallowed.
• Buffered tree+link: This is to investigate whether simply performing link insertion to buffered trees may result in good routing topologies. A Buffered tree+link is obtained by performing a greedy link insertion heuristic in a Buffered tree, i.e., in the bottom-up order, for each subtree rooted with a buffer, a link is inserted if it achieves the maximum slack gain.
• Buffered AHHK+link: This is to compare with optimal buffering followed by greedy link insertion. We start with AHHK tree and perform optimal timing-driven buffer insertion to it. Greedy link insertion is then performed to the resulting buffered tree. Note that Buffered tree is built by simultaneous tree construction and buffer insertion from scratch, while Buffered AHHK is obtained from buffering an AHHK tree.
• Buffered Steiner Network: The dynamic programming based buffered Steiner network construction proposed in Section V-C.
In our experiments, to make our results and analysis more general, the timing constraints for nets range from a few dozen picoseconds to several thousand picoseconds, and thus they fit designs with clock cycle times ranging from several hundred picoseconds to several nanoseconds.
A. Cases With Single Critical Sink
Our experiments are performed on a large set of nets and in this section, we choose the following nets to present our results and analysis. They are 15 nets with 5, 10, 15, 20, and 25 sinks and sinks in 1 quadrant, 2 quadrants, and 4 quadrants. Each net has a single critical sink which is often on the long path. Therefore, wire detour is rarely necessary here. In Table I , we compare AHHK and AHHK+link on 10 cases among the 15 nets where links are indeed inserted. The average results in the last row show that link insertion can improve slack by about 428 ps with about 15% increase on wirelength. The link insertion can also achieve about 37% two-connected wires, i.e., about 37% of the wires are tolerant to open faults. The comparison between our constructive Steiner network heuristic and AHHK+link is made in Table II for the entire 15 nets. For AHHK+link, we pick the results of with the best timing slack. Among multiple solutions generated by the constructive heuristic, we report the solution with the best slack and largest wirelength, which can improve the slack from 21 to 215 ps on average according to the last row of Table II . By this, we achieve 7%-18% improvement in the ratio of slack to the timing constraint. The wire increase of our Steiner network heuristic is only 6% over the AHHK+link results.
The dynamic programming based Steiner network construction can generate a set of solutions with different slack-wirelength tradeoff. This is confirmed by a plot in Fig. 9 which is obtained from a 25-sink net.
As routing blockage is an important practical issue, we also handle blockages in our Steiner network heuristic. The algorithms (including Steiner network and AHHK) are the same as before, except that during the generation of candidate solutions, whenever an edge passes through a blockage, it is rerouted. To do so, we first identify the intersection of the edge and the blockage, and then reroute the edge along the boundary of the blockage. In the experiments, we randomly place blockages whose total area is 20% of the area of the smallest bounding box of each net, which is similar to the blockage generation in [17]. The results are summarized in Table III. Clearly, the Steiner network heuristic still performs better than AHHK+link when blockages are handled.
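The rerouting step can be sketched as follows for the simplest situation. The sketch assumes a horizontal edge, an axis-aligned rectangular blockage, and endpoints outside the blockage; the function names and the choice of detouring along the nearer horizontal boundary are illustrative assumptions rather than a description of our implementation.

```python
def reroute_horizontal(x1, x2, y, rect):
    """Return a rectilinear polyline from (x1, y) to (x2, y) that avoids the
    interior of rect = (xl, yl, xh, yh).
    Assumption: both endpoints lie outside the blockage interior."""
    xl, yl, xh, yh = rect
    lo, hi = min(x1, x2), max(x1, x2)
    # The edge misses the blockage interior: keep it unchanged.
    if not (yl < y < yh) or hi <= xl or lo >= xh:
        return [(x1, y), (x2, y)]
    # The edge enters and leaves the blockage at its left and right sides.
    enter_x, exit_x = (xl, xh) if x1 < x2 else (xh, xl)
    # Follow whichever horizontal boundary is closer, to limit extra wire.
    y_bound = yh if (yh - y) <= (y - yl) else yl
    return [(x1, y), (enter_x, y), (enter_x, y_bound),
            (exit_x, y_bound), (exit_x, y), (x2, y)]

def polyline_length(points):
    """Manhattan length of a rectilinear polyline."""
    return sum(abs(bx - ax) + abs(by - ay)
               for (ax, ay), (bx, by) in zip(points, points[1:]))

# Hypothetical example: the edge at y = 5 crosses a blockage spanning y in [2, 6].
path = reroute_horizontal(0, 10, 5, (3, 2, 7, 6))
extra_wire = polyline_length(path) - 10
```

A vertical edge can be handled by swapping the roles of the two coordinates. The added wirelength is twice the distance from the edge to the chosen boundary, which is why the nearer boundary is preferred in this sketch.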
Monte Carlo simulations (5000 runs for each result) are performed to observe the behavior of these algorithms under process variations. We consider variations in wire width, sink capacitance, and driver resistance, each of which is assumed to follow a Gaussian distribution with standard deviation equal to 5% of its nominal value. The comparison between AHHK trees and AHHK+link results is in Table IV. The mean values of the slacks are about the same as the deterministic results in Table I. On average, AHHK+link can reduce the standard deviation of slack by about 10% and increase timing yield from 1.6% to 21.2%. The timing yield refers to the probability of nonnegative slack. The data in Table V indicate that our constructive method can reduce the standard deviation by a further 10% and improve the timing yield from 61% to 100% compared to AHHK+link.
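The timing yield estimate can be illustrated with a minimal sketch. It assumes a single driver, one uniform wire, and one sink under the Elmore delay model, with wire resistance inversely proportional to width and wire capacitance proportional to width; all electrical values and function names below are hypothetical and only the 5000 runs and the 5% Gaussian standard deviation come from the setup described above.

```python
import random
import statistics

def elmore_delay_ps(r_drv, width, c_sink, length_um=5000.0,
                    r_unit=0.1, c_unit=0.2):
    """Elmore delay (ps) of driver -> uniform wire -> sink.
    Assumed units: ohm, fF, um; width is relative to nominal (1.0).
    Assumption: wire R scales as 1/width and wire C scales as width."""
    r_wire = r_unit * length_um / width          # ohm
    c_wire = c_unit * length_um * width          # fF
    delay_fs = r_drv * (c_wire + c_sink) + r_wire * (c_wire / 2.0 + c_sink)
    return delay_fs / 1000.0                     # ohm * fF = fs

def monte_carlo_yield(constraint_ps, runs=5000, sigma=0.05, seed=0):
    """Return (mean slack, slack standard deviation, timing yield)."""
    rng = random.Random(seed)
    slacks = []
    for _ in range(runs):
        r_drv = rng.gauss(100.0, sigma * 100.0)  # driver resistance
        width = rng.gauss(1.0, sigma * 1.0)      # relative wire width
        c_sink = rng.gauss(20.0, sigma * 20.0)   # sink capacitance
        slacks.append(constraint_ps - elmore_delay_ps(r_drv, width, c_sink))
    # Timing yield: fraction of samples with nonnegative slack.
    timing_yield = sum(s >= 0 for s in slacks) / runs
    return statistics.mean(slacks), statistics.stdev(slacks), timing_yield

nominal_delay = elmore_delay_ps(100.0, 1.0, 20.0)
mean_slack, std_slack, timing_yield = monte_carlo_yield(constraint_ps=380.0)
```

The same loop applies to a tree or non-tree network once the delay function is replaced by the corresponding delay evaluation. A solution can raise the yield either by increasing the mean slack or by reducing its standard deviation, which are the two effects reported in the tables.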
To better capture the impact of process variation on circuits, the model proposed in [18] is also used to perform Monte Carlo simulations. It is a first-order approximation and is able to capture the major component of the variations [18]. As an example, we consider two sources of variation for each gate, namely gate length and threshold voltage. Note that other sources of variation can be easily incorporated into the model. For each gate, the input capacitance is modeled as in (16), which is expressed in terms of the nominal value, the sensitivities of the gate capacitance to gate length and to threshold voltage, random variables representing the variations in gate length and threshold voltage, and random variables representing the local and global variations. We model the gate driving resistance similarly.
TABLE V. MONTE CARLO RESULTS CORRESPONDING TO TABLE II
TABLE VI. MONTE CARLO RESULTS CORRESPONDING TO TABLE II USING THE MODEL IN [18]
As in [19], to handle spatial correlation, a grid is overlaid on the circuit and each gate belongs to one indexed grid cell. Taking spatial correlation into account, we obtain (17), whose coefficients are parameters of the model. The random variables in this model are mutually independent and can be obtained by applying principal component analysis (PCA) to the originally correlated random variations. Refer to [19] for the details. With this new model, the Monte Carlo simulation results are summarized in Table VI. These results confirm that our Steiner network method significantly improves the timing yield compared to AHHK+link.
The proposed Steiner network heuristic can be easily incorporated into existing global routers. As an example, we incorporate it into Labyrinth [20]. The algorithm is as follows. The global routing is first computed and timing analysis is performed on the resulting circuit. Signal nets along the timing critical path are then identified and rerouted using the proposed Steiner network heuristic. We apply this algorithm to the IBM-PLACE benchmark circuits, which are converted from the ISPD'98 circuits, and the results are summarized in Table VII. Since the original circuit details (e.g., gate types) of the IBM-PLACE benchmarks are not known to us, the circuit information (e.g., gate resistance and capacitance) is randomly generated. From Table VII, one sees that the Steiner network heuristic can still improve the timing yield. In addition, since it is applied only to the nets along the timing critical path, the additional wirelength is quite small and the additional runtime is not large.
B. Other Unbuffered Cases
We also tested the algorithms on cases with multiple critical sinks, i.e., cases in which several sinks of a net may have similar timing criticality. To see the effect on fixing short path delay constraint violations, these testcases usually have tighter constraints on the short path than on the long path. The wire detour method can increase the delay to the early critical sink, but at the cost of increasing the long path delay. In contrast, our approach can increase the short path delay and reduce the long path delay simultaneously. Moreover, unlike a non-tree topology, wire detour provides no tolerance to open faults. The results in Table VIII show that our Steiner network heuristic can improve the slack by about 80 ps on average compared to performing wire detour on existing trees. The wirelength increase due to our method is about 4% with respect to the wire detour results. We also ran experiments on cases without a delay lower bound, which are the same as conventional timing-driven routing. Refer to Table IX for the results. Our Steiner network results improve the slack by 125 ps with only a 5% increase in wirelength.
In Fig. 10, we plot the AHHK tree, the AHHK+link result, and our Steiner network for comparison, where nonrectilinear edges are "soft edges" [21] that can be realized as rectilinear edges when necessary. These plots are for the "15 s, 1 Q" case with multiple critical sinks. In this case, the slack and wirelength are 190 ps and 9206 μm for AHHK, 375 ps and 10285 μm for AHHK+link, and 497 ps and 10281 μm for the Steiner network. Clearly, compared to AHHK+link, the Steiner network improves slack by more than 100 ps (497 ps versus 375 ps) while using slightly less wire. Note that non-tree routing could introduce more congestion, which sometimes makes the circuit more difficult to route, forces routes with more detours, or causes signal integrity issues. These effects could get worse if the routing is performed in a small region. These are the drawbacks of our approach.
C. Handling Buffers
For randomly generated test cases, the results of Buffered Steiner network construction are summarized in Table X. We compare the results with Buffered tree and Buffered tree+link. For fair comparison, we use the sum of wire capacitance and buffer input capacitance as the metric to evaluate different topologies. For each case in Table X, two solutions at the driver are reported. One is the best slack ("B.S.") result, which is the one with the maximum slack, and the other is the best capacitance ("B.C.") result, which is the one with positive slack and minimum capacitance. Note that "B.C." results for Buffered tree+link are not shown since they are the same as the "B.C." results for Buffered tree.
TABLE XI. MONTE CARLO RESULTS CORRESPONDING TO TABLE X
TABLE XII. COMPARISON BETWEEN BUFFERED AHHK+LINK AND BUFFERED STEINER NETWORK FOR THE TESTCASES WITHOUT LOWER DELAY BOUND
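The selection of the two reported solutions can be sketched as follows, assuming the candidate solutions at the driver are available as (slack, capacitance) pairs. The `Solution` type, the function names, and the candidate values are hypothetical and serve only to illustrate the two definitions.

```python
from typing import List, NamedTuple, Optional

class Solution(NamedTuple):
    slack_ps: float   # timing slack of the candidate at the driver
    cap: float        # wire capacitance plus buffer input capacitance

def best_slack(solutions: List[Solution]) -> Solution:
    """B.S.: the candidate with the maximum slack."""
    return max(solutions, key=lambda s: s.slack_ps)

def best_capacitance(solutions: List[Solution]) -> Optional[Solution]:
    """B.C.: the minimum-capacitance candidate among those with positive slack.
    Returns None when no candidate has positive slack."""
    feasible = [s for s in solutions if s.slack_ps > 0]
    return min(feasible, key=lambda s: s.cap) if feasible else None

# Hypothetical candidate set for one net.
candidates = [Solution(-30.0, 2.1), Solution(45.0, 2.6),
              Solution(120.0, 2.9), Solution(510.0, 3.4)]
bs = best_slack(candidates)
bc = best_capacitance(candidates)
```

In this toy set the cheapest candidate violates timing, so the B.C. result is the next cheapest one, while the B.S. result is the most expensive candidate. This mirrors the slack-capacitance tradeoff between the two reported columns.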
From Table X, one can see that the Buffered Steiner network significantly outperforms the Buffered tree topology in terms of slack (on average 510 ps versus 188 ps for B.S. and 111 ps versus 49 ps for B.C.) while introducing only slightly more capacitance (on average 2.72 versus 2.61 for B.C.). Compared to Buffered tree+link, the Buffered Steiner network obtains a 100-ps slack improvement on average with only slightly more capacitance. These results demonstrate the advantages of the buffered Steiner network routing topology.
Monte Carlo simulations (5000 runs for each result) are performed to observe the behavior of the algorithms under process variations. Wire width, sink/buffer capacitance, and driver/buffer resistance variations are assumed to follow a Gaussian distribution with standard deviation equal to 5% of the nominal value. The comparison results are summarized in Table XI. The mean values of the slacks are about the same as the deterministic results in Table X. On average, Buffered tree+link can reduce the standard deviation of slack by about 6% and increase timing yield from 91% to 100% for the B.S. case. Our constructive method can reduce the standard deviation by about 20% and improve the timing yield from 77% to 95% for the B.C. case compared to Buffered tree.
We also compare the proposed Buffered Steiner network with Buffered AHHK+link. Recall that Buffered AHHK+link is obtained by performing optimal (in terms of slack maximization) buffer insertion on an AHHK tree, followed by link insertion. Since the existing buffering algorithm only handles the long path constraint, the comparison is made on the testcases without a lower delay bound. For the Buffered Steiner network, the best slack results are chosen for comparison. Refer to Table XII for the comparison. Because our Buffered Steiner network builds solutions from scratch (with an aggressive solution pruning approach), it generally outperforms Buffered AHHK+link in terms of slack and variance.
VII. CONCLUSION
This paper investigates timing-driven routing using non-tree topology. We propose Steiner network construction heuristics which can generate either tree or non-tree topologies with different slack-wirelength tradeoffs, and handle both long path and short path constraints. We also propose heuristics for simultaneous Steiner network construction and buffering. Furthermore, incremental non-tree delay update techniques are developed to facilitate fast Steiner network evaluations. Experimental results show that our approach is very promising in improving timing slack and handling both long path and short path constraints.
We note the following recommendations for designs adopting non-tree routing. First, non-tree routing should be applied only to global nets. Second, after performing statistical timing analysis, the timing critical nets are identified, and non-tree routing is applied to those nets. Third, one may also perform statistical sensitivity analysis as proposed in [22] and apply non-tree routing to the nets with high sensitivity to variations. In the future, we will design non-tree routing methods using a more accurate delay model.
