We present the first efficient approach to global routing that takes spacing-dependent costs into account and provably finds a near-optimum solution including these costs. We show that this algorithm can be used to optimize manufacturing yield. The core routine is a parallelized fully polynomial approximation scheme, scaling very well with the number of processors. We present results showing that our algorithm reduces the expected number of defects in wiring by more than 10 percent on state-of-the-art industrial chips.
INTRODUCTION
Because of the huge size of VLSI routing instances, a global routing step is usually performed before detailed routing, defining an area for each net to which the search space will be restricted. In its simplest form, the global routing problem amounts to packing Steiner trees in capacitated undirected graphs. Traditionally, the main objective was netlength minimization. However, with the decrease of feature sizes, effects depending on wire spacing become more and more important: First, manufacturing yield improves with a better spreading of wires, and second, coupling capacitance becomes increasingly important for power consumption and signal propagation delay. On the other hand, spacing usually can be increased only at the expense of netlength, which also has a negative effect on yield, power and timing. Recent papers on global routing (Xu et al. [13] , Jing et al. [7] , Ho et al. [5] ) propose heuristics partially adressing this tradeoff, but there is no approach yet that gives any performance guarantee (see [6] for a survey).
Deciding if a global routing instance has a feasible solution or not is NP-complete even when restricted to two-terminal nets which have to be routed in planar grid graphs with unit capacities [9] . Raghavan and Thompson [10] proposed to solve an LP relaxation first and then apply randomized rounding to obtain an integral solution whose maximum violation of the packing constraints can be bounded. However, given today's huge instance sizes, even solving the LP relaxation exactly is far too slow in practice. Therefore approximation algorithms are interesting. Shahrokhi and Matula [11] proposed the first fully polynomial approximation scheme for multicommodity flows, which then was applied to global routing by Carden, Li and Cheng [2] . Garg and Könemann [4] presented a simpler and more efficient approximation scheme, modified and applied to global routing by Albrecht [1] .
Recently, Vygen [12] proposed the first global routing algorithm taking spacing-dependent costs into account, guaranteeing to find a solution with almost optimum total cost. However, this is a purely theoretical work that does not consider manufacturing yield, and the algorithm cannot be parallelized efficiently. In this paper, we describe an enhanced and parallelized version of the algorithm that efficiently optimizes manufacturing yield and scales very well with the number of processors. We also present results on a number of state-of-the-art VLSI designs from industry, showing that the expected number of defects in wiring can be reduced by more than 10 percent on average.
This work is organized as follows. In section 2, a formal description of the global routing problem with spacingdependent costs is given. Section 3 presents a parallelized fully polynomial approximation scheme for efficiently solving a fractional linear programming relaxation of this problem. Section 4 briefly describes the randomized rounding step applied to obtain an integral solution. Section 5 introduces the defect model that we assume for measuring defect sensitivities and shows how yield can be optimized with our algorithm. In section 6, we present results on several current industry designs which demonstrate the yield optimization capabilities of our algorithm and show that indeed our algorithm is parallelized very efficiently. We conclude the paper with some discussion and outlook on future work.
THE GLOBAL ROUTING PROBLEM
This work neither assumes a routing grid, nor is it limited to Manhattan routing, so also diagonal wiring is possible. We construct a global routing graph by partitioning the chip area into regions (forming the vertices of the graph) and joining adjacent regions by an edge. Each edge is assigned a capacity value that indicates how many wires of unit width can join the two regions such that all design rules are met.
In the following, let G be the global routing graph, with edge capacities u : E(G) → R+. Let N be the set of nets to be routed. For each pair (e, N ) ∈ E(G) × N we are given a constant we,N which denotes the width of a wire of net N using edge e. This allows for choosing wire types not only depending on the net, but also plane-or even regiondependent. We also have a cost interval [c Now, for each net N ∈ N we construct a global routing net as follows. First, for each pin we determine the set of vertices of G corresponding to regions intersected by its shapes. Then, iteratively, we merge each pair of sets with nonempty intersection until no such pair exists any more. Each of the resulting sets is called a global routing pin of N .
The task now is to find a connected subgraph YN of G for each N ∈ N , intersecting all of its global routing pins, and extra space assignments such that edge capacities are respected and total cost is minimized. It is possible to restrict the set YN of feasible subgraphs YN of G for net N , e. g. to Steiner trees or, for timing critical nets, to Elmoredelay-optimal Steiner trees. However, this is not important for our approach. Note that YN does not need to be specified explicitely, but can be given implicitely by an oracle function returning a cost optimal element from this set.
Additionally, cost bounds can be assigned to subsets of N , so we are given a family M of subsets of N with bounds U : M → R+ and weights σ(M, N) ∈ R+ for N ∈ M ∈ M. We require that N ∈ M.
Let ∆e,N := c max e,N − c min e,N . We can formulate the global routing problem more formally now: the task is to find an YN ∈ YN and numbers 0 ≤ ye,N ≤ 1 for each N ∈ N and e ∈ E(YN ) (denoting the fraction of possible extra space se,N to be assigned to net N on edge e), such that X
for M ∈ M, and such that X
is minimum.
FRACTIONAL GLOBAL ROUTING
Since it is hard to optimize the integral version of the global routing problem, we start with a fractional linear programming relaxation.
LP Formulation
We first observe that it is sufficient to find a feasible solution as we can impose an arbitrary upper bound U (N ) on (3) and perform binary search to find the optimum value of U (N ), in each step solving a linear program that maximizes the minimum slack in (1) and (2), followed by randomized rounding (cf. section 4).
Our experiments show that binary search is not needed in practice since good lower bounds on (3) can be computed and the algorithm comes close to these bounds within a few percent on all of our test cases, so it suffices to solve one linear program. In either case it is worthwile to change the objective from minimizing (3) to maximizing the minimum slack in (1) and (2) because this can considerably improve running time of detailed routing and in many cases increasing slacks in the cost bounding inequalities is desirable.
So we approximately solve the following linear program:
The corresponding dual LP is:
The dual LP enables us to compute a lower bound on the optimum LP value: 
where
µM is a lower bound on the optimum LP value.
Proof. Set χe,N := max˘0, ∆e,N γN − se,N ωe¯, and
Then we get a feasible solution (z, µ, ω, χ) of (5) after dividing all variables by
Parallel Fractional Global Routing Algorithm
Now we are ready to present the approximation scheme for approximately solving (4) and (5). It is a primal-dual algorithm that takes four parameters 0 < , 1, 2 < 1 and t ∈ N, which control the approximation guarantee and running time of the algorithm, and a parameter Π ∈ N specifying the number of processors to be used in parallel. Furthermore, we assume to have a lower bound LN on
of any Steiner tree Y ∈ YN (N ∈ N ). The quality of this lower bound is important for obtaining good running times in practice. Often a nontrivial lower bound can be computed in constant time using distances between the terminals. We have implemented a shared memory multithreaded version of the algorithm, but also implementations using processes distributed on different machines should be quite efficient because the need for communication between processes is very low.

Input:
An instance of the Global Routing Problem, t ∈ N, , 1, 2 ∈ R+, and Π ∈ N.
Output: A feasible solution to (4).
x Set αe := 0 and ωe = Let ψe,N be defined as in (6) .
If ∆e,N γN < se,N ωe then δ := 0 else δ := 1.
Procedure CollectVariableUpdates:
For each e ∈ E(G) do: For π := 1 to Π do:
For π := 1 to Π do:
Procedure ResetVariableUpdates:
Let us call one iteration of the outer loop (step y of the algorithm) a phase. While the previous algorithms ( [1, 12] ) work sequentially, routing the nets in a certain predetermined order and updating the dual variables after each completion of a net, our algorithm does not need an ordering of the nets. It turns out that for the analysis of the algorithm it is not important if updating the µ, ω and z variables is done immediately or delayed until the end of the current phase. Delaying updates drastically reduces interdependence between threads since changes of variables that are accessed by more than one thread occur only at the end of a phase. This makes it possible to implement the algorithm using almost no locking and scaling very well with the number of processors used. Besides that, aggregating and delaying the updates of the z variables until the end of the current phase eliminates the innermost loop of the algorithm in [12] which easily dominated running time. Therefore not only parallelization becomes more efficient, but also total CPU time is reduced.
The following theorem shows that the presented algorithm is a fully polynomial approximation scheme for solving the fractional global routing problem:
such that with parameters , 1, 2 and t the presented algorithm finds a (1 + 0)-optimal feasible solution to (4) .
Proof. Our algorithm makes use of the α and β variables to record each thread's contribution to dual variable updates. At the end of a phase, the contributions from each thread are summed up in order to compute new values for the ω, µ and z dual variables. With some modifications, the analysis of the algorithm in [12] can be made to work also for our algorithm with delayed updates of dual variables: Clearly, estimating the increase of the µ and ω variables in inequalities (9) and (10) of [12] does not need to be done net by net. If all nets are considered at once, nothing changes except that the bound on the increase has to be expressed in terms of dual variable values at the beginning of the current phase. This, however, does not hurt in the continuation of the proof. Table 1 shows that our parallel algorithm scales very well with the number of processors: We did experiments with two large 130 nm chips (with 2.5 million and 2.8 million nets, respectively) on an IBM S85 machine with 96 GB memory and 18 CPU's running at 600 MHz. This machine is slower than a current Opteron machine by a factor of 3.5 to 4, so much better running times can be obtained on current machines. We routed the chips using 1, 4, 8 and 16 threads in parallel. The corresponding speedups in some cases are even slightly higher than the number of processors. This is probably due to caching effects.
Future cost
Finding near-optimum Steiner trees in the RouteNets procedure is the core routine in our algorithm that consumes most running time. Since all edge weights are nonnegative, we can apply the algorithm of Dijkstra for two-terminal nets. For nets with more than two terminals, we first find geometrically optimum Steiner trees and then perform a series of path searches connecting the terminals and Steiner points to each other, also applying Dijkstra's algorithm. Alternatively, we can use the algorithm of Dreyfus and Wagner [3] to find optimum Steiner trees directly (note that this algorithm in fact is a generalization of bidirectional search).
In either case, an important factor speeding up the algorithm is the use of a future cost estimate, which is a lower bound of the distance of vertices to a given set of target vertices. Suppose we look for a path from s to t in an undirected graph G with vertex set V , edge set E and edge lengths c : E → R+. Let l(v) be a lower bound on the 
Balancing geometrical and congestion parts of edge costs
The edge costs ψe,N for net N using edge e defined in (6) consist of a term`c max e,N − δ∆e,N´γN (δ ∈ {0, 1}) which for reasonable cost measures is approximately proportional to the geometrical length of the edge, and a second part we,N + δse,N´ωe that captures congestion costs. Of course, future cost can consider only the first part (with δ = 1).
Typically (depending on the size of the regions that form the vertices of the global routing graph) edge capacities are not much larger than 100 to 200 units. On the other hand, often a net N has large minM:N∈M∈M |M |. Therefore, since ωe := (M ∈ M) after initialization, edge costs are dominated by the congestion part in the first phases, even in regions with very low capacity utilization. This means that future cost becomes very poor and leads to high running times.
Fortunately, we can balance the geometrical and congestion parts of the edge costs without actually changing the routing instance: It suffices to duplicate the capacitance constraint (2) for the group N of all nets a certain number of times in the primal LP (4). This does not change the linear program, and is equivalent to a multiplication of the dual variable µN . In this way, the geometrical edge cost parts get higher weight without adding complexity to the algorithm, resulting in better future cost and running times. In our implementation, we choose µN such that, taking the average over all edges, geometrical and congestion parts of the edge costs are equal after initialization in step x .
RANDOMIZED ROUNDING
To obtain an integral solution, we use randomized rounding as in [12] . Let (x, y, λ) be a fractional solution to the primal LP. To obtain an integral solution (x,ŷ,λ), we do the following. otherwise. Finally, we chooseλ minimum possible such that (x,ŷ,λ) is a feasible solution to (4). We have: . 2
In our experiments, only very few upper bounds are violated after randomized rounding, and only a few ripup and reroute steps are needed to correct them.
YIELD OPTIMIZATION
Critical area analysis as in [8] was used to evaluate the yield optimization results of the algorithm presented in this paper. Only the chip-level routing was considered, but not the internal structure of custom macros and library cells.
In this approach, we are given a defect size distribution provided by manufacturing. In our case, the distribution is f (r) :=  0, r < r0 c r 3 , r ≥ r0 for some r0 ∈ R+ smaller than the smallest possible particle that can cause a fault, and c such that
f (r)dr = 1. Then the critical area w. r. t. short-defects on plane z is
f (r) dr dy dx, (7) where t short (x, y, z) is the smallest size of a particle that causes a short-defect at location (x, y, z), and κ z short is a weighting factor provided by manufacturing that encodes the relative probability of short defects on plane z. Similarly,
f (r) dr dy dx, (8) where topen(x, y, z) is the smallest size of a particle that causes an open-defect at location (x, y, z), and κ z open is provided by manufacturing.
For simple situations like a straight wire without neighbours or straight wires running in parallel, these values can be computed analytically. For yield optimization, we can compute spacing-dependent costs as follows: For a wire of width w in plane z surrounded by neighbouring wires at distance d on both sides, we define the contribution of this wire to critical area by integrating (7) and (8) over the Voronoi region of this wire, i. e. the set of points in plane z where this wire is the closest object. Ignoring wire ends and assuming r0 < min{ 
« . To compute the expected number of faults within chiplevel wiring of a complete VLSI chip, we use a Monte Carlo dot-throwing approach as in [8] to determine the planewise critical area values, sum up these values and finally multiply the result by chip area.
RESULTS
We have run the algorithm presented in this paper on a number of current VLSI designs from industry. Table  2 shows our testbed, consisting of three 90nm designs, six 130nm designs, and one 180nm design. All designs use horizontal and vertical wiring only and have 6-8 routing planes.
In While this already demonstrates the huge potential of yield optimization in global routing, our main focus is on quality of results after detailed routing. We compare our results after detailed routing to those obtained by using an implementation of Albrecht's global routing approach [1] which focuses on netlength and congestion only and is a two-dimensional approach, which means that all horizontal and vertical planes, respectively, are merged to a single plane, resulting in a smaller global routing graph. The twodimensional global routing step is followed by a heuristic plane assignment step which tries to assign nets to routing planes which are good from a yield perspective. We summarize the results of this implementation in the "2D-GR" columns in tables 5 to 8.
In our approach, we have to see the different plane characteristics, so we have a three-dimensional global routing graph. This costs running time in global routing, but on the other hand can save time in detailed routing since search space can be restricted also in z dimension. The results obtained when using our global routing algorithm are shown in the "3D-GR" columns. All yield optimization experiments have been done with the κ1 weights from above.
The results in tables 5 to 8 have been produced on an Opteron machine with 2.6GHz and 64 GB memory, using a single processor. Table 5 compares global routing running times of the different approaches. Compared to Albrecht's approach, our algorithm needs considerably more CPU time due to the larger global routing graph. However, Albrecht's approach cannot be parallelized efficiently, but our runtimes can be cut down drastically by parallelization, as was demonstrated in table 1. Using eight processors on a current machine, all chips in our testbed could be routed in less than two hours, both with netlength and with yield optimization. Table 5 also shows that our algorithm is considerably slower if yield is optimized instead of netlength. There are two reasons for this: First, our experiments show faster convergence for netlength optimization than for yield optimization. In the first case, good results are obtained already after 10 to 20 phases, while for yield optimization typically 40 to 50 phases are needed (however, this is still much better than what can be guaranteed theoretically, cf. [12] ). The second reason is that due to the differences between defect sensitivities on different routing planes, future cost quality degrades significantly: Future cost must always be a lower bound, but not all nets can be routed on the best possible plane, of course. Also in detailed routing, future cost is essential for obtaining good running times. Future cost computation in our detailed router does not take into ac- count detours prescribed by global routing, so path search takes more time if global routing has been done with the objective of optimizing yield because many nets are routed with some detours to obtain better spreading. This is also reflected in table 6, and table 7 shows that also the number of vias goes up significantly when optimizing yield. This seems contrary to the fact that vias generally are considered harmful for manufacturing yield, but it turns out that with the open and short defect weights used in our experiments, it is worthwhile to spend extra vias in order to put more wiring on higher planes. Table 8 shows the estimated number of faults per chip in the wiring (including vias) after detailed routing. Note that the results in the first column are better than in the second one because of the heuristic plane assignment step. The third column shows that our approach, which provably finds a near-optimum solution, can again drastically improve results compared to the first column, in average by more than 10 percent.
Note that exactly the same detailed router with the same parameters was used for all our experiments. By adapting the detaild router optimally one can certainly improve the yield even further.
CONCLUSION
We have presented an enhanced global routing approach and have shown how to use it for yield optimization. We have run it on several state-of-the-art VLSI routing instances from industry and showed that our algorithm reduces the expected number of faults significantly, in average by more than 10 percent. We can use it to considerably improve yield. Our algorithm is parallelized very efficiently and can route even the largest chips in our testbed within less than two hours.
Our algorithm can also be used to optimize power consumption and to limit capacitance on critical paths in order to avoid timing violations after routing. Further experiments have to be done to demonstrate these capabilities on VLSI routing instances.
Moreover, we did not consider at all yield optimization in detailed routing in this paper. Certainly detailed routing offers further potential for significant yield improvements.
ACKNOWLEDGEMENT
First of all, I would like to thank Jens Vygen for many helpful discussions and suggestions. I also want to thank the anonymous referees for their very useful comments.
