Given the enormous amount of detailed geometry information, the large number of local nets, and the ability to properly partition a design, routing has been studied thoroughly as a target for intensive parallelism. In this paper we first discuss how to divide the routing space into regions that reduce runtime and memory usage without sacrificing the quality of the results. We then cover scheduling among the routing regions, because scheduling determines the effectiveness of parallel routing. We consider locking, quality of results, and scaling for scheduling in a multithread environment. Experiments show good routing quality with significant speedup in detailed routing.
INTRODUCTION
Parallel computing has been studied for a while [4]. In essence, it divides a program into subtasks and runs those subtasks in parallel. There are two kinds of environments: distributed and multithread [9]. Distributed computing shares file systems among processors, while multithread computing shares memory. In this paper we discuss routing in a multithread environment.
Routing requires consideration of the most detailed level of physical information, e.g., wires, vias, and cell geometries. This level of detail leads to large memory usage and long runtime. Reducing both memory usage and runtime is therefore an important goal of routing.
In global routing, the chip is first divided into global cells of equal size. The global router then performs topology routing on the global cells for each net. In detailed routing, the routing area is divided into partitions, and each partition is routed sequentially. Both methods reduce memory usage significantly because most of the physical information is loaded only while the global cell or partition is being routed.
As chip size increases, the number of global cells and partitions increases accordingly, and runtime increases with them. To further improve the runtime of routing, parallel routing of blocks is desirable. Moreover, most routing algorithms are superlinear in their complexity; thus breaking the problem into smaller subproblems reduces the runtime. However, partitions can only be routed in parallel under certain constraints. For example, overlapping partitions or global cells should not be routed in parallel because conflicting routes might result in the overlapping area. The scheduling of the routing regions has a big impact on the quality of routing, and it also interacts with the parallel routing constraints. As a result, routing region scheduling has a deciding effect on the parallelism.
* This research is partly supported by Synopsys, Inc.
ASICON'03, October 21-24, 2003, Beijing, PRC
There have been different parallel routing investigations in the past. Some [2, 3, 11] focused on the global routing problem, others [1, 10, 13, 14] developed algorithms specific to the hypercube, while another subset [5, 6, 7, 8, 12, 15] have concentrated on the detailed routing problem.
We study the routing regions (detailed routing partitions and global cells) and the scheduling problem, with the objective of achieving quality of results comparable to that of single-processor algorithms while significantly reducing runtime. We focus on parallel routing algorithms for a symmetrical multiprocessor (SMP) environment. However, the general techniques described here are equally applicable to other parallel processing environments.
The remainder of the paper is organized as follows. In Section 2, we define the scheduling problem and related issues. In Section 3, we discuss in detail three important issues in multithreading: locking, routing quality, and scaling. In Section 4, we present one scheduling scheme that takes these three issues into consideration together. In Section 5, we present experiments and results.
PROBLEM DEFINITION
Formally, in two-dimensional array global routing of multiterminal nets, there is a set η = {N1, ..., Nn} of multiterminal nets. The layout environment (plane grid) is a two-dimensional m1 × m2 grid, a rectangular tessellation of the plane. Each k-terminal net N is specified by a k-tuple [(x1, y1), ..., (xk, yk)], where (xi, yi), 1 ≤ i ≤ k, are the global cells containing the terminals of N. The bounding box β of N is the smallest rectangular region that contains all k cells of the tuple. For example, three 3-terminal nets with their bounding boxes shaded are shown in Figure 1. In a global routing, a sequence of global cells through which the net passes is specified for each net. U denotes the set of bounding boxes that have not yet been routed.
Figure 1: Bounding Boxes
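As a minimal sketch of the bounding box definition above, assuming a net is given as a list of (x, y) global cell indices and a box is stored as inclusive corner indices (the names `Cell`, `Box`, and `bounding_box` are illustrative and do not come from the paper):

```python
from typing import List, Tuple

Cell = Tuple[int, int]            # (x, y) index of a global cell (assumed encoding)
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive


def bounding_box(net: List[Cell]) -> Box:
    """Return the smallest rectangle of global cells containing every terminal."""
    xs = [x for x, _ in net]
    ys = [y for _, y in net]
    return (min(xs), min(ys), max(xs), max(ys))


# A hypothetical 3-terminal net.
net = [(1, 4), (3, 2), (6, 5)]
print(bounding_box(net))  # (1, 2, 6, 5)
```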
Given n threads (thread[1], ..., thread[n]) on an SMP machine (each thread is assigned to a different processor), the scheduling problem for global routing consists of picking which β to route subject to a constraint (Figure 2).
for i = 1 to n do
    thread[i].start(threadRoute());
end for

threadRoute() {
    while U not empty do
        pick βi from U such that constraint(βi) satisfied
        Route(βi);
    end while
}

The constraint (Equation 1) can be defined as follows: βi does not overlap with any bounding box currently being routed.
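The flow of Figure 2 with the non-overlap constraint can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `parallel_route` and `overlaps`, the caller-supplied `route` callback, and the use of a condition variable to guard U are all choices made for this sketch. In standard CPython builds the global interpreter lock prevents a speedup for CPU-bound work, so the code only illustrates the control flow.

```python
import threading
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive cells


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one global cell."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def parallel_route(boxes: List[Box], route: Callable[[Box], None],
                   n_threads: int) -> None:
    unrouted: List[Box] = list(boxes)  # the set U
    active: List[Box] = []             # boxes currently being routed
    cond = threading.Condition()       # guards unrouted and active

    def thread_route() -> None:
        while True:
            with cond:
                while True:
                    if not unrouted:
                        return  # U is empty: this thread is done
                    # Pick a box that overlaps no box currently being routed.
                    candidate = next(
                        (b for b in unrouted
                         if not any(overlaps(b, a) for a in active)),
                        None)
                    if candidate is not None:
                        break
                    cond.wait()  # every remaining box conflicts; wait
                unrouted.remove(candidate)
                active.append(candidate)
            route(candidate)  # the routing itself runs outside the lock
            with cond:
                active.remove(candidate)
                cond.notify_all()  # a finished box may unblock waiting threads

    threads = [threading.Thread(target=thread_route) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Only the bookkeeping of U and the active list is serialized here; the call to `route` happens without holding the lock, which is what lets non-overlapping boxes proceed concurrently.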
A detailed routing area D is divided into a set of overlapping partitions {pi}, pi ⊆ D. Here U denotes the set of unrouted partitions. Note that we overlap adjacent partitions because this tends to produce higher quality results. For example, with non-overlapping partitions, routing up to the boundary of one partition might make it difficult to continue the route in the next partition. The lack of overlap limits the area available to navigate, and obstructions missed just outside the partition could render the solution suboptimal. We do not discuss the partition size, since that involves a trade-off between quality and memory and runtime that should be considered for each implementation.
As in Figure 3, we assume that the partitions form a grid pattern, with partitions having the same height if they share a row and the same width if they share a column. By substituting p for β, the multithread routing flow in Figure 2 can be applied to detailed routing.
The scheduling algorithm is defined by the constraint it uses. In detailed routing, if we enumerate the partitions in row-column order as in Figure 3, then we can define a simple scheduling constraint (Equation 2).
Under this constraint, we route partitions in a bottom-to-top, left-to-right order. This scheduling can only route one partition at a time, so it is undesirable for parallel processing. However, this scheduling scheme produces excellent routing quality (we explain why in Section 3.2), and we use it as the baseline in our quality comparisons for multiprocessor scheduling algorithms.
KEY ISSUES IN SMP ROUTING
In this section, we address three important scheduling issues for multiprocessor routing: locking, quality of result, and scaling. Locking is a method for preventing memory conflicts between processors in a shared-memory multiprocessor environment. Locking is often necessary, but it can lead to increased memory usage and runtime. We will show how scheduling of the bounding boxes β for global routing and the partitions p for detailed routing can be adjusted to reduce the amount of locking required. Quality of result (QoR) is another issue that must be considered when partitions are routed in parallel. Parallel routing algorithms often produce worse quality routes than single-processor sequential algorithms. We will show why this occurs and present a scheduling constraint that ensures high quality results. Finally, we will discuss scaling, which measures the speedup achieved by using multiple processors in detailed routing.
Locking
Locking is a fundamental issue in SMP algorithms. Since all processors can access the same shared memory, memory conflicts can occur. A lock is a data object that helps manage processor interaction. A lock can be added to a data structure so that a processor must acquire the lock before accessing the data and release it when it finishes. This prevents memory conflicts stemming from multiple processors trying to access the same data.
As an example of how memory conflicts can occur while routing: if we route two overlapping partitions or bounding boxes simultaneously, the two processors may interfere with each other by routing in the shared area at the same time.
For global routing, Equation 1 is enough to serve the locking purpose. However, if we simultaneously route two non-adjacent partitions in detailed routing that both intersect the same wire, then one processor may change the wire while the second processor is accessing it. This can cause the second processor to access memory improperly. One solution is to add locks to any data structures that are subject to contention among processors. However, there are consequences to using locks too freely. Locks incur memory and runtime overhead. Furthermore, they can affect runtime significantly when several processors contend for the same lock.
Since there can be millions of wires in a routed design we want to avoid adding locks at the wire level. One way to do this is to change the constraint in the partition scheduling. If we do not simultaneously schedule partitions that can both intersect the same wire, then we do not need to add a lock on the wires themselves. If we define the separation between two partitions, θ(pi, pj), to be the minimum horizontal or vertical distance between their boundaries, then we can describe our scheduling constraint as follows:
We have only discussed memory conflicts due to wires, but we can adjust the separation distance to apply to vias, obstructions, or other data structures that might cause contention. Under the constraint given by Equation 3, when one partition is being routed, we will not schedule another partition if it is in a nearby partition row or column. Figure 4 shows what this looks like for a typical case.
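The separation-based constraint can be sketched as follows. The sketch assumes axis-aligned rectangular partitions and reads θ as the smaller of the horizontal and vertical gaps, so that two partitions in the same row or column have separation 0. The threshold `min_sep` is a hypothetical parameter introduced here for illustration, and the names `separation` and `may_schedule` are likewise assumptions of this sketch.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def separation(p: Rect, q: Rect) -> float:
    """theta(p, q): the smaller of the horizontal and vertical gaps.

    A gap is 0 when the projections on that axis touch or overlap, so two
    partitions in the same row or the same column have separation 0.
    """
    dx = max(0.0, max(p[0], q[0]) - min(p[2], q[2]))
    dy = max(0.0, max(p[1], q[1]) - min(p[3], q[3]))
    return min(dx, dy)


def may_schedule(p: Rect, active: Iterable[Rect], min_sep: float) -> bool:
    """Allow p only if it is more than min_sep (hypothetical threshold)
    away from every partition currently being routed."""
    return all(separation(p, q) > min_sep for q in active)


a = (0.0, 0.0, 10.0, 10.0)
same_row = (30.0, 0.0, 40.0, 10.0)    # far away horizontally, same row
diagonal = (30.0, 30.0, 40.0, 40.0)   # separated on both axes
print(separation(a, same_row), separation(a, diagonal))  # 0.0 20.0
```

With this reading, a long horizontal or vertical wire cannot cross two simultaneously scheduled partitions, which is the situation the constraint is meant to rule out.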
Quality of Result
Routing quality is a crucial issue when multiprocessor routing is considered, particularly because it is very easy for quality to deteriorate when more than one processor is applied. For global routing, quality is related to the net ordering issue. Assume the net order in sequential global routing is ρ1, ρ2, ..., ρn, and let ρ(β) denote the order of the net whose bounding box is β. To get a result similar to sequential global routing, we can add the following constraint to Equation 1:
In detailed routing, a major reason for quality degradation is misalignment of wires. Consider three partitions in a row that contain a two-pin net with one pin in the leftmost partition and the other pin in the rightmost partition. If we schedule the partitions in left-to-right or right-to-left order, we will produce a simple L-shaped route. However, if we route the first partition first, then the third partition, and then the second, we may form a suboptimal route due to the misalignment of the wires (see Figure 5). The simple scheduling constraint (Equation 2) leads to a row-sweeping partition order (Figure 3). This scheduling prevents the misalignment problems shown in Figure 5, and so it produces excellent routing quality. However, since Equation 2 only permits single-processor routing, we need to form a different constraint to maintain quality in a multiprocessor environment. We define a partition to be row adjacent (Φr) to another partition if it is in the same partition row and is immediately to the left or right of that partition. We define column adjacency (Φc) similarly.
We define a scheduling constraint as follows:
We disregard the Φr and Φc constraints when there are no previously routed partitions in that row or column. Equation 5 prevents wire misalignment problems such as the one shown in Figure 5. This can be applied in a multiprocessor setting as well.
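One possible reading of the adjacency-based quality constraint is sketched below. It assumes that a partition may be scheduled only if, in its row, either nothing has been routed yet or a row-adjacent partition has already been routed, and likewise for its column. This interpretation, the (row, col) indexing, and the name `quality_ok` are assumptions made for illustration; the exact form of Equation 5 is not reproduced here.

```python
from typing import Set, Tuple

Part = Tuple[int, int]  # (row, col) index of a partition in the grid (assumed)


def quality_ok(p: Part, routed: Set[Part]) -> bool:
    """Check the assumed row/column adjacency condition for partition p."""
    row, col = p
    row_started = any(r == row for r, _ in routed)
    col_started = any(c == col for _, c in routed)
    row_adjacent = (row, col - 1) in routed or (row, col + 1) in routed
    col_adjacent = (row - 1, col) in routed or (row + 1, col) in routed
    # The adjacency requirement is disregarded for a row or column in which
    # nothing has been routed yet.
    return ((row_adjacent or not row_started)
            and (col_adjacent or not col_started))


# Three partitions in one row: after routing the leftmost one, the rightmost
# is blocked until the middle one has been routed.
routed = {(0, 0)}
print(quality_ok((0, 2), routed))  # False
print(quality_ok((0, 1), routed))  # True
```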
Scaling
The main reason for using multiple processors while routing is to reduce the runtime. If we define T(n) as the runtime required to route using n processors, then the scaling, scaling(n), measures the speedup achieved relative to using a single processor:

$$\mathrm{scaling}(n) = \frac{T(1)}{T(n)} \qquad (6)$$
However, scaling can be difficult to measure. It is affected by the machine load and depends on the underlying routing algorithm used to route the regions. Generally, it takes longer to route a given region when using multiple processors than it does when a single processor is used. This difference can be due to lock contention, additional overhead, or even the scheduling algorithm itself. Therefore, we focus on parallelism, which measures the number of partitions that are routed simultaneously. This metric measures how well the scheduling algorithm supports parallel routing, and it does not depend on the actual routing algorithm used. If it takes τ amount of time to route a design and numRegion(t) is the number of regions being simultaneously routed at time t, then our parallelism metric is:

$$\mathrm{parallelism} = \frac{1}{\tau} \int_{0}^{\tau} \mathrm{numRegion}(t)\, dt \qquad (7)$$
In practice, this parallelism metric is easily calculated. We record the number of regions being simultaneously routed every time we begin or finish routing a region, and then multiply by the appropriate time interval. The parallelism reflects the scaling we would see if there were no overhead associated with parallel processing. The true scaling (Equation 6) is expected to be less than the parallelism (Equation 7). For our routing algorithm, we observe a close correlation between the parallelism measured and the true scaling observed.
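The bookkeeping described above can be sketched as a time-weighted average over begin and finish events. The event encoding (a timestamp paired with +1 for a begin and -1 for a finish) and the function name `parallelism` are assumptions of this sketch, as is measuring τ from the first begin to the last finish.

```python
from typing import List, Tuple


def parallelism(events: List[Tuple[float, int]]) -> float:
    """Time-weighted average number of regions being routed.

    events: (time, delta) pairs, delta = +1 when routing of a region begins
    and -1 when it finishes.
    """
    ordered = sorted(events)
    if len(ordered) < 2:
        return 0.0
    start, end = ordered[0][0], ordered[-1][0]
    if end == start:
        return 0.0
    area = 0.0    # accumulated sum of numRegion * time interval
    active = 0    # regions in flight since the previous event
    prev = start
    for t, delta in ordered:
        area += active * (t - prev)
        active += delta
        prev = t
    return area / (end - start)


# Region A is routed during [0, 4] and region B during [1, 3].
events = [(0.0, +1), (1.0, +1), (3.0, -1), (4.0, -1)]
print(parallelism(events))  # 1.5
```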
To obtain better parallelism, we add one more constraint to the global router. Since the runtime of routing β is correlated with its area A(β), we categorize all bounding boxes into several groups, G, according to their size.
Λ is the area of one global cell. The global router does not proceed to the next group until every β in the current group has been routed.
It should be noted that the parallelism is affected by the scheduling constraint used. The constraints we have proposed restrict which regions we simultaneously schedule. If we relax our constraints, we may be able to route more regions at the same time at the cost of more locking and poorer routing quality. In such a case the true scaling may actually deteriorate. For our scheduling algorithm, we choose to restrict our constraints to greatly reduce the locking required and to maintain good routing quality. Some heuristics can be applied to promote reasonable parallelism for larger designs.
ALGORITHM
Our scheduling scheme follows the overall multiprocessor flow in Figure 2. For global routing, we combine the constraint for better scaling (Equation 8) with the constraint for SMP (Equation 1). There is one more issue we need to solve: the failure of Route(βi). Such a failure occurs because the routing region of a net is restricted to its bounding box. Hence, the bounding box of a failed net is enlarged to fit the size of the next group, repeatedly if necessary, until the bounding box equals the size of the chip.
if Route(βi) fails then
    increase βi area to fit next group
end if

Figure 6: Rerouted Failed Nets
In detailed routing, we combine a constraint that reduces locking (Equation 3) with a constraint that ensures good routing quality (Equation 5). We add some heuristics to provide reasonable parallelism for larger designs for both global and detailed routing.
EXPERIMENTS
We report only detailed routing experiment results here. Three designs of varying size are routed. The first is 0.1 cm by 0.1 cm, which we divide into 144 partitions. It has 3500 nets routed on three metal layers. The second is 0.13 cm by 0.13 cm with 868 routing partitions, 15500 nets, and four metal layers. The third is 0.22 cm by 0.16 cm with 3008 partitions and 84500 nets on six metal layers. We routed each design using one processor and the default scheduling (Equation 2) and computed the final wire length and via count as a measure of routing quality. Then we applied our scheduling scheme and used 2, 4, and 6 processors. In all cases, the router finished all nets without any design rule violations. Results are summarized in Table 1. We see that the routing quality remains consistently high regardless of the number of processors used.
Next, we looked at the parallelism results. Our router uses multiple rip-up and reroute iterations to resolve design rule violations. In the initial iteration, we route all the partitions. In subsequent iterations, we only route the partitions that contain design rule violations. We focused our attention on the parallelism of the first iteration, since it schedules the maximum number of partitions, and we used the metric given by Equation 7. We applied our scheduling scheme using 1, 2, 4, and 6 processors on each design. Results are given in Table 1.
We observe that for the smallest design, the parallelism plateaus around 2 regardless of the number of processors used. The second design can effectively use more processors, with parallelism topping out before 4. The largest design makes the best use of parallel processing, with a parallelism factor of over 5 on 6 processors. Clearly the amount of parallelism achievable is related to design size. It is easier for our scheduling algorithm to assign partitions for routing when there are more partitions in the underlying design. Fortunately, we are generally most interested in high parallelism when we deal with large designs.
CONCLUSIONS
We have presented two scheduling algorithms that enable global and detailed routing to be performed effectively using multiple processors. These algorithms combine constraints that ensure routing quality with constraints that reduce conflicts between processors. Experimental results show that the proposed scheduling significantly reduces runtime on large designs without sacrificing routing quality.
REFERENCES

