Layout verification determines whether the polygons that represent different mask layers in the chip conform to the technology specifications. Commercial layout verification programs can take tens of hours to run on the flattened representations of large designs. It is therefore desirable to run the DRC problem in parallel to reduce the runtimes. Also, the memory requirements of large chips are such that the entire chip description may not fit in the memory of a single workstation; parallel processing allows one to distribute the memory requirements of the problem across multiple processors. In this paper, we present a parallel implementation of a design-rule checking program called ProperDRC, which is implemented on top of the ProperCAD environment. ProperDRC has two novel contributions over previous work. First, it is portable across a large number of multiprocessor platforms, including shared memory multiprocessors, message-passing distributed memory multiprocessors, and hybrid architectures comprised of uni- and multiprocessor workstations connected by a network. Second, ProperDRC is able to exploit multiple levels of parallelism. It can utilize data parallelism, task parallelism, or a simultaneous combination of the two types of parallelism to perform DRC operations concurrently on a multiprocessor architecture. This paper presents specifics of the implementation of ProperDRC, provides an analysis of the methods used to obtain parallelism, addresses load balancing issues, and reports on experimental results on various benchmark circuits.
Introduction
Layout verification determines whether the polygons that represent different mask layers in a VLSI chip conform to the technology specifications. One aspect of layout verification is design rule checking (DRC), which detects violations of rules such as width, space, and overlap rules that govern the technology in which the chip is to be fabricated. The computational complexity of layout verification programs is not due to the intrinsic complexity of each operation but to the large number of parts in the layout, which can consist of tens of millions of rectangles for large designs. The most sophisticated commercial layout verification programs, such as DRACULA and VAMPIRE from Cadence Design Systems and CHECKMATE and PARADE from Mentor Graphics, can take tens of hours to run on the flattened representations of large designs. It is therefore desirable to run the layout verification problem in parallel to reduce the runtimes. Also, the memory requirements of large chips are such that the entire chip description may not fit in the memory of a single workstation; parallel processing allows one to distribute the memory requirements of the problem across multiple processors.
In this paper, we will present a parallel implementation of a design-rule checking program called ProperDRC which is implemented on top of the ProperCAD environment. ProperDRC has two novel contributions over previous work. First, it is portable across a large number of multiprocessor platforms, including shared memory multiprocessors, message-passing distributed memory multiprocessors, and networks of workstations. Second, ProperDRC is able to exploit multiple levels of parallelism. It can utilize data parallelism, task parallelism, or a simultaneous combination of the two types of parallelism to perform DRC operations concurrently on a multiprocessor architecture. ProperDRC currently works on Manhattan geometries only (where the edges of rectangles are parallel to the X and Y axes), but conceptually the parallel approaches can be extended to handle non-Manhattan geometries as well since the algorithms for layout operations are all based on scanline algorithms.
The objectives of the ProperCAD project are to develop efficient parallel algorithms for VLSI CAD tasks that can utilize the computing power of a wide range of parallel platforms in order to reduce the design turnaround time of complex chips [1, 2, 3]. We have developed a PoRtable Object-oriented Parallel EnviRonment for CAD algorithms (ProperCAD II), which is a C++ object library targeted at medium-grain parallelism and MIMD parallel architectures (shared memory and message passing). Parallel CAD algorithms developed on this library run unchanged and efficiently on both shared memory and message-passing architectures. The difference between previous work on portable parallel programming and our ProperCAD effort is that we have avoided defining a new language for writing parallel programs. We have instead used an existing, established object-oriented language (C++) as the base language, to exploit the object-oriented nature of programming, and augmented it with an efficient C++ class library to help write portable parallel programs. The ProperCAD II framework runs on shared memory multiprocessors such as the SUN 4/600MP, SUN Sparcserver 1000, the Encore Multimax, and the Silicon Graphics Challenge, on distributed memory message-passing multicomputers such as the Intel iPSC/860 hypercube, the Intel Paragon, the Thinking Machines CM-5, and the IBM SP-2, and also on a network of SUN workstations. We are investigating parallel algorithms for various VLSI CAD applications on top of the ProperCAD II framework. The applications include cell placement [4, 5], global and detailed routing, circuit extraction [6], logic synthesis [7, 8], test generation [9, 10], fault simulation [11], circuit, logic and behavioral simulation, and high level synthesis.
In this paper, we describe parallel algorithms for layout verification of flattened VLSI layouts using the ProperCAD framework. While some layout verification tools exploit the hierarchical information available in VLSI chip designs during the chip design stage, while designers are interactively designing a chip, many companies perform a complete flattened chip design rule check just prior to tape-out to avoid the economic penalties of possibly sending out an incorrect layout for costly fabrication [12, 13]. The runtimes of these flattened layout verification tools can run into tens to hundreds of hours for large commercial designs having tens of millions of rectangles. This is true for commercial tools such as CHECKMATE and PARADE from Mentor Graphics, and DRACULA and VAMPIRE from Cadence Design Systems. It is therefore important to investigate parallel algorithms for layout verification.
Another problem of flattened design rule checking is its tremendous memory requirements. One can assume that each transistor in a VLSI design translates to about 10-20 rectangles on a mask layout [31]. In order to represent a mask layout, one needs to store the X and Y location, and additional information about the layout mask layer, orientation, etc., which would need about 20-40 bytes per rectangle [31]. For a 10 million transistor circuit representative of current microprocessors, the memory requirements reach 8 Gbytes at the upper end of this simple analysis (10 million transistors × 20 rectangles × 40 bytes). The above analysis is for simply representing the layout. During the layout verification tasks, additional data structures and temporary data storage are used. Clearly, these memory requirements are too large to fit in the memory of a conventional workstation. Using data partitioning, one can partition the memory requirements of the layout among various processors in a parallel machine and enable the execution of these large problems. We will show in the results section of this paper examples of large layouts that cannot run on a single processor.
This paper is organized as follows: Section 2 will describe the details of the serial algorithm for design rule checking. Section 3 will discuss related work in parallel design rule checking algorithms. Details of the parallel DRC algorithm and of the ProperDRC implementation are described in Section 4. Performance results for ProperDRC are presented in Section 5. Section 6 contains an analysis of the performance of the parallel DRC. Section 7 summarizes the work in parallel DRC performed in this research.
Serial Design Rule Checking
To guarantee that a circuit can be reliably fabricated, it is necessary to impose a set of design rules on the layout geometry. Figure 1 shows some examples of design rules.
The algorithms presented in this paper use an edge-based representation scheme to describe the layout geometry. The masks of a Manhattan geometry can be represented by horizontal edges only, because the vertical edges can be reconstructed by examining the opaqueness or transparency of the areas above and below each horizontal edge. More details of the algorithms are presented in [14].
While a naive DRC algorithm would check for all possible interactions of all N^2 pairs of rectangles in a design consisting of N rectangles, a data structure called the scanline is useful for performing operations on a geometry that uses an edge representation. The basic idea of a scanline algorithm is to sweep a vertical line across the edges that constitute a mask layer. Each horizontal location the scanline encounters is called a scanline stop. Only the edges that encounter the scanline are considered at a given time. Scanline algorithms can be implemented in a space-efficient manner. An edge in the circuit area is brought in from the global data structure containing N rectangles to an intermediate data structure containing on average O(√N) edges. An edge is included in the scanline data structure when its left endpoint touches the scanline and is removed from it when its right endpoint touches the scanline [15]. Figure 2 illustrates the basic scanline operation. The scanline stops can be restricted to locations on the circuit area that correspond to the left or right endpoint of an edge.
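The sweep described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ProperDRC implementation: the `Edge` record and the `sweep` generator are hypothetical names, and the active set is kept in a plain list rather than a balanced structure.

```python
from collections import namedtuple

# Hypothetical edge record: a horizontal edge from x_left to x_right at height y.
Edge = namedtuple("Edge", ["x_left", "x_right", "y"])

def sweep(edges):
    """Yield (stop, active_edges) for every scanline stop, left to right."""
    # Scanline stops are restricted to left or right endpoints of edges.
    stops = sorted({e.x_left for e in edges} | {e.x_right for e in edges})
    by_left = sorted(edges, key=lambda e: e.x_left)
    active = []
    next_edge = 0
    for x in stops:
        # Remove edges whose right endpoint touches the scanline.
        active = [e for e in active if e.x_right > x]
        # Insert edges whose left endpoint touches the scanline.
        while next_edge < len(by_left) and by_left[next_edge].x_left == x:
            active.append(by_left[next_edge])
            next_edge += 1
        yield x, sorted(active, key=lambda e: e.y)

edges = [Edge(0, 4, 0), Edge(0, 4, 2), Edge(2, 6, 1), Edge(2, 6, 3)]
trace = [(x, len(active)) for x, active in sweep(edges)]
```

Only the edges currently crossing the scanline are held in `active`, which is what keeps the working set far smaller than the full layout.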
Task Graph Generation
Advances continue to occur in VLSI manufacturing technology; therefore, a practical DRC tool must have the flexibility to accommodate changes in design rules. ProperDRC reads a set of design rules from an input file and then generates a graph of the tasks required to perform the design rule checks. If it becomes necessary to change any of the design rules, only the input file must be modified; no changes to the source are necessary.
ProperDRC has the capability to test for violations of any of the following types of rules: Width, Spacing, Enclosure, Overlap, No-Overlap, and Extension. The Width rule is used to specify a minimum feature width for a given layer. The Spacing rule defines the minimum distance between geometries in two different layers, or between two geometries in a single layer. The Enclosure rule is used when one feature must surround another feature by a minimum distance on all sides. The Overlap rule is used when features in a given layer must always be overlapped by another layer. The No-Overlap rule serves just the opposite purpose, and is used to prevent two features in separate layers from occupying the same space. If one geometry feature must extend past the boundary of another feature by a certain minimum distance, the Extension rule is used.
Each of these rule checks is broken into one or more elementary tasks. The elementary tasks used by ProperDRC are Boolean operations between two layers, the Square-Test operation, the Grow operation, and Width/Spacing testing. The majority of the rules equate to a single elementary task, as shown in Figure 3. The circles in the task graph represent the input layers, and the squares represent layers generated by the operation listed, which will contain all of the geometry edges that fail to pass the corresponding design rule.
The Enclosure and Extension rules require a series of three and ten elementary tasks, respectively, to test for rule violations. The task graphs corresponding to these two types of design rules are shown in Figure 4.
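As a rough illustration of how a rule file drives the amount of work, the sketch below parses a rule deck and counts elementary tasks. The file format (`kind layer... distance`) and the function names are hypothetical. The task counts for Enclosure (3) and Extension (10) come from the text; treating each of the other four rule types as exactly one task is an assumption based on "the majority of the rules equate to a single elementary task".

```python
# Elementary-task counts per rule type. Enclosure and Extension are stated in
# the text; the single-task entries are an assumption for illustration.
TASKS_PER_RULE = {
    "width": 1, "spacing": 1, "overlap": 1, "no-overlap": 1,
    "enclosure": 3, "extension": 10,
}

def parse_rules(text):
    """Parse lines such as 'spacing metal1 metal2 3' (hypothetical format)."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # allow comments and blank lines
        if not line:
            continue
        kind, *layers, distance = line.split()
        kind = kind.lower()
        if kind not in TASKS_PER_RULE:
            raise ValueError("unknown rule type: " + kind)
        rules.append({"kind": kind, "layers": layers, "distance": int(distance)})
    return rules

def count_tasks(rules):
    """Total elementary tasks, before any sharing of common subtasks."""
    return sum(TASKS_PER_RULE[r["kind"]] for r in rules)

deck = """
width poly 2          # minimum poly width
enclosure via metal2 1
extension poly diff 2
"""
rules = parse_rules(deck)
total = count_tasks(rules)
```

Changing a design rule then means editing the deck only, which mirrors the flexibility argument made above.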
Elementary DRC Tasks
Boolean operations between layers are performed using Lauther's scanline algorithm [15]. The arguments to the function are the two input layers, and the result is a set of edges for the newly formed layer. This operation takes O(N log N) time for N edges. The edges of the newly formed layer must be sorted, so that they may be used as arguments to subsequent tasks. Szymanski and Van Wyk have demonstrated that the natural ordering for the output of a scanline operation can be exploited to perform the sort in O(log N) time [16].
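To make the Boolean mask operation concrete, here is a simplified slab-by-slab AND of two layers. It is not Lauther's algorithm: it works on rectangles `(x1, y1, x2, y2)` rather than edges and rescans every rectangle at each slab, so it does not achieve O(N log N). All function names are hypothetical.

```python
def merged_intervals(rects, x_lo, x_hi):
    """Union of the y-intervals of rectangles spanning the slab [x_lo, x_hi]."""
    spans = sorted((y1, y2) for (x1, y1, x2, y2) in rects
                   if x1 <= x_lo and x2 >= x_hi)
    merged = []
    for lo, hi in spans:
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged

def intersect(a, b):
    """Intersect two sorted lists of disjoint intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def layer_and(layer_a, layer_b):
    """Return rectangles covering the area common to both layers."""
    stops = sorted({x for r in layer_a + layer_b for x in (r[0], r[2])})
    result = []
    for x_lo, x_hi in zip(stops, stops[1:]):
        ya = merged_intervals(layer_a, x_lo, x_hi)
        yb = merged_intervals(layer_b, x_lo, x_hi)
        for y_lo, y_hi in intersect(ya, yb):
            result.append((x_lo, y_lo, x_hi, y_hi))
    return result

overlap = layer_and([(0, 0, 4, 4)], [(2, 2, 6, 6)])
```

Because slabs are visited left to right and intervals bottom to top, the output is already ordered, which is the property the sorting remark above relies on.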
A special case of the Boolean operation is the paint operation, in which a scanline is passed over a single layer, and no Boolean operation is performed per se. The purpose of the paint operation is to form a set of maximal, nonoverlapping edges. This procedure is necessary before passing a set of edges to the Width/Spacing operation, to remove any edges that may appear inside the interior of a polygon, and possibly cause erroneous clearance violation reports.
The Square-Test operation groups every edge of a given layer into pairs that form squares of a given size. This test is used to verify the size of features in the contact and via layers.
This task is also used to implement the Extension test. This operation takes O(N) time for N edges.
Grow Operation
The Grow operation is performed on a given layer and produces a new set of edges, in which every rectangle is expanded by a specified size. A modification of Lauther's Boolean mask scanline algorithm [15] is used to perform the grow operation. The implementation of the grow itself actually occurs at the output of the scanline operation, so a Boolean mask operation and grow operation can be performed simultaneously, if necessary.
There are two versions of the Width/Spacing test. One takes a single layer as an argument and tests all edges in that layer against other edges in the same layer for minimum width and spacing requirements. The second version takes two layers as arguments and tests all edges in the first layer against edges in the second layer, and vice versa, for spacing violations. The differences between these two versions are minor, so a single routine is used to perform both functions.
Several optimizations can be used to streamline the clearance checking algorithm. Only the endpoints of an edge must be tested. Ideally, an endpoint only needs to be tested against edges that lie within a circle whose center lies on the endpoint and whose radius is the minimum allowable clearance. For Manhattan edges, the search range can be further reduced by dividing the circle into quadrants. Only edges that lie in one quadrant of the circle require testing for spacing violations. If a Width test is also being performed, a second quadrant of the circle must be tested for width violations.
The ProperDRC Width/Spacing test uses scanlines similar to the ones used in the Lauther algorithm. However, several scanlines must be kept in memory at a time, to test for violations between neighboring edges. Therefore, a new data structure, the Window, is introduced. A Window consists of a set of scanlines. In addition, the Window contains additional storage for edges that lie parallel to the scanline, which must be generated by the Width/Spacing test routine.
The Window is essentially a swath cut through the circuit with a width of twice the maximum design rule interaction distance (DRID_MAX), which is defined to be the greater of the minimum spacing distance and the minimum width distance for a given layer. This is used to ensure that a given edge is tested only for width/spacing violations against other edges that lie in close proximity to that edge.
To make the Width/Spacing test even more efficient, it is desirable to compare edges within the Window that are in close proximity to each other. Therefore, the width/spacing routine essentially passes a second Window, perpendicular to the first one, across the length of the first Window. The result is that a given edge is tested only against edges that lie in a square whose size is twice the maximum design rule interaction distance on a side. The width/spacing testing can be further optimized by dividing the square into quadrants and restricting the searches to the appropriate quadrants. Details of the algorithms are provided in [14].
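The idea of limiting comparisons to a window of nearby geometry can be illustrated with a much simpler spacing check on rectangles. This sketch is not the ProperDRC Window structure: it prunes in the x direction only, works on rectangles `(x1, y1, x2, y2)` instead of edges, and treats touching or overlapping rectangles as one feature (an assumption). The function name is hypothetical.

```python
import math

def spacing_violations(rects, min_spacing):
    """Return index pairs of rectangles closer together than min_spacing."""
    order = sorted(range(len(rects)), key=lambda i: rects[i][0])
    window = []      # rectangles still within min_spacing of the sweep position
    violations = []
    for i in order:
        x1, y1, x2, y2 = rects[i]
        # Drop rectangles whose right side is too far left to interact.
        window = [j for j in window if rects[j][2] + min_spacing > x1]
        for j in window:
            ax1, ay1, ax2, ay2 = rects[j]
            dx = max(0, x1 - ax2, ax1 - x2)   # horizontal gap, 0 if overlapping
            dy = max(0, y1 - ay2, ay1 - y2)   # vertical gap, 0 if overlapping
            gap = math.hypot(dx, dy)
            # gap == 0 means the rectangles touch or overlap (same feature).
            if 0 < gap < min_spacing:
                violations.append((min(i, j), max(i, j)))
        window.append(i)
    return sorted(violations)

rects = [(0, 0, 2, 2), (3, 0, 5, 2), (10, 0, 12, 2), (5, 4, 7, 6)]
found = spacing_violations(rects, 2)
```

Adding a second, perpendicular pruning step on the y coordinates would bring this closer to the square search region described above.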
Prior Work in Parallel DRC
Several approaches have been explored in the past by other researchers for parallelizing the design rule checking process [17]. We will present an overview of previous work utilizing the following methods: area decomposition on flattened circuits, hierarchical decomposition on hierarchical circuits, functional decomposition on flattened and hierarchical circuits, and edge decomposition on flattened circuits.
Area Decomposition
Bier and Pleszkun have proposed a parallel algorithm that works on the flattened representation of mask layouts and uses an area decomposition strategy [18]. The circuit area is divided into subregions that are distributed to various processors, and each processor performs a complete set of design rule checks on its own subregion. The algorithm can work on polygon, pixelmap, or edge-based geometry representations.
Care must be taken when dividing the circuit area into subregions. A cut through the circuit area may introduce errors by dividing geometry features into pieces that do not pass the design rules by themselves. Furthermore, some design rule infractions may go undetected if the offending features lie on opposite sides of the dividing line. Both of these problems can be alleviated by extending the area of each processor's subregion on all sides by the maximum design rule interaction distance, which is defined to be the size of the largest constraint placed on the layout for a given technology. Any errors detected within the overlap region are discarded rather than reported.
This work did not specifically address the issue of load balancing. The circuit was partitioned by equal area regions. In chips with widely varying densities of rectangles, one region can have a large number of rectangles; hence, the speedup would be less than linear. We address this problem in our work.
A second problem is that the above work basically partitioned the chip area in a single dimension, by columns. It is well known that the perimeter of a square is less than that of a rectangle of equal area. Because the larger perimeter translates to an increase in the overlap area between processors, two-dimensional partitioning, as used in ProperDRC, minimizes the total amount of area assigned to each processor.
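The perimeter argument can be checked with a small calculation. The sketch below compares the total area handled (own area plus overlap halo) when a chip is cut into column strips versus a square grid. It assumes every region is extended by the halo on all four sides, including at the chip boundary, which slightly overstates both totals; the function name and the numbers are illustrative only.

```python
def total_extended_area(width, height, rows, cols, halo):
    """Sum of region areas after growing each region by halo on all sides."""
    region_w = width / cols
    region_h = height / rows
    return rows * cols * (region_w + 2 * halo) * (region_h + 2 * halo)

# 16 processors on a 1000 x 1000 chip with an interaction distance of 10.
strips = total_extended_area(1000, 1000, rows=1, cols=16, halo=10)
grid = total_extended_area(1000, 1000, rows=4, cols=4, halo=10)
```

With these values the strip partition processes 1,346,400 area units in total and the grid 1,166,400, so the two-dimensional cut carries less redundant overlap for the same processor count.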
Hierarchical Decomposition
Unlike a flattened VLSI layout representation, in which all of the geometry features of a circuit are explicitly specified at all of the mask layers, a hierarchical representation of a VLSI layout groups sets of geometries into a single symbol, which usually represents a single functional unit of some type. Symbol calls can be nested, providing a tool for structured design.
A parallel DRC tool has been developed by Gregoretti and Seagall that takes advantage of the hierarchical representation of the circuit [19]. A generalized data type, called the token, is introduced to represent either a single geometry feature or a collection of features grouped into a symbol. The design rule checks are performed on the tokens themselves. When two tokens overlap, new tasks are generated in which one token is tested against all of the tokens represented by the second token, if it represents a symbol, or against the single feature the second token represents. The process is parallelized by having all processors take tasks from, and add tasks to, a common task queue.
This approach exploits parallelism only at the level of cells. If there are fewer cells in the design than processors in the multiprocessor, or if the cells have widely different sizes, there can be a load balancing problem. Also, this approach is not applicable to flattened circuit descriptions since there will only be a single task.
The other disadvantage of this approach is that it is not memory scalable. If the edge-based representation of a circuit is too large to fit in the memory of a single processor, the multiprocessor will not be able to operate on the circuit.
Task Partitioning
Task partitioning of the DRC process relies on the fact that a design rule check does not entail the execution of a single algorithm, but instead requires the sequential execution of many computationally independent algorithms. The goal of task partitioning is to perform the computations necessary for separate rule checks simultaneously on different processors, while at the same time not duplicating the computations that contribute to the checking of more than one rule.
Marantz developed a system that provided a general method of controlling the execution of any program that can be divided into a finite set of tasks [20]. This system was applied to the DRC problem by distributing the design rules to the various processors, where each processor applies its subset of rules to the entire circuit area.
It should be noted that the task parallel approach is the easiest to incorporate into a large piece of layout verification software, since one can partition the rules among different DRC runs on different processors. This is the approach used in a commercial version of a parallel DRC called DRACULA from Cadence Design Systems, which runs on networks of workstations and on shared memory multiprocessors such as the SPARCServer 1000.
We will show in the results in Section 5 that pure task parallelism produces limited speedups, since there is not enough task parallelism in real design rules; such an approach is appropriate only for a small number of processors, e.g., 4 to 8. Therefore, the approach is not scalable. This method of parallelization also suffers from the same memory scalability problem as the previous approach, in that each processor must have enough memory to perform operations on the entire circuit area.
Edge Partitioning
Carlson and Rutenbar have developed an algorithm in which all scanline stops are generated at the start of the checking and then processed in parallel [21, 22]. It is necessary to decompose the circuit geometry into a completely intersected set of edges so that the set of all edges crossing a given scanline is immediately available. Boolean operations between layers, the determination of electrically connected sets of geometries, and checking for width, spacing, and extension violations are all performed in parallel on the scanlines.
This approach is applicable only to a single type of architecture, namely SIMD data parallel computers. This algorithm, therefore, is not appropriate for many of the powerful parallel machines available today.
A New Approach to Parallel DRC
The serial design rule checking algorithm presented in Section 2 can be parallelized in two ways: First, the circuit area may be divided, and design rule checks performed on the subregions simultaneously; second, the series of elementary DRC tasks necessary to perform the checks for the various rules can be divided between processors.
These two methods of parallelizing the design rule checking process are completely independent of one another. Therefore, the data and task parallelism can be considered orthogonal axes of parallelism, in which exploitation of one of the two types of parallelism, or both simultaneously, will result in performance gains.
Data Parallelism
In ProperDRC, data parallelism is achieved by dividing the circuit in two dimensions into subregions and distributing the subdivisions of the circuit geometry between processors, or clusters of processors, depending on whether or not task parallelism is being implemented simultaneously. For the purposes of discussion in this section, let us consider the issues of implementing data parallelism by itself. Task parallelism, and a combination of the two types of parallelism, will be discussed in later sections.
The data partitioning scheme used in ProperDRC uses the number of rectangles assigned to a given processor as an estimate of workload. It should be noted that the actual amount of computation performed by a processor depends on the exact DRC checks performed on the rectangles within a region (see Section 6 for analysis of computations for various checks taking between O(N) and O(N log N) time for N rectangles). Partitioning the circuit by assigning equal area regions to each processor does not necessarily produce a balanced load, since the distribution of geometry features within the circuit area may not be uniform.
Several researchers have worked in the area of load balancing and partitioning of points in two dimensions [23]. Salmon [24] has proposed the use of the Orthogonal Recursive Bisection (ORB) scheme for solving the N-body problem [25, 26]. Cybenko has reported on a scheme for recursive decomposition of workload in a multiprocessor [27]. Belkhale and Banerjee have proposed an alternate recursive partitioning algorithm for partitioning a set of points on a multiprocessor [28], and have reported implementations of this scheme in the context of a parallel circuit extractor [29, 30]. All of the above partitioning methods are fairly complex to implement efficiently.
ProperDRC utilizes a data partitioning strategy based on a scheme proposed by Ramkumar and Banerjee [6] for parallel circuit extraction. The decomposition is performed by repeatedly subdividing the circuit area to produce subregions of equal area. The subdivision continues until all processors have equal areas of the circuit geometry, and may continue further to facilitate load balancing.
The physical layout description is read from offline storage in the Caltech Intermediate Form (CIF) representation [31]. Rectangles are distributed to the corresponding processors in batches, so that it is never necessary to keep the entire circuit description in the memory of a single processor. The multiprocessor architecture can therefore operate on a circuit area that is too complex to fit into the memory of an individual processor.
The capability to perform layout verification on a circuit area that is too large for a uniprocessor is one of the most important advantages of performing design rule checking in parallel. The drawback associated with this method is that, because the entire circuit is never in the memory of a single processor at one time, no global quantification of the distribution of geometry features within the circuit area is possible. For this reason, circuit partitioning must be based purely on circuit area rather than on dividing the geometry features themselves equally among processors.
To balance the load between processors, additional decomposition is performed to further subdivide the chip area. A grainsize is specified by the user to limit the amount of additional decomposition performed. All areas that contain a number of geometry features greater than the specified grainsize are subdivided. Circuit geometry regions are then remapped to processors in such a way as to provide the best load balancing.
Choosing the optimal grain size is a hard problem since, in general, the runtime of a parallel DRC tool as a function of grain size has a bathtub characteristic. If the grain size is too large, the load is unevenly balanced, and the runtimes of a parallel DRC program will be large for layouts containing irregular distributions of rectangles. If the grain size is too small, a large number of tasks is created, and each task generates some redundant work in the form of extra checks that must be performed at the boundaries of the partitions (see Section 6.2 for a detailed analysis); again, the runtimes of the parallel DRC tool will be large. Between these extremes lies an optimum grain size at which the runtime is minimum. Since the distribution of rectangles in a circuit is not known a priori, it is impossible to determine the optimal grain size for all layouts. We will discuss experimental data on the choice of the grain size for some example layouts in Section 5. A good heuristic is to choose a grain size of around N/(σP) rectangles, where N is the number of rectangles, P is the number of processors, and σ is the variance of the distribution of rectangles per unit area of the chip. We assume a typical value of σ to be 2 for real designs.
Figure 5(a) shows the initial circuit partitioning on four processors for a sample circuit area, in which the X's represent geometry features. The dashed lines show the initial division of the circuit into equal-area subregions, which are assigned one per processor. Figure 5(b) shows the same circuit after the load balancing algorithm is applied with a specified grainsize of 5. The region initially assigned to Processor 2 has been divided into two parts. Processor 3 will perform the DRC checks on the subregion on the right, to maintain better overall load balance.
Note that if a grainsize of 10 had been specified by the user, no further subdivision of the circuit beyond the initial partitioning shown in Figure 5(a) would have taken place. The extent to which load balancing takes place is therefore completely under the user's control, which gives the user the flexibility to customize the performance of the algorithm to take full advantage of the multiprocessor architecture by selecting an appropriate grainsize.
The geometry layers are distributed in the original polygon representation used by the CIF input file. The conversion from polygon to edge-based representation takes place at the clusters that will perform the DRC tests on the area. Delaying the conversion until after the partitioning has two advantages: the messages are smaller, because one rectangle expands to two edges, and the conversion work is distributed to reduce the amount of time required.
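A minimal sketch of grainsize-driven subdivision followed by remapping is given below. Several things here are assumptions rather than the ProperDRC method: features are reduced to points, all points are held in one list (the real tool never holds the whole circuit on one processor), each split halves the longer side of the region, and the remapping is a simple greedy heuristic (heaviest region first, to the least-loaded processor). The names `subdivide` and `remap` are hypothetical.

```python
import heapq

def subdivide(region, points, grainsize):
    """Halve a region by area until it holds at most grainsize points."""
    x1, y1, x2, y2 = region
    inside = [(x, y) for (x, y) in points if x1 <= x < x2 and y1 <= y < y2]
    # Stop at the grainsize, or when the region is too small to split further.
    if len(inside) <= grainsize or (x2 - x1 <= 1 and y2 - y1 <= 1):
        return [(region, len(inside))]
    if x2 - x1 >= y2 - y1:
        mid = (x1 + x2) / 2
        halves = [(x1, y1, mid, y2), (mid, y1, x2, y2)]
    else:
        mid = (y1 + y2) / 2
        halves = [(x1, y1, x2, mid), (x1, mid, x2, y2)]
    return [leaf for h in halves for leaf in subdivide(h, inside, grainsize)]

def remap(leaves, num_procs):
    """Greedy remapping: heaviest region first, to the least-loaded processor."""
    heap = [(0, p) for p in range(num_procs)]
    assignment = {p: [] for p in range(num_procs)}
    for region, count in sorted(leaves, key=lambda leaf: -leaf[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(region)
        heapq.heappush(heap, (load + count, p))
    loads = {p: load for load, p in heap}
    return assignment, loads

points = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5), (2.5, 2.5),
          (3.5, 2.5), (2.5, 3.5), (3.5, 3.5), (6, 6), (6, 1)]
leaves = subdivide((0, 0, 8, 8), points, grainsize=5)
assignment, loads = remap(leaves, num_procs=2)
```

In this example the dense lower-left corner is split twice while the sparse right half stays whole, so no region exceeds the grainsize and the two processors end up with loads of 6 and 4 features.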
Some overlap is necessary between the areas assigned to the various processors to ensure that no pairs of neighboring edges are overlooked. Each processor will receive all rectangles that lie within its area extended on all sides by the maximum design rule interaction distance for the technology. Figure 6 shows the partitioned circuit area from Figure 5 (b) with the addition of the overlap areas. Rectangles that are present in more than one processor area will be duplicated and trimmed to the respective processor areas.
Trimming the rectangles can easily introduce geometry features that do not pass the design rules. The DRC routine must be careful not to report erroneous results introduced by the circuit partitioning. Therefore, upon completion of the design rule tests, infractions that fall within the maximum design rule interaction distance boundary surrounding the processor's area of the circuit are disregarded rather than reported. 
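The overlap-and-discard scheme can be sketched as follows. The helper names are hypothetical, rectangles and regions are `(x1, y1, x2, y2)` tuples, and the use of half-open region bounds (so that a violation on a shared boundary is reported by exactly one processor) is an assumption, not something the text specifies.

```python
def clip(rect, region):
    """Trim rect to region; return None if they do not overlap."""
    x1, y1 = max(rect[0], region[0]), max(rect[1], region[1])
    x2, y2 = min(rect[2], region[2]), min(rect[3], region[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def distribute(rects, regions, drid):
    """Give each region every rectangle in its area grown by drid, trimmed."""
    shares = []
    for (x1, y1, x2, y2) in regions:
        grown = (x1 - drid, y1 - drid, x2 + drid, y2 + drid)
        shares.append([c for c in (clip(r, grown) for r in rects) if c])
    return shares

def keep_violation(point, region):
    """Report a violation only if it lies in the processor's own area,
    i.e. outside the overlap band added around it."""
    x, y = point
    return region[0] <= x < region[2] and region[1] <= y < region[3]

regions = [(0, 0, 10, 10), (10, 0, 20, 10)]
rects = [(8, 1, 13, 3), (15, 5, 17, 7)]
shares = distribute(rects, regions, drid=2)
```

The first rectangle straddles the cut, so both processors receive a trimmed copy of it; a violation located at (11, 2) is then kept by the second processor and discarded by the first.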
Task Parallelism
Task parallelism is achieved by having a group of processors share a single region of the circuit area and divide among themselves the elementary tasks necessary to perform the various DRC tests. If pure task parallelism is desired, the group of processors will actually be the entire set of processors in the multiprocessor architecture, each of which will divide up the DRC tasks for the whole circuit area. When a combination of data and task parallelism is used, the group of processors will represent an individual processor cluster inside the multiprocessor. Let us refer to the group of processors sharing the DRC tasks for a given region of the circuit as a cluster, without loss of generality, for the purpose of the following discussion.
The task graph generated by the serial DRC algorithm is usable for the parallel DRC as well. The ideal parallel implementation would dynamically schedule the tasks: upon the generation of a layer, all of the subsequent tasks that utilize that layer would be spawned on currently idle processors. However, such an implementation is not feasible, due to the dependencies in the task graph. Figure 7(a) shows a sample task graph corresponding to a Square Test on the via layer, an Enclosure check on the via and metal2 layers, and a Width/Spacing test on the poly layer.
Elementary DRC operations such as the Boolean mask operations and Width/Spacing tests described in the previous section may have two input layers. These two layers will be generated by other tasks, which precede the current task in the task graph. It is conceivable (maybe even desirable, from a performance standpoint) that the two parent tasks run on separate processors in the cluster. The layers generated by both these parent tasks must be sent to a single processor, so that the subsequent task can be completed. Therefore, it is necessary that the destination processor for the generated layers, and thus the child task itself, be determined a priori. Other solutions to the dependency problem, such as broadcasting the EdgeSets or having the child task explicitly request the layer from the parents, introduce too much communication overhead to be effective.
The mapping of tasks onto processors is obtained by levelizing the task graph. Priorities are assigned to tasks based on the number of levels of subsequent tasks that depend on the output layers. The levelized task graph is filled by arranging tasks in prioritized order. Figure 7(b) shows how the tasks can be tagged with a priority and mapped to a cluster that contains two processors.
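One possible reading of the levelize-and-prioritize step is sketched below. The task names and the example graph are hypothetical and do not reproduce Figure 7, and assigning tasks round-robin within each level in priority order is an assumption about how the levelized graph is "filled". What the sketch does preserve is that every task's processor is fixed before execution starts, so parents know where to send their output layers.

```python
from functools import lru_cache

def levelize(parents):
    """parents maps each task to the tasks that produce its input layers."""
    children = {t: [] for t in parents}
    for task, ps in parents.items():
        for p in ps:
            children[p].append(task)

    @lru_cache(maxsize=None)
    def level(task):       # number of task levels between inputs and this task
        return 1 + max((level(p) for p in parents[task]), default=-1)

    @lru_cache(maxsize=None)
    def priority(task):    # levels of subsequent tasks depending on its output
        return 1 + max((priority(c) for c in children[task]), default=-1)

    return ({t: level(t) for t in parents},
            {t: priority(t) for t in parents})

def map_tasks(parents, num_procs):
    """Statically assign each task to a processor, level by level."""
    levels, priorities = levelize(parents)
    schedule = {}
    for lv in sorted(set(levels.values())):
        row = sorted((t for t in parents if levels[t] == lv),
                     key=lambda t: (-priorities[t], t))
        for slot, task in enumerate(row):
            schedule[task] = slot % num_procs
    return schedule

# Hypothetical task graph; tasks with no parents read input layers directly.
graph = {
    "grow_via": [], "square_via": [], "paint_poly": [],
    "and_metal2": ["grow_via"], "ws_poly": ["paint_poly"],
    "enclosure_check": ["and_metal2"],
}
levels, priorities = levelize(graph)
schedule = map_tasks(graph, num_procs=2)
```

Tasks at the head of long dependency chains receive the highest priority and are placed first within their level, which keeps the longest chain from being delayed behind short, independent checks.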
A task can begin as soon as its input layers arrive at the destination processor. There is no need for explicit synchronization between all of the processors of the cluster at the task graph level boundaries, so the penalty for load imbalance is not as severe as with the traditional barrier-synchronized implementation of a levelized task graph. Furthermore, because the complexities of the various algorithms used to implement the elementary tasks can be used to estimate the length of time required to perform the operations for a given problem size, the potential exists for some intelligent scheduling methods to minimize the imbalance between processors.
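To illustrate why removing the barrier reduces the penalty for load imbalance, the following sketch compares the finishing time of a small levelized schedule with and without barriers at the level boundaries. The task names, costs, and processor assignment are hypothetical, and communication time is ignored; it is an illustration only, not a model of ProperDRC.

```python
def dataflow_makespan(tasks):
    """Each task starts once its inputs exist and its processor is free."""
    proc_free = {}
    finish = {}
    for name, (proc, cost, parents, _level) in tasks.items():
        ready = max((finish[p] for p in parents), default=0.0)
        start = max(ready, proc_free.get(proc, 0.0))
        finish[name] = start + cost
        proc_free[proc] = finish[name]
    return max(finish.values())


def barrier_makespan(tasks):
    """All processors wait at the end of every level of the task graph."""
    levels = sorted({level for (_, _, _, level) in tasks.values()})
    elapsed = 0.0
    for current in levels:
        busy = {}
        for proc, cost, _parents, level in tasks.values():
            if level == current:
                busy[proc] = busy.get(proc, 0.0) + cost
        # The level ends only when the most heavily loaded processor is done.
        elapsed += max(busy.values())
    return elapsed


# name -> (processor, estimated cost, parent tasks, level), listed in an
# order that respects the dependencies (hypothetical values).
tasks = {
    "a": (0, 4.0, [], 0),
    "b": (1, 1.0, [], 0),
    "c": (0, 1.0, ["a"], 1),
    "d": (1, 4.0, ["b"], 1),
}

dataflow_time = dataflow_makespan(tasks)
barrier_time = barrier_makespan(tasks)
```

In this example the imbalance within each level cancels out across levels when tasks are allowed to start as soon as their inputs arrive, whereas the barrier version pays for the slowest processor at every level.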
Because the number of tasks necessary to perform the design rule checks is fixed for a given technology, and is independent of the problem size, there is an upper bound on the performance that can be achieved by parallelizing these tasks, no matter how effective the load balancing strategies are. This would suggest task parallelism alone is not sufficient to obtain the best performance on a multiprocessor architecture with more than a few processors; a combination of data and task parallelism must be used.
Combination of Task and Data Parallelism
In the case in which a combination of data and functional parallelism is used, clusters of processors are assigned regions of the circuit area, and processors within the cluster perform the DRC tasks in parallel. Two separate load balancing issues must be addressed: The elementary DRC tasks must be divided equally between the processors in each cluster, as discussed in the previous section, and the load must be balanced between the various clusters.
To balance the workload between processor clusters, a separate strategy is introduced. The same initial partitioning method is used as in the purely data parallel version, but the partitioning is done at the cluster level, rather than the processor level. The cluster estimates its own relative need for processing power, based on the number of geometry features inside its circuit region as a fraction of the total number of geometry features in the circuit. A fraction of the total number of available processors is then assigned to the cluster, based on this ratio.
The methods used to select which processors are apportioned to a given cluster can be customized to take advantage of physical locality in a given processor architecture. The end result is that a cluster with a lower workload "loans" one of the processors in its cluster to an overburdened cluster. In this way, more resources are applied to the more dense regions of the circuit area to improve the overall execution time. Figure 8 shows an example of how the load balancing scheme is applied, on an imaginary multiprocessor architecture with eight processors, arranged as four clusters of two processors. The partitioning of the circuit area between the clusters is shown in Figure 8(a). In the absence of the load balancing scheme, these circuit regions are assigned to each of the four homogeneous clusters, as depicted in Figure 8(b), where the circles represent individual processors, and the lines connecting them show the structure of the architecture. Figure 8(c) shows the processor-to-cluster mapping after application of the load balancing scheme. Cluster 3 has essentially borrowed an extra processor from Cluster 2 to compensate for the larger number of geometry features in its region of the circuit area.
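The proportional assignment of processors to clusters can be sketched as below. The greedy rule used here (give the next processor to the cluster with the highest number of geometry features per processor) is an assumption chosen for simplicity, not necessarily the rule used in ProperDRC, and the feature counts are hypothetical values picked to mirror the eight-processor, four-cluster example.

```python
def apportion_processors(feature_counts, num_procs):
    """Assign processors to clusters roughly in proportion to feature counts."""
    num_clusters = len(feature_counts)
    if num_procs < num_clusters:
        raise ValueError("need at least one processor per cluster")
    # Every cluster keeps at least one processor.
    alloc = [1] * num_clusters
    for _ in range(num_procs - num_clusters):
        # The most heavily loaded cluster (features per processor) borrows
        # the next available processor.
        neediest = max(
            range(num_clusters),
            key=lambda i: feature_counts[i] / alloc[i],
        )
        alloc[neediest] += 1
    return alloc


# Hypothetical feature counts for Clusters 1 through 4.
features = [2000, 1000, 3000, 2000]
allocation = apportion_processors(features, num_procs=8)
```

With these counts the third cluster ends up with three processors and the second with one, which matches the kind of remapping described for Figure 8(c).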
Results
ProperDRC was used to test for violations of the MOSIS Scalable CMOS design rules [32]. A total of 32 design rules were specified, which resulted in the generation of 64 intermediate layers to perform all of the necessary tests. The following platforms were used to generate performance measurements: a Sun Sparcserver 1000 shared-memory multiprocessor, a network of six Sun Sparcstations, and the CM-5 message-passing distributed-memory multiprocessor.
The benchmarks used to test ProperDRC include plapart, a programmable logic array with 25,000 rectangles; kovariks, a multiplier array with 64,000 rectangles; and haab1 and haab2, static RAMs containing 128,000 and 253,000 rectangles, respectively. An artificial benchmark, superhaab, was also created, which consists of the haab2 benchmark replicated four times, in an array of two cells by two cells, with 10 spacing between cells. Superhaab contains 1,014,000 rectangles. Tables 1 through 3 show the performance data for purely data parallel decomposition of the DRC. All execution times are measured in seconds. Dashes in the tables indicate that the processor configuration had insufficient memory to perform the DRC on the given circuit. The fact that the CM-5 was unable to operate on the haab1, haab2, and superhaab circuits with fewer than 8, 16, and 64 processors, respectively, illustrates the memory scalability of the ProperDRC algorithm. The results of the network of Sun workstations for very large circuits could not be reported since our ProperCAD library implementation on the network is unreliable for very large message sizes. (In other related work, we are working on a reliable port of the ProperCAD environment on a network of workstations.)
It is also interesting to note that every platform appears to exhibit superlinear speedups as the number of processors increases from one to two. This is especially apparent for the larger benchmarks running on the Sun Sparcserver 1000, which run six to seven times faster on two processors than on a uniprocessor. This effect is most likely due to cache effects, where the smaller working space requirement of the two-processor implementation results in
The performance results for the purely task parallel implementation of ProperDRC are given in Tables 4 through 6. A small number of processors provide good performance results, but the effectiveness of adding additional processors diminishes quickly, for any problem size. This is because the amount of task parallelism available is dependent only on the size of the set of technology rules being used, and not the size of the input file, as discussed in Section 4.2. It should be noted that task parallel layout verification cannot handle large problem sizes, since each processor has to replicate the entire mask layout, which eventually exceeds the memory available on each processor.
Tables 7 and 8 show the performance results using a combination of data and task parallelism. It is important to notice that there are cases in which a combination of data and task parallelism provides better performance than either type of parallelism individually. Compared to Table 3, the results on the 128 processor runs show that the combined task and data parallel approach gives better runtime performance than the purely data parallel approach. A detailed analysis of these results is presented in the following section.
The user-specified grain size controls the extent to which load balancing takes place in the purely data parallel decomposition of the DRC problem. Any region of the circuit having a number of geometry features greater than the grain size is subdivided into equal area regions, which may later be reassigned to different processors as necessary to facilitate load balancing. As discussed earlier in Section 4, choosing the optimal grain size is a hard problem. If the grain size is too large, the load will be unevenly balanced. If the grain size is too small, we will create a large number of tasks, but each task will generate some redundant work in the form of extra checks that must be performed at the boundaries of the partitions. A good heuristic is to choose a grain size of around N/(σP) rectangles, where N is the number of rectangles, P is the number of processors, and σ is the variance of the distribution of rectangles per unit area of the chip. We assume a typical value of σ to be 2 for real designs. The effect of varying the grain size for the purely data parallel decomposition is shown in Table 9 for the CM-5. The purely area-based circuit partitioning may be considered a degenerate case of the data partitioning strategy presented in this paper, in which the grain size is an infinite value, since the grain-size based partitioning is not invoked. For the haab1 circuit consisting of 128,000 rectangles, we show results of varying grain sizes for 5,000 and 1,000 rectangles. For example, for the 16 processor run, we show that the results are optimal for 5,000 rectangles (our heuristic picks 4,000 rectangles). Similarly, for the haab2 circuit consisting of 256,000 rectangles on 16 processors, the optimal grain size is 10,000 rectangles (our heuristic picks 8,000 rectangles).
We have obtained similar results on the SUN Sparcserver 1000 and network of workstations.
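As a small worked example, the grain-size heuristic can be evaluated directly. The sketch below assumes the heuristic has the form N divided by the product of the variance parameter and P, with the variance parameter set to the typical value of 2; this form reproduces the values of 4,000 and 8,000 rectangles quoted above. The function and parameter names are hypothetical.

```python
def heuristic_grain_size(num_rectangles, num_procs, sigma=2.0):
    """Grain size of about N / (sigma * P) rectangles.

    sigma stands for the variance of the rectangle density, taken as 2
    for real designs (assumed default).
    """
    return num_rectangles / (sigma * num_procs)


def needs_subdivision(region_rectangles, grain_size):
    """A region holding more features than the grain size is split further."""
    return region_rectangles > grain_size


haab1_grain = heuristic_grain_size(128_000, 16)
haab2_grain = heuristic_grain_size(256_000, 16)
```

For the 16-processor runs this gives 4,000 rectangles for haab1 and 8,000 rectangles for the 256,000-rectangle figure used for haab2, close to the measured optima of 5,000 and 10,000 rectangles.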
The concept of using task priorities to determine the order of execution for a set of DRC tasks was introduced in Section 4.2. Tasks are assigned higher priorities based on the number of levels of subsequent tasks that rely on the output of the task. Using an arbitrary ordering for tasks could result in a task schedule that produces more traffic and requires more waiting time than the prioritized schedule. Table 10 shows the effect of using the priorities to schedule tasks, as opposed to using random ordering. The performance figures are reported for the network of Sun workstations, but we obtained similar results on the Sun Sparcserver and the CM-5. The choice of a task ordering heuristic has no effect on the uniprocessor performance, as expected, because network traffic and processor idle time are not relevant concerns for uniprocessor execution. With two or more processors, the performance data illustrate that the prioritized task queue provides better performance.
A cluster remapping strategy was presented in Section 4.3 as a means of balancing the load between clusters when a combination of data and task parallelism is used. Ideally, the number of processors assigned to a given cluster is proportional to the number of geometry features inside that cluster's region of the circuit area. However, because the total number of processors is fixed, and sometimes small, the fraction of the available processors assigned to a cluster cannot always equal the exact fraction of the total number of geometry features that lie within the circuit area owned by the cluster. Having a larger number of processors available allows the fraction of the available processors assigned to the cluster to more closely approximate the fraction of geometry features in the cluster area and, therefore, facilitates more effective load balancing. Table 11 shows the effectiveness of the cluster remapping strategy. The load balancing strategy was most effective with a large number of processors on the CM-5, where the processor-to-cluster mapping has the most flexibility.
To analyze the performance results of ProperDRC, we will first examine the performance of the serial algorithms used to implement the various DRC operations. We will then proceed to examine the performance issues introduced by the parallelization of the DRC process.
Analysis of Serial DRC
ProperDRC uses the scanline operation developed by Lauther to perform Boolean mask operations [15], the edge sorting algorithm presented by Szymanski and Van Wyk [16], and the width/spacing clearance checking algorithm presented in this paper. Table 12 provides a summary of the complexities of the various algorithms used to perform the DRC operations.
Considering that the overall performance of the DRC is bounded by the performance of the most complex algorithms used by the DRC, the overall complexity for the DRC is O(N log N).
Analysis of Parallel DRC
The performance results demonstrate that both data parallelism and task parallelism can be applied to the DRC problem to achieve better performance and reduced memory requirements as compared to serial algorithms. Because neither of the two types of parallelism adversely impacts the effectiveness of the other, a combination of the two types of parallelism can be applied to achieve further parallelism. In practice, the ultimate goal is to achieve the best performance given an existing architecture. Table 13 shows some of the performance results measured on the CM-5 from the previous chapter, rearranged to show a comparison between using pure data parallelism and using a combination of data and task parallelism on a given number of processors. In the case of the combination of data and task parallelism, the number of processors per cluster given in the table is an average value; the processor-to-cluster mappings may be modified to balance the load between clusters, as discussed in Section 4.3.
The data in Table 13 show that neither of the two parallelization approaches is superior to the other in all cases. There are two counteracting factors that affect the relative performance of the two algorithms: The task parallel performance is limited by the complexity of the DRC algorithms, and the data parallel performance is limited by the extra work created by the overlapping processor areas. To illustrate the effect of the complexity of the DRC algorithms on the task parallel performance, consider a simplified case in which two processors are to be applied to perform a design rule check on a circuit with 2X geometry features. Assume perfect load balancing, whether data or task parallelism is used.
No matter which type of parallelism is used, the combination of the two processors must have the memory capacity to hold the entire circuit. Using the complexity of the slowest algorithms in the DRC procedure, the time necessary to perform the DRC on a circuit of problem size N is O(N log N), and the minimum amount of working space required for the DRC algorithms is O(√N). We will use the term working space to distinguish between the amount of storage required by a single processor to perform the various DRC operations, and the amount of storage required by the entire set of processors performing the DRC operations to hold the whole circuit geometry, which is fixed at O(N) for the entire set of processors.
If data parallelism were used to divide the circuit's geometry features equally between the processors, neglecting the overlapping processor areas for the moment, each of the processors would perform a DRC on a subregion of the circuit with X geometry features. The total run time would be O(X log X), because this amount of time is necessary for each processor to perform local design rule checking simultaneously. The demand for work space at each of the processors is O(√X).
In the task parallel implementation, the various DRC tasks would be divided equally between the two processors. Both processors would be working on a problem size of 2X. The time required for the DRC would be O(1/2 · (2X) log(2X)) = O(X log(2X)). The working space requirement for each processor would be O(√(2X)). Therefore, the task parallel version of the DRC requires slightly more time to run and more working space. These penalties are O(constant), but nonetheless indicate that the purely data parallel implementation provides the better performance when overlapping processor areas are disregarded.
Now, let us take the effect of overlapping processor areas into consideration. Note that the actual area assigned to processors whose regions are on the outside boundaries of the circuit is actually slightly lower. These slight area discrepancies can be safely ignored, because a smaller fraction of the processors are on the boundary as the total number of processors increases, and the performance of the algorithm on the circuit as a whole will be bounded by the processors with the highest areas. Disregarding these area discrepancies, the total area operated on by the set of P processors is A + kc√(AP) + 4c
In addition to the penalty for the complexity of the DRC algorithms associated with the task parallelism, there is also the overhead of interprocessor communication, whereas the purely data parallel decomposition of the problem requires no communication while the DRC checks are being performed, although some communication is necessary during the data partitioning phase to perform the load balancing between processors.
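The two-processor comparison between data and task parallelism on a circuit with 2X geometry features can be checked numerically with the short sketch below. It is an illustration only: the constants hidden by the O-notation are taken to be 1 and logarithms are taken base 2, both of which are assumptions, and the overlapping processor areas and communication costs are ignored.

```python
import math


def data_parallel_estimate(x):
    """Each processor checks X features: time X log X, working space sqrt(X)."""
    return x * math.log2(x), math.sqrt(x)


def task_parallel_estimate(x):
    """Each processor runs half of the tasks on all 2X features."""
    time = 0.5 * (2 * x) * math.log2(2 * x)  # equals X log(2X)
    space = math.sqrt(2 * x)
    return time, space


x = 1024  # hypothetical number of features per processor
data_time, data_space = data_parallel_estimate(x)
task_time, task_space = task_parallel_estimate(x)

extra_time = task_time - data_time      # X * log2(2) = X under these assumptions
space_ratio = task_space / data_space   # sqrt(2), independent of X
```

Under these assumptions the task parallel version costs X log(2X) instead of X log X, and needs a working space larger by a factor of √2, which is consistent with the small constant penalties noted in the analysis.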
The experimental results also show that load balancing is more difficult to attain for the task parallel implementation as compared to the data parallel implementation. Consider that the task parallel DRC on the plapart benchmark on the network of Sun workstations took longer with five processors than with either four or six. The actual distribution of the geometries between the different mask layers is much more critical for the task parallel implementation than for the data parallel version. The particular distribution in the plapart benchmark apparently presented a load balancing problem for the particular order in which the layer operations were divided between five processors.
Conclusion
In this paper we have applied the concept of integrating task and data parallelism in an irregular application, namely VLSI layout verification, in a tool called ProperDRC. ProperDRC is able to exploit multiple levels of parallelism. It can utilize data parallelism, task parallelism, or a simultaneous combination of the two types of parallelism to perform design-rule checking (DRC) operations concurrently on a multiprocessor architecture. Another contribution of the parallel application is that it is portable across a large number of parallel platforms, including shared memory multiprocessors, message-passing distributed memory multiprocessors, and networks of workstations.
A number of areas in parallel design rule checking should be explored in the future. Ideally, a DRC tool should be able to exploit the hierarchy of large designs. Performing DRC on a flattened layout representation may result in much redundant work if individual cells in the design are instantiated a large number of times, as is often the case with library cell-based designs.
ProperDRC should be expanded to handle non-Manhattan layout geometries. Many of the algorithms used in ProperDRC would require some additional work to be capable of operating on non-Manhattan designs.
When such increased capabilities are included in ProperDRC, we can perform an effective comparison of the runtimes of ProperDRC versus commercial layout verification tools such as DRACULA and VAMPIRE from Cadence Design Systems, and CHECKMATE and PARADE from Mentor Graphics. Conceptually, the approaches of combined task and data parallelism should be applicable to any commercial layout verification tool, but the exact nature of the performance gains will depend on the actual implementation. We are in the process of interacting with developers at Cadence to transfer the parallel algorithms in ProperDRC into practice [13].
