Simulated annealing based standard cell placement for VLSI designs has long been acknowledged as a compute-intensive process, and as a result several research efforts have been undertaken to parallelize this algorithm. Parallel placement is most needed for very large circuits. Since these circuits do not fit in memory, the traditional approach has been to partition and place individual modules. This causes a hit in placement quality in terms of area and wirelength.
Introduction

Motivation
With the rapid advances in VLSI process technology, circuit design is becoming increasingly complex and in turn is placing ever higher demands on CAD tools. Designs containing millions of transistors are typical, and it is expected that designs will approach 100 million transistors by the end of the decade. The computational resources needed to effectively design these circuits are enormous. Each of the different phases in the VLSI design process can take several hours to several days using existing CAD algorithms in current process technologies. As the size of these designs grows, the CAD tools become increasingly taxing on the memory resources of computers. As a result, with many modern designs, it is not possible to effectively use existing CAD tools on the entire design because of memory shortage.
Parallel processing is fast becoming an attractive solution to reduce the inordinate amount of time spent in VLSI circuit design. This fact has been recognized by several researchers in VLSI CAD as evidenced in the recent literature for cell placement, floorplanning, circuit extraction, test generation, fault simulation, logic synthesis, etc. [3]. Parallel processing can also address the memory issue by using the distributed memory resources on a multiprocessor.
In this paper, we examine one phase of the design process in detail, namely, standard cell placement. Placement of standard cells is particularly expensive because of the inherent compute intensive nature of simulated annealing, one of the more popular approaches used for cell placement. There have been several attempts to parallelize this algorithm, usually with quality results that do not compare to the best available sequential algorithm, or with speedups that are not acceptable.
Most previous work in parallel placement has minimized area and wirelength, but in current submicron designs (0.25 micron and less), wirelength delay is more important. The algorithm discussed in this paper is the first parallel algorithm to report results for timing driven placement. For current high density circuits, minimizing area is no longer sufficient. The delays associated with the wiring elements are more critical to the performance of the circuit; thus, steps must be taken to minimize these delays. Timing driven placement is the process of simultaneously minimizing the circuit area as well as minimizing the critical path delays. We have used a very accurate Elmore delay model, which is more compute intensive, and hence the need for parallel placement is more apparent.
In addition, parallel placement is most needed for very large circuits. Since these circuits do not fit in the memory of standard workstations, the traditional approach has been to partition and place individual modules. This causes a hit in placement quality in terms of area and wirelength. Moreover, this cannot be done for timing driven placement since paths go across partitions. Our algorithm is circuit partitioned and can handle arbitrarily large circuits on distributed memory multiprocessors. This circuit-partitioned approach provides speedups to larger numbers of processors with little loss of quality.
Because of the wide variety of parallel architectures and programming models, there has been a significant amount of work in creating a standard interface for writing portable message passing software. Among these are the Message Passing Interface (MPI) [33, 34], Parallel Virtual Machine (PVM) [19], and p4 [9]. MPI seems to be gaining support as the de facto standard for writing message passing code. It is an inclusive standard that supports virtually all styles of send-receive communication, group protocols, reductions, and noncontiguous data structures. In light of the popularity of MPI, we have implemented the circuit partitioned timing driven placement algorithm using the MPI protocol.
The remainder of the paper is organized as follows. The following section covers some of the background and previous work. Section 3 describes the circuit partitioned parallel algorithm for placement and Section 4 covers the algorithm for timing driven placement. Section 5 presents results and we conclude in Section 6.
Background and Related Work
Standard cell based design methodology allows a designer to build his or her design from a library of predefined modules or cells. The placement problem involves placing these cells on a VLSI layout, given a netlist that provides the connectivity between each cell and a library containing layout information for each type of cell. This layout information includes the width and height of the cell, the location of each pin, the presence of equivalent pins, and the possible presence of feed through paths within the cell. The primary goal of cell placement is to determine the best location of each cell so as to minimize the total area of the layout, the length of the nets connecting the cells together, and the delays in the critical path of the design. With standard cell design, the layout is organized into equal height rows, and the desired placement should have equal length rows.
Simulated Annealing
One of the more powerful algorithms for standard cell placement has been simulated annealing. Other comparable placement algorithms include quadratic based methods such as PROUD and GORDIAN [29, 47] . We have developed parallel algorithms for these methods in other work [49] . Simulated annealing is an iterative optimization strategy that starts with a system in a disordered state, and through perturbations of the state, brings the system gradually to a low energy, and thus optimal, state [27, 37] . The energy is a cost function of the system that is to be minimized. In the context of cell placement, perturbations are simply moves of the cells to different locations on the layout, and the energy is an approximated layout cost function.
As moves are made, any move that reduces the cost function is accepted. However, simulated annealing, unlike greedy algorithms, will also allow moves that increase the cost. The effect of this change is to allow the solution to escape from local minima. In cases where the cost is increased, the new state is accepted with probability
$$P = e^{-\Delta C / T},$$
where $\Delta C$ is the change in the cost or energy and $T$ is the temperature of the system. The temperature is an analog of the effect of temperature in crystal annealing. We start with an extremely high temperature to allow nearly all moves to be accepted. Gradually, the temperature is reduced until a termination condition is reached.
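The acceptance rule can be illustrated with a minimal Python sketch. It assumes the standard Metropolis form, in which a cost-increasing move is accepted with probability exp(-delta/T); the function name `accept_move` and its arguments are hypothetical and are not taken from TimberWolfSC.

```python
import math
import random


def accept_move(delta_cost: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis-style acceptance test for one proposed move (illustrative sketch)."""
    if delta_cost <= 0:
        # Moves that do not increase the cost are always accepted.
        return True
    if temperature <= 0:
        # At zero temperature the procedure degenerates to a greedy search.
        return False
    # Cost-increasing moves are accepted with probability exp(-delta / T).
    return rng.random() < math.exp(-delta_cost / temperature)
```

At a very high temperature nearly every move passes this test, and as the temperature falls the test approaches greedy descent, which matches the behavior described above.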
Theoretical studies show that simulated annealing is guaranteed to reach an optimal solution given enough time and proper monitoring of the temperature or annealing schedule. To achieve this, at each temperature, the system must be at equilibrium before the temperature is lowered again. In computing applications, it is impractical to wait for the system to achieve equilibrium before changing the temperature, so heuristics are used to develop a fast and near optimal schedule [1, 21, 26, 31] .
TimberWolfSC
One of the better known implementations of simulated annealing for placement has been the TimberWolfSC cell placement tool [39, 41, 51]. The TimberWolfSC cost function is defined in Eq. (1).
Moves are generated by choosing a random cell and then displacing it to a random location on the layout. If a cell is already present at the new location, the two cells are exchanged. A temperature dependent range limiter is used to limit the distance over which a cell can move. Initially, the span of the range limiter is set such that a cell can move anywhere on the layout. Subsequently, the span is decreased logarithmically with temperature. These range limiter updates are made at the end of each of the 160 iterations into which TimberWolfSC segments the simulated annealing procedure. As the algorithm progresses, the temperature is gradually decreased by forcing the acceptance rate to follow a theoretically derived schedule that attempts to keep the acceptance rate close to 44% during the middle region of annealing [31] . TimberWolfSC 6.0 also uses row bins to aid in the computation of overlap and row penalties, and early rejection methods are used to speed up the decision process [40] .
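The range limiter can be sketched as follows. The text states only that the span shrinks logarithmically with temperature, so the specific form log(T)/log(T0), the floor `min_span`, and both function names are assumptions made for illustration.

```python
import math
import random


def limiter_span(full_span, temperature, initial_temperature, min_span=1.0):
    """Assumed logarithmic shrink: span scales with log(T) / log(T0).

    Requires initial_temperature > 1 so that log(T0) is positive.
    """
    if temperature >= initial_temperature:
        return full_span
    ratio = math.log(max(temperature, 1.0)) / math.log(initial_temperature)
    return max(min_span, full_span * ratio)


def propose_target(x, span, layout_width, rng):
    """Pick a random target within [x - span, x + span], clamped to the layout."""
    low = max(0.0, x - span)
    high = min(layout_width, x + span)
    return rng.uniform(low, high)
```

With `span` equal to the layout width, the window covers the whole row, which corresponds to the initial setting in which a cell can move anywhere.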
Timing Driven Placement Algorithms
The placement algorithm discussed so far deals only with wirelength minimization and indirectly area minimization.
For current high density circuits, this cost function is no longer appropriate. As VLSI designs grow and feature sizes shrink, overall circuit performance becomes more closely related to the interconnect timing characteristics. Timing driven placement is the process of simultaneously minimizing the circuit area as well as minimizing the critical path delays. During the physical design process, a common heuristic used is to minimize total net wirelength as an approximation of the area. Minimizing the wirelength of a net would seem to minimize the effect of interconnect delays.
However, timing is not determined solely by the delays of individual nets, but instead by a sequence of nets or a signal path. Moreover, only nets on the longest or critical paths in the circuit are of concern.
There have been two general approaches to timing driven placement: net based [18, 32, 52] and path based [14, 22, 24, 42, 45]. Net based algorithms identify critical paths a priori and assign criticality weights or upper bounds for each net in the path, and then guide the placement process based on these bounds. The pre-timing analysis may not be able to effectively select measures for each net; thus, the placement quality may suffer. Path based approaches address this problem by doing complete path delay analysis during placement. This is, of course, at the cost of an increase in computational complexity.
The other critical component of timing driven placement is the choice of the delay model to approximate interconnect behavior. To simplify calculation, most previous approaches have used a basic delay model. The simplest models are based on the assumption that pin to pin delay is proportional to the net wirelength [18, 44]. Other more detailed models use a simplified RC model where wire capacitance can be computed either as a function of fanout or proportional to wire length [14, 22, 24, 45]. These models do not take into consideration wire resistance, which is becoming more and more important as designs scale down to submicron features. Additionally, these models do not account for driver pin location or the delay characteristics of a distributed RC tree.
Parallel Annealing Algorithms for Placement
Because of the inherent computational costs associated with simulated annealing, several methods have been proposed for the parallelization of the procedure [15]. Using the taxonomy defined in [20], there are three major classes of parallel simulated annealing algorithms: serial-like, asynchronous, and altered generation.
Serial-like algorithms preserve the convergence characteristics of the sequential algorithm through the use of single move acceleration or serializable subsets. Kravitz and Rutenbar have investigated both approaches and found that these algorithms have limited parallelism and are more appropriate for shared memory architectures [30] .
The second class of parallel simulated annealing techniques, altered generation, is distinguished from serial-like algorithms in that they do not follow the exact search space laid out by the sequential algorithm. This is usually accomplished with a processor or group of processors either exploring a restricted state space or using a restricted search on the entire state space. To ensure proper global convergence, the global state is kept up to date through periodic solution exchanges or with a shared memory architecture. Parallel placement algorithms using this strategy for shared memory machines include work by Darema et al. [13] and Natarajan and Kirkpatrick [35]. Sun and Sechen have recently shown results achieving near linear speedup on a network of workstations [43]. This method shows great promise for a few processors with large amounts of memory, but it does not adequately address very large circuits distributed across processors with lesser amounts of memory.
The final class of parallel simulated annealing algorithms is the asynchronous or "parallel moves" algorithm where each processor generates and evaluates moves independently. This differs from altered generation methods in that the state space is not restricted. In other words, each processor contains information on the entire circuit regardless of whether the global layout information is accurate in the local processor. Obviously, the cost function calculations may be incorrect because of the moves made by the other processors. There are various methods to address the effect of error, but all involve some form of periodic updates. The number of updates is directly related to the average acceptance rate of the particular annealing schedule chosen. Banerjee, Jones and Sargent [4] implemented a parallel placement algorithm using the parallel move approach on an Intel hypercube multiprocessor and proposed several partitioning strategies for the problem specific to the hypercube topology. Speedups of up to 12 on 16 processors were reported. Rose et al. [36] proposed a parallel algorithm on an experimental distributed memory multiprocessor. In that algorithm, they replaced the high temperature portion of the parallel simulated annealing placer with a placement program based on a min-cut algorithm and used a parallel moves strategy for lower temperatures. Speedups of 4 on five processors were reported.
The only reported instance of large scale parallelism being applied to cell placement is the use of parallel moves for SIMD machines. Both Casotto and Sangiovanni-Vincentelli [11] and Wong and Fiebrich [48] have presented similar parallel simulated annealing placement algorithms for the SIMD Connection Machine. These methods fall in between a completely asynchronous approach and the altered generation methods. By completely distributing the circuit state, the necessity for global updates is removed, while still allowing for asynchronous parallel moves.
Most previous approaches to parallel placement have tried to minimize area or wirelength, but with current designs performance driven placement has become more critical. Our paper is the first to address the issues of performance driven placement and also to do so in a circuit partitioned manner.
Circuit Partitioned Approach
As circuit sizes increase, it becomes more and more infeasible to adequately address VLSI CAD problems on a single processor as a result of inordinate memory requirements. This is especially true in the area of cell placement where design sizes are approaching 100,000 cells and more. In the past, designers have partitioned the circuit into more manageable subcircuits and then suffered a loss of quality on the placement of the subdivided circuit. In this section, a method is presented that takes advantage of the large memory resources spread across multiple processors in a parallel processing machine and still achieves faster placement times without loss of quality. The implementation of this circuit partitioned parallel algorithm developed using MPI is called mpiPLACE.
Data Structures and Distribution
We have developed the mpiPLACE algorithm based on the TimberWolfSC 6.0 program implementation. The latest version of TimberWolfSC, 7.0, could not be used since the source code was not available. The concepts of parallelization, though, will hold for the newer version as well. To understand the parallelization procedure some further explanation of the data structures used in TimberWolfSC 6.0 is necessary. The circuit information is described primarily with the use of three arrays: the list of cells, the list of nets, and finally an array describing row information.
Each cell data structure contains positional information as well as a linked list of pins that belong to the cell. Likewise, each net data structure has bounding box information as well as a linked list of pins that belong to the net. The pin data structures are shared by both the cell and net linked lists.
The circuit is read in on a single processor, and as each cell and net is read in, the associated data structures are distributed to the other processors. The process of determining which cells and nets are assigned to which processor is done using a prepartitioning phase. There are two primary concerns in our partitioning. First, the load balance must be maintained, i.e., each partition should have roughly the same number of cells. Second, the number of nets cut should be minimized to decrease the interaction between partitions. The necessity of these requirements will become clear in the following section on parallelization. Ratio cut partitioning methods have long been used in the CAD community because of their effectiveness at reducing the cut size. However, these methods are inappropriate for our use because they do not provide well balanced partitions. We have instead used a partitioning algorithm based on the Sanchis modification of the Fiduccia-Mattheyses algorithm [17, 38]. Graph partitioning methods such as recursive spectral bisection or METIS may alternatively be used [5, 25]. The pads are treated as a special case and are not partitioned but instead placed on one processor. Since the number of pads is small relative to the number of cells in a circuit, keeping the pads on one processor does not have a significant effect on memory usage.
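The two partitioning concerns, load balance and net cut, can be measured with a short Python sketch. It only evaluates a given partition and is not the Sanchis or Fiduccia-Mattheyses algorithm; the function name and the dictionary-based data layout are hypothetical.

```python
from collections import Counter


def partition_stats(cell_to_part, nets):
    """Return (cells per partition, number of cut nets) for a given assignment.

    cell_to_part maps a cell name to its partition index;
    nets maps a net name to the list of cells it connects.
    """
    sizes = Counter(cell_to_part.values())
    # A net is cut when its cells lie in more than one partition.
    cut = sum(
        1 for cells in nets.values()
        if len({cell_to_part[c] for c in cells}) > 1
    )
    return dict(sizes), cut
```

For example, with cells `a`, `b` on partition 0 and `c`, `d` on partition 1, a net joining `b` and `c` is counted as cut, while nets joining `a`-`b` or `c`-`d` are not.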
Parallel Algorithm
Once the circuit has been read in and distributed, the annealing procedure can begin. Each processor then performs simulated annealing on its partition of the circuit. Since the circuit has been partitioned, moves are only attempted on cells available on the local processor. When a request is made to evaluate the cost of a cell move, some of the connecting net data structures may be located on another processor. One option would be to send a message to the appropriate processor to request a calculation of the cost, but the communication overhead makes this prohibitive. Therefore, nets that span multiple processors are replicated locally. An example of these "crossing" nets is shown in Figure 2.
Error Control
As described above, each processor independently places its partition of cells without concern for the remainder of cells. Obviously, this can cause inaccuracy in the calculation of cost. Since there are three components to the TimberWolfSC cost function, there are likewise three main cost errors: wirelength cost, overlap penalty cost, and row penalty cost.
Wirelength error
Though the effect of the wirelength error is decreased because of the partitioning to minimize net cut, at high temperatures with frequent movement of cells, it is clear that the error will be significant at the partition borders. The "border" in this context is the cutline and may bear no resemblance to the actual geographical placement border. In order to keep the processors up to date with respect to wirelength cost, at fixed intervals, updates of this border information are made in a two-stage process. In the first stage, each processor sends its pins that are on "crossing" nets to the owner of the net. Once all the foreign pins for a net have been received, the net information is distributed to all the processors that have a copy of that net.
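The two-stage update can be simulated in a single process to show the data flow. The sketch below assumes that the distributed net information is a bounding box and uses plain dictionaries in place of MPI messages; all names are hypothetical.

```python
def two_stage_pin_update(local_pins, net_owner):
    """Simulate the two-stage pin update without real message passing.

    local_pins: {processor: {net: [(x, y), ...]}} pins each processor holds.
    net_owner:  {net: processor} owner of each crossing net.
    Returns {processor: {net: (xmin, ymin, xmax, ymax)}}.
    """
    # Stage 1: every processor sends its pins on crossing nets to the net owner.
    inbox = {}
    for proc, nets in local_pins.items():
        for net, pins in nets.items():
            inbox.setdefault(net_owner[net], {}).setdefault(net, []).extend(pins)

    # The owner combines all received pins into one bounding box per net.
    boxes = {}
    for owned in inbox.values():
        for net, pins in owned.items():
            xs = [x for x, _ in pins]
            ys = [y for _, y in pins]
            boxes[net] = (min(xs), min(ys), max(xs), max(ys))

    # Stage 2: the owner distributes the result to every processor with a copy.
    return {
        proc: {net: boxes[net] for net in nets}
        for proc, nets in local_pins.items()
    }
```

In an actual MPI implementation each stage would be a round of sends and receives, so both stages act as synchronization points among the processors that share a net.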
Overlap penalty error
In addition to the wirelength error, even with a proper partitioning, overlap penalty error is still a serious problem.
Without knowing the overlap penalties due to cells on other processors, each partition will tend to collapse to the center of the layout. Before discussing how the penalty error is managed, we first describe how the overlap penalty is kept in the serial program. Each row is divided into a set of equal sized structures called bins. Each bin is on the order of the size of an average cell. Each bin keeps track of the amount of overlap a particular cell has into that bin. This is shown in Figure 3. Using these bins, it is possible to quickly determine an estimate of the cell overlap in the circuit.
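A bin-based overlap estimate can be sketched as follows. The exact penalty computed by TimberWolfSC is not given here, so the estimate used below (occupied width in excess of the bin width, summed over bins) is an assumption, as are the function names.

```python
import math


def bin_occupancy(cells, row_length, bin_width):
    """Accumulate, per bin, the width of every cell that falls inside it.

    cells is a list of (left_edge, width) pairs for one row.
    """
    n_bins = math.ceil(row_length / bin_width)
    occupancy = [0.0] * n_bins
    for left, width in cells:
        right = left + width
        first = max(0, int(left // bin_width))
        last = min(n_bins - 1, int((right - 1e-9) // bin_width))
        for b in range(first, last + 1):
            lo, hi = b * bin_width, (b + 1) * bin_width
            occupancy[b] += max(0.0, min(right, hi) - max(left, lo))
    return occupancy


def overlap_estimate(occupancy, bin_width):
    """Assumed estimate: any occupied width beyond the bin width is overlap."""
    return sum(max(0.0, occ - bin_width) for occ in occupancy)
```

Two cells of width 4 placed at 0 and 2 in a row of length 8 with bins of width 4 give occupancies of 6 and 2, so the estimate is 2, which equals their true overlap in this simple case.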
In a parallel setting, these cells are distributed, so it is not possible for the bins to have information about all cells.
Instead, each bin is told about the location of a foreign cell using a data structure that we call a "fixed" cell. This cell data structure maintains only the width of the cell, so it can be shared among all bins that have foreign cells of that length. By doing so, we have enough information to maintain overlap information without replicating cell information.
Of course, the presence of a foreign cell in a particular bin will gradually become inaccurate, but the positions of the fixed cells are maintained by performing cell position updates. Because of the cost of these updates, they are done very infrequently, at each TimberWolfSC iteration. To further reduce the overlap penalty error, when the cells from all the other processors have been received, each processor individually removes the overlaps by shifting cells appropriately.
Row penalty error
The pin and cell updates address the wirelength and overlap errors, but they do not adequately address the row penalty error. This error is particularly severe because of a peculiar "ping-pong" effect. The row penalty is used to force the final placement to have equal length rows, and in a parallel environment this can cause problems. Take, for example, the situation in Figure 4(a). Row 1 is too short and row 4 is too long; thus, all the processors will try to move cells from row 4 to row 1. By the next iteration, row 1 has become too long and row 4 is now too short (Figure 4(b)). It is clear that this type of row shifting will continue without making any real progress in improving the placement.
We address this problem with three methods. The first is based on the observation that each processor is trying to satisfy a short row without realizing that other processors are doing the same thing. Therefore, we decrease the desired row length to take account of this. Depending on the range limiter, each processor is expected to contribute only part of the cells required to equalize a short row. For example, the placement from Figure 4(a) is redrawn in Figure 5 with a shorter desired row length.
The second method of addressing the row penalty error is to update the actual row sizes at distinct intervals. Since the amount of data sent is minimal, these updates can be done frequently without a loss of performance. These row updates are done using a lazy propagation update method.
The final heuristic to reduce the row penalty error takes advantage of the penalty feedback mechanism built into TimberWolfSC. Recall from Eq. (1) that the weight of the row penalty in the cost function is adjusted with a sophisticated feedback mechanism. Using experimental observations, the authors of TimberWolfSC have determined the optimal row penalty for each iteration, and then they adjusted the feedback coefficient so that the annealing schedule was close to this target penalty. Equation (2) shows the target penalty calculation, which is expressed in terms of the total row length.
In a parallel setting, our experiments have shown that this target penalty is not sufficient. For example, Figure 6(a) shows the row penalty for the primary2 circuit for four processors plotted against the iteration number. For comparison, the TimberWolfSC target penalty is plotted. As can be seen, the target penalty is off considerably in the earlier iterations. This deviation affects the row penalty feedback coefficient considerably and thus the cost function in Eq. (1). We therefore adjust the target penalty for the parallel setting using a break point. We have found that this adjustment has a tremendous positive effect on the quality of the circuit placement. Note from Figure 6(b) that with this modification, the row penalty error is much better controlled.
Dynamic error control
The error control mechanisms described above rely heavily on the use of updates. The cell updates are performed at fixed intervals; however, the absolute frequency of the row and pin updates was not specified. We use a mechanism called dynamic error control, where the frequency of the updates is adjusted according to the amount of error present.
Several researchers have determined that bounding the accumulated error to a constant factor of the temperature will still guarantee convergence [4, 21] .
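One way such a controller could work is sketched below. The halving and doubling rule, the `bound_factor` constant, and the interval limits are all assumptions chosen for illustration; the text specifies only that update frequency follows the amount of error and that the error should stay within a constant factor of the temperature.

```python
def next_update_interval(interval, observed_error, temperature,
                         bound_factor=0.1, min_interval=1, max_interval=10_000):
    """Adjust the number of moves between updates (hypothetical policy).

    The accumulated error is compared with bound_factor * temperature.
    """
    limit = bound_factor * temperature
    if observed_error > limit:
        # Too much error: update more often by halving the interval.
        return max(min_interval, interval // 2)
    if observed_error < 0.5 * limit:
        # Error is comfortably low: update less often to save communication.
        return min(max_interval, interval * 2)
    return interval
```

Because the limit scales with the temperature, a rule of this kind tolerates more absolute error early in the schedule and tightens the bound as annealing proceeds.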
Dynamic Redistribution
The final element of our parallel algorithm is the dynamic redistribution. As the annealing schedule proceeds, the initial partition becomes less relevant since it corresponds very little to the geographic partitioning of the rows.
While the initial partitioning does reduce the amount of communication in terms of pin updates, the partition being spread across many rows can affect the row penalty calculations as well as cell mobility. For this reason, it is a good idea to repartition the cells so that the partition actually reflects the geographical row-based partition. The repartitioning is started only after the cells have settled within some proximity to their final destinations. This is done so that the net cut set will be reduced. Through empirical evidence, we have determined that repartitioning should begin at the 40th TimberWolfSC iteration, and be repeated every four iterations thereafter until iteration 120. After this point, the range limiter is so small that cells no longer move out of a row, so repartitioning is no longer necessary.
The first phase of repartitioning involves assigning the rows to different processors and then distributing the cells on those rows to the appropriate processors. This process, of course, synchronizes the processors. The next phase is to distribute the nets, which are assigned to the partition that contains the most cells from a particular net. This is done to limit the need for pin updates. The final phase is to distribute the copies of the nets where necessary. After all data structures have been distributed, each processor continues its annealing with its new set of cells and nets.
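The three repartitioning phases can be sketched in Python. The contiguous block assignment of rows to processors and the tie-breaking rule for net ownership are assumptions, and the function and variable names are hypothetical.

```python
from collections import Counter


def repartition(cell_row, nets, n_rows, n_procs):
    """Row-based repartitioning sketch.

    cell_row maps each cell to its current row index;
    nets maps each net to the list of cells it connects.
    Returns (cell_owner, net_owner, net_copies).
    """
    # Phase 1: assign contiguous blocks of rows to processors; cells follow rows.
    row_owner = {r: min(r * n_procs // n_rows, n_procs - 1) for r in range(n_rows)}
    cell_owner = {cell: row_owner[row] for cell, row in cell_row.items()}

    net_owner, net_copies = {}, {}
    for net, cells in nets.items():
        counts = Counter(cell_owner[c] for c in cells)
        # Phase 2: the net goes to the processor holding most of its cells
        # (ties broken by lowest processor index, an arbitrary choice).
        best = max(counts.values())
        owner = min(p for p, k in counts.items() if k == best)
        net_owner[net] = owner
        # Phase 3: every other processor touching the net receives a copy.
        net_copies[net] = sorted(set(counts) - {owner})
    return cell_owner, net_owner, net_copies
```

Assigning each net to the processor that holds most of its cells keeps the number of foreign pins, and hence the pin update traffic, small.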
Algorithm Analysis
The mpiPLACE algorithm is summarized in Figure 7. It is parameterized by the number of moves attempted between pin updates and the number of moves attempted between row updates. These values are adjusted dynamically.
The major overhead contributions in mpiPLACE are due to the communication involved in the updates of the pins, rows, and cells. Assuming a balanced distribution of cells, each processor will attempt [...] yield the best tradeoff between quality and speedup. The dynamic error control mechanism will adjust these values to reduce the potential error during the early high temperature iterations.
Timing Driven Approach
In this section we will extend the circuit partitioned placement algorithm to handle performance driven issues as well.
Before discussing the parallel algorithm, we briefly cover the delay model and algorithm used for serial timing driven placement.
Timing Analysis
Delay model
The best method available for accurate timing analysis of circuits is the SPICE circuit simulation tool. Because the computation requirements of SPICE are prohibitive, it is impractical to use it during placement. Linear delays proportional to net length are more tractable but also more inaccurate. We instead use the Elmore delay model on the RC tree of each net. In this model, $e_i$ is the edge from pin $i$ to its parent, and $c_{e_i}$ and $r_{e_i}$ are the capacitance and resistance, respectively, along that edge. $C_i$ is the tree capacitance at pin $i$, in other words, the sum of all edge and sink capacitances on the tree rooted at pin $i$. The delays $t_i$ for all sink pins can be calculated in a two-phase process. In the first step, the delays for each edge are calculated in a depth first search of the tree; likewise, in the second traversal of the tree, the edge delays are summed up for each pin. Each step is an O(n) process where n is the number of pins on the net.
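The two-phase computation can be sketched in Python. The sketch assumes the common textbook form of the Elmore delay, in which each edge on the source-to-sink path contributes its resistance times the sum of half its own capacitance and the tree capacitance below it; the half-capacitance term, the omission of a driver resistance, and all names are assumptions rather than details taken from the text.

```python
def elmore_delays(children, edge_r, edge_c, sink_c, root):
    """Two-pass Elmore delay on an RC tree (illustrative sketch).

    children: {pin: [child pins]}
    edge_r, edge_c: resistance and capacitance of the edge from a pin to its parent
    sink_c: {pin: sink capacitance}; pins that are absent contribute 0
    """
    # Depth first pre-order listing: every pin appears after its parent.
    order, stack = [], [root]
    while stack:
        pin = stack.pop()
        order.append(pin)
        stack.extend(children.get(pin, []))

    # Phase 1 (children before parents): tree capacitance and per-edge delay.
    tree_cap, edge_delay = {}, {}
    for pin in reversed(order):
        cap = sink_c.get(pin, 0.0)
        for child in children.get(pin, []):
            cap += edge_c[child] + tree_cap[child]
        tree_cap[pin] = cap
        if pin != root:
            edge_delay[pin] = edge_r[pin] * (edge_c[pin] / 2.0 + cap)

    # Phase 2 (parents before children): sum edge delays along each path.
    delay = {root: 0.0}
    for pin in order:
        for child in children.get(pin, []):
            delay[child] = delay[pin] + edge_delay[child]
    return delay
```

Each pin is visited once per phase, so the cost is linear in the number of pins on the net.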
It is clear that the routing structure of the net significantly affects the computation of delays. For example, the net in Figure 8(a) can be routed alternatively as in Figure 8(b), and the equivalent RC models are shown in Figure 9. At node 3, there is an implicit Steiner pin that has no sink capacitance. The delays from source to sink therefore differ between the two routings.
The example also makes it clear that it is important that the routing be known before an accurate delay can be computed. Several strategies exist to construct near-optimal Steiner trees for improved performance during the routing process [2, 7, 12] . However, during placement, it is impractical to use any of these algorithms to optimally route each net because of the computation time required to do so.
Instead, we quickly approximate the Steiner tree by building a trunk based tree rooted on the source node. The bounding box of each net is partitioned into a 4x1 grid as shown in Figure 10, and segments are built off a vertical trunk. This Steiner tree construction methodology is very quick and can be used effectively during the placement. The majority of nets in most designs have two or three terminals. For two terminal nets, the trees generated by this heuristic are obviously optimal. For three terminal nets, however, the approximate Steiner tree may not be optimal. Since optimal Steiner trees can be quickly created for three terminal nets, we treat these nets as a special case and generate optimal trees. By doing so, we can ensure that 75% to 95% of nets in a typical design will have optimal delay trees.
Path delay analysis
The previous section showed how to calculate the pin to pin delay on a particular net. In this section, we describe the methodology to compute path delays. As described in [23, 28], we use a block oriented technique in which all the cells are levelized, and then cell and net delays are processed in block order. Circuit levelization is based on a simple breadth-first topological sort algorithm. From the primary inputs, a breadth-first search is initiated such that each edge is traversed once. As each node is visited, the maximum level is assigned to that node, and then when all edges incident on that node have been traversed, every fanout from that node is explored. Delays are computed by processing the cells in levelized order and then calculating the delays on each pin, by first calculating the cell delay and then the net delay. Once all delays have been computed, the maximum output delay can be determined by examining delay times at each output pin. The longest path can easily be arrived at by tracing back from the output pin with the maximum delay. This algorithm is also O(n) where n is the number of cells.
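The levelized delay propagation and traceback can be sketched as follows. The graph is simplified so that each node is a cell with a single delay and each edge carries one net delay, which is an assumption; the function and argument names are hypothetical.

```python
from collections import deque


def longest_path(fanout, cell_delay, net_delay):
    """Breadth-first topological delay propagation with traceback (sketch).

    fanout: {cell: [driven cells]}
    cell_delay: {cell: delay through the cell}, with one entry per cell
    net_delay: {(driver, sink): interconnect delay}
    Returns (maximum output delay, longest path as a list of cells).
    """
    indegree = {cell: 0 for cell in cell_delay}
    for outs in fanout.values():
        for sink in outs:
            indegree[sink] += 1

    # Primary inputs start at their own cell delay; all others are unset.
    arrival = {c: (cell_delay[c] if indegree[c] == 0 else float("-inf"))
               for c in cell_delay}
    pred = {c: None for c in cell_delay}
    queue = deque(c for c in cell_delay if indegree[c] == 0)

    while queue:
        driver = queue.popleft()
        for sink in fanout.get(driver, []):
            candidate = arrival[driver] + net_delay[(driver, sink)] + cell_delay[sink]
            if candidate > arrival[sink]:
                arrival[sink], pred[sink] = candidate, driver
            indegree[sink] -= 1
            if indegree[sink] == 0:
                # All incoming edges seen: the sink's delay is now final.
                queue.append(sink)

    # Examine the outputs (cells with no fanout) and trace back from the worst.
    outputs = [c for c in cell_delay if not fanout.get(c)]
    end = max(outputs, key=lambda c: arrival[c])
    path, node = [], end
    while node is not None:
        path.append(node)
        node = pred[node]
    return arrival[end], path[::-1]
```

Each edge is traversed once, which is consistent with the linear cost noted above when fanout is bounded.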
The path construction algorithms are summarized in Figure 11. Special care must be taken for sequential circuits. We assume that the circuit can be represented as a Moore model finite state machine as shown in Figure 13. To transform such circuits when constructing delay paths, each latch element output must be treated as a normal primary input and, likewise, each latch element input is treated as a primary output as shown in Figure 14. There are now four types of delays that appear: PI to PO, PI to latch, latch to PO, and latch to latch. Each of these delays must be accounted for separately, as the minimization objective may only be one or more of these specific delays.
[Algorithm listing: CONSTRUCT-ALL-PATHS]
Timing Driven Placement
Our timing driven placement algorithm is similar to the algorithm used in TimberWolfSC 7.0 [45]. The cost function from TimberWolfSC 6.0 (Eq. (1)) has been modified as shown in Eq. (14). We have added a cost term that is used to minimize the longest path delay.
We now describe how to efficiently keep track of longest path delays during placement. In the context of simulated annealing, every move can potentially affect the longest path delay. Using the algorithm in Section 4.1.2, we can determine the longest path, and whenever an attempted move perturbs a net on the critical path, the resulting change in path delay is easy to calculate. Keeping track of only one critical path can lead to problems, because moves that do not affect the pre-determined critical path may create new critical paths. It is not practical to recalculate the longest path for each move attempt, so it is necessary to monitor several possible paths.
One solution is, as in previous work [44, 51], to have the designer provide a set of paths or critical nets that the placement algorithm would use in path delay minimization. However, in large designs, it is very difficult for a user to identify these critical paths beforehand. Complicated interconnect delays make this task more difficult than ever.
Instead, as was done by Swartz and Sechen [45, 46], our algorithm identifies these critical paths for the user.
However, as the number of elements in a design increases, the number of possible paths increases exponentially. For very large circuits, keeping track of all paths becomes intractable. Therefore, we identify only the longest paths between all pairs of inputs and outputs. For sequential circuits, this includes all latch inputs and outputs as well. This reduces the number of paths considerably. In addition, only paths that have delays within 10% of the longest path delay are kept. With these restrictions, only paths (2,5,6) and (3,5,6) from Figure 12 will be kept. As more moves are accepted, our list of longest paths gradually becomes out of date, so it is necessary to recalculate the longest paths periodically. We have found that it is sufficient to perform this calculation once every set number of accepted moves. Note that when a move is proposed, the entire path need not be traced from source to sink to determine the change.
Instead, each net keeps a list of the paths that it is part of and applies its delay change to all of these paths. In Figure 15, cell 3 has been moved, causing three nets to change. This move affects three paths: (3,5,7), (3,7), and (3,8). Instead of tracing the effect of the move all the way to the outputs on all paths, since each affected net has a link to its paths, we can apply the change directly to each path. We then process the list of paths to determine the new longest path. In this case, the longest path will not change because of this move. The intermediate arrival times will of course be inaccurate, but that is acceptable, since our only concern during annealing is the change in the longest path delay.
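The net-to-path links described above can be sketched as follows. This is a minimal illustration with assumptions stated up front: the class and function names are hypothetical, path delays are summed over net delays only (cell delays are omitted for brevity), and the path names and delay values are invented rather than taken from Figure 12 or Figure 15.

```python
def near_critical(path_delay, fraction=0.10):
    """Keep only paths whose delay is within `fraction` of the longest."""
    longest = max(path_delay.values())
    return {p: d for p, d in path_delay.items()
            if d >= (1 - fraction) * longest}


class PathTracker:
    """Tracks monitored path delays; each net links to the paths it is on."""

    def __init__(self, paths, net_delay):
        # paths: {path id: list of net ids}; net_delay: {net id: delay}
        self.net_delay = dict(net_delay)
        self.path_delay = {p: sum(net_delay[n] for n in nets)
                           for p, nets in paths.items()}
        self.paths_of_net = {}
        for p, nets in paths.items():
            for n in nets:
                self.paths_of_net.setdefault(n, []).append(p)

    def apply(self, net, new_delay):
        """Apply a net's delay change to every path through that net,
        without re-tracing any path from source to sink."""
        delta = new_delay - self.net_delay[net]
        self.net_delay[net] = new_delay
        for p in self.paths_of_net.get(net, []):
            self.path_delay[p] += delta
        return delta

    def longest(self):
        return max(self.path_delay.items(), key=lambda item: item[1])


paths = {"A": ["n1", "n2"], "B": ["n1", "n3"], "C": ["n4"]}
tracker = PathTracker(paths, {"n1": 2.0, "n2": 3.0, "n3": 1.0, "n4": 4.6})
kept = near_critical(tracker.path_delay)
tracker.apply("n1", 1.0)  # a move shortens net n1
print(sorted(kept), tracker.longest())
```

The cost of evaluating a move is proportional to the number of monitored paths through the perturbed nets, not to the length of those paths.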
Parallel Timing Driven Placement
Providing timing driven placement can add significant overhead to the normal run time of cell placement. In this section we describe an algorithm for parallelization of timing driven placement. The algorithm is based on the approach described in Section 3.
[Algorithm listing: PATH BASED TIMING DRIVEN PLACEMENT]
Parallel path delay analysis and path construction
As with the serial algorithm, there are two phases to the delay analysis: delay calculation and path construction.
Calculating delays is done as before, by processing the cells in levelized order. For each level, each processor calculates the cell delays for all cells at that level. Then for each net in the fanout of the cell, the net delay is calculated and any sink pins that have changed their arrival times are marked. All processors synchronize at each level by distributing the arrival times for any marked pins. The algorithm is summarized in Figure 17 .
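The level-synchronous scheme can be illustrated with a small single-process simulation. This sketch makes several assumptions: the function name and data layout are hypothetical, each simulated processor keeps its own copy of the arrival times, and the exchange of marked pins is modeled by a plain loop standing in for whatever collective communication a message-passing implementation would use. The example circuit and delays are invented.

```python
def parallel_arrival_times(levels, owner, fanout, cell_delay, net_delay,
                           num_procs):
    """Simulate per-level delay calculation with a synchronization step.

    levels: list of cell lists, level 0 first; owner: {cell: processor}.
    Returns one arrival-time dictionary per simulated processor.
    """
    local = [{c: 0.0 for c in owner} for _ in range(num_procs)]
    for cells in levels:
        marked = [dict() for _ in range(num_procs)]
        for c in cells:
            p = owner[c]  # only the owning processor evaluates cell c
            out_time = local[p][c] + cell_delay[c]
            for s in fanout[c]:
                t = out_time + net_delay[(c, s)]
                if t > max(local[p][s], marked[p].get(s, 0.0)):
                    marked[p][s] = t  # sink pin whose arrival time changed
        # Synchronization point: every processor receives all marked pins.
        for p in range(num_procs):
            for updates in marked:
                for s, t in updates.items():
                    local[p][s] = max(local[p][s], t)
    return local


# Invented example: five cells split across two simulated processors.
fanout = {1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
levels = [[1, 2], [3, 4], [5]]
owner = {1: 0, 2: 1, 3: 0, 4: 1, 5: 0}
cell_delay = {c: 1.0 for c in fanout}
net_delay = {(1, 3): 1.0, (2, 3): 2.0, (2, 4): 1.0, (3, 5): 1.0, (4, 5): 3.0}
copies = parallel_arrival_times(levels, owner, fanout, cell_delay,
                                net_delay, 2)
print(copies[0])
```

Because all fanins of a cell lie at lower levels, every processor sees final arrival times for its cells by the time their level is processed, and only pins that actually changed need to be communicated.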
Likewise, path construction must be done with synchronization points at each level. The algorithm is summarized in Figures 18 and 19. As we proceed back from the primary outputs, each processor identifies path segments using the CONSTRUCT-PATH algorithm described in Figure 
Parallel placement algorithm
The algorithm for parallel timing driven placement is summarized in Figure 20. In structure, it is very similar to the algorithm for mpiPLACE shown in Figure 7. The only major modification is the inserted call to perform delay analysis as described above. One minor change is also applied to the repartitioning algorithm. In the non-timing driven placement algorithm, each net is assigned to the partition containing the most cells attached to that net. Because of the parallel delay analysis approach, this heuristic is no longer appropriate. Instead, we assign each net to the partition containing its source pin. This approach limits us to circuits with only a single source pin per net. This is not a severe limitation, as it is easy to transform a multi-source net into a single-source net by inserting intermediary buffers.
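The difference between the two net assignment heuristics can be shown with a short sketch. The function names and the convention that the first pin of a net is its source are assumptions made for illustration; the net and partition data are invented.

```python
from collections import Counter


def assign_by_majority(net_cells, cell_partition):
    """Non-timing heuristic: the partition holding most of the net's cells."""
    counts = Counter(cell_partition[c] for c in net_cells)
    return counts.most_common(1)[0][0]


def assign_by_source(net_cells, cell_partition):
    """Timing driven heuristic: the partition holding the source pin.

    Assumption: net_cells[0] is the cell driving the net.
    """
    return cell_partition[net_cells[0]]


net = ["c1", "c2", "c3"]  # c1 drives the net
cell_partition = {"c1": 0, "c2": 1, "c3": 1}
print(assign_by_majority(net, cell_partition))
print(assign_by_source(net, cell_partition))
```

With source-based assignment, the processor that computes a cell's output time also owns the nets it drives, so net delays can be calculated locally before marked pins are exchanged.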
Experimental Results
Speedup and quality
We built mpiPLACE using the MPICH [8] 
Error control
We also compared the effect of 8 B on the final quality and speedups. Figure 23 shows the results for primary1 on an eight-processor SparcServer 1000. It can be seen that both provides the best tradeoff between performance and quality.
As a measure of the effectiveness of all the error control mechanisms, we turned off all these mechanisms and compared the results. These are shown in Table 3 . It can be seen that the error control methods that we used contributed greatly to reducing the error, at a cost of some speedup.
Using mpiPLACE, we performed two more experiments to further show the usefulness of the algorithm in situations where the circuit is too large to fit in the memory of a single machine. The traditional approach has been to take these circuits, partition them, place each partition separately, and then merge the separate placements. As an example, Figure 24 shows a large circuit that has been divided into eight smaller circuits. Each of these individual circuits can be placed independently and then finally combined into the larger placement. Using a similar procedure, we took the largest circuit available to us, avq.large, and partitioned it into eight subcircuits using the Fiduccia-Mattheyses method. Each partition was placed individually using TimberWolfSC and then merged back together. The results are shown in Table 4 along with comparisons with TimberWolfSC and mpiPLACE. Note that mpiPLACE has much better quality as well as a speedup in run time compared with a strictly partitioned approach. Bear in mind, also, that the times reported for the partitioned TimberWolfSC approach include only the placement time and not the time for partitioning or merging the circuit.
In addition, we wanted to find how effective mpiPLACE is when run on a large number of processors. Again using Table 5 .
Note that the quality suffers severely, so it is clear that such an approach is not appropriate for a large number of processors unless the circuit is sufficiently large. Current circuit sizes are not large enough to take advantage of such a large number of processors, but it is anticipated that future circuits will be.
Timing driven placement
In this section, we present results for the timing driven version of mpiPLACE. We first compare the sequential results with TimberWolfSC 6.0 (Table 8). Our experiments use four of the MCNC benchmarks that include timing information, with technology parameters from the MOSIS 2.0 design rules, as shown in Tables 6 and 7. The area and delay numbers are taken after the circuit has been globally routed, but the wirelength and execution times are for the placement procedure only. Note that since the global router is not timing driven, the delay does vary somewhat from that predicted by the placement process. The results show an average improvement of 12% in the longest path delay at the cost of about a 5% increase in area. By using a more accurate delay model, we are able to select critical paths that may not be apparent in a less detailed model. The accurate delay model also allows us to be more confident in the final longest path timing characteristics.
The parallel timing driven placement algorithm is called mpiPLACE-TIME and experimental results are presented for a Sun SparcServer 1000E as well as the Intel Paragon in Tables 9 and 10 . As with the original mpiPLACE, we get reasonable speedups with moderate wirelength degradation. There is little degradation of the delay as well.
Conclusions
In this paper, we described mpiPLACE, a circuit partitioned approach to parallel timing driven cell placement. We have introduced a new timing driven algorithm that uses a detailed Elmore delay model. Using sophisticated error control mechanisms to improve solution quality, the parallel algorithm is able to achieve reasonable speedups with moderate degradation in quality. Though it does not provide perfect speedups, the primary advantage of partitioning the circuit is that it helps memory scalability. We are able to run circuits that are too large to fit on one processor by distributing them across the nodes of a multiprocessor. We have expanded on this work to also address circuit partitioned parallel algorithms for global routing [50].
