[...] In fact, we report precisely such a parallel placement algorithm in this paper, which is organized as follows. In Section 2 we first review the previous parallel approaches developed for standard cell placement. In Section 3, we very briefly touch on the key aspects of the serial placement algorithm, derived from simulated annealing, from which we began the development of our new parallel algorithm. Our new parallel placement algorithm is the subject of Section 4. We analyze the performance of the algorithm in Section 5. Finally, in Section 6 we present our results.
Previous Work
Among the first reported parallel algorithms for standard cell placement were those by Rutenbar and Kravitz [14] [20]. They proposed two shared memory multiprocessor simulated annealing algorithms. One was based on move decomposition and the other on parallel moves. A move could be either the displacement of a single cell or the exchange of two cells. In the first approach, a move was decomposed into several subtasks and parallelism was exploited in these subtasks. This scheme was only able to exploit limited parallelism. In addition, the synchronization of these subtasks had to be performed very carefully. In [14], a speedup of 2 was reported by using 3 processors. The speedup increased only slightly with the introduction of additional processors.
In their parallel moves approach, a serializable subset is defined as a subset of all non-interacting moves that can be carried out serially in any order. Thus, all moves in this subset can be carried out in parallel and the outcome would be the same as if the moves had been executed and evaluated serially. An advantage of this approach is that the convergence behavior of simulated annealing is the same as that for the serial version [16]. However, it turns out that finding serializable sets is extremely difficult. In fact, the authors resorted to the simplest form of this approach, which is based on attempting and evaluating a group of moves in parallel, then actually performing the single move that is accepted first (if any) and aborting all the other attempts. As noted in [3], this approach unfortunately introduces a bias in favor of moves that require less computation time. It turned out that this method showed a linear speedup with the number of processors for the lowest temperatures in the annealing schedule, where few moves are accepted, but performed poorly at higher temperatures.
Rose, et al. [18] [19] proposed a method called heuristic spanning to replace the high temperature regime in simulated annealing. Parallelism was obtained by executing a mincut algorithm with different starting partitions on different processors. After collecting the results generated by the various processors, the one with the lowest cost was chosen to enter the low temperature annealing phase. This phase was accelerated by mapping different regions of the chip onto different processors. Every processor performed moves in parallel and then broadcast the new cell locations to the other processors. To avoid idle processors, asynchronous moves were exploited at the cost of introducing error in the wire length calculations, since processors often had different views of the current position of some portion of the cells. The processor utilization efficiency was estimated to be 70%.
In [4] and [30] [...] [21]. Due to the hypercube structure, only processors with a particular address pattern are allowed to perform cell interchanges. This considerably restricts cell movement. To avoid error accumulation [21], sets of non-interacting cells are identified and only cells in the same set are allowed to be exchanged. This further impeded cell movement. This algorithm performed and evaluated moves in parallel on all processors. After each move, the new cell position was broadcast to all processors. The global synchronization scheme was rather expensive. In [21], the expected speedup was 11-21 using 64 processors.
Of all the parallel implementations of simulated annealing, those which perform moves in parallel attract the most attention because they have the most potential for appreciable speedups. Parallelism can be highly exploited by performing moves on all of the processors at the same time. All of the previous methods required frequent application of some form of global synchronization. They broadcast new information either after each move or after fewer than 10 moves. This rather high communication cost restricted their application to special parallel architectures that can provide high communication throughput in the form of shared memory, a hypercube interconnection network, or other dedicated hardware.
There has been a small amount of previous work on developing standard cell placement implementations for loosely coupled parallel processing environments, in particular those consisting of networks of low-cost workstations. Mohan, et al. [17] presented a placement approach based on the genetic algorithm. Also, Kling, et al. [12] presented a method based on simulated evolution. However, neither of these approaches was shown to be comparable in terms of placement quality with the state-of-the-art serial implementation of simulated annealing. Finally, Banerjee, et al. have developed a parallel version of simulated annealing for this loosely coupled environment [9] [10]. They based their approach, which they called ProperPLACE, on TimberWolfSC version 6.0. Unfortunately, the results produced by ProperPLACE diverged badly from those of TimberWolfSC 6.0 as circuit sizes increased. In addition, the speedup as a function of the number of processors was far from linear.
As a consequence of the limitations of the previous approaches to parallel placement, particularly parallel implementations of simulated annealing, none of these methods has ever been used in industry. However, by no means does this imply that faster placement isn't badly needed. Although the state-of-the-art serial version of simulated annealing for standard cells (TimberWolfSC 7.0 [26] [27]) was reported to be dramatically faster than the previous version (6.0), in fact seven to ten times faster for larger circuits, the computation time on a state-of-the-art workstation can still approach 24 hours for a 200,000-cell gate array. No other serial algorithm has been reported which can yield comparable results in less computation time.
It would therefore be of considerable interest if there existed a coarse-grained, parallel standard cell placement algorithm which ran on a standard network of low-cost workstations and which yielded results at least equivalent to TimberWolfSC version 7.0, and in addition, offered nearly linear speedup with the number of processors used. Furthermore, the new parallel method would have to produce results at least as good as those ever reported for the widely adopted set of benchmark circuits available from MCNC.
Serial Version of Simulated Annealing for Standard Cell Placement
We felt that we could develop the most effective loosely coupled parallel algorithm for standard cell placement by basing it on the most effective serial algorithm available, namely TimberWolfSC 7.0. In this section we review the key aspects of this algorithm.
In [26] [27] a new approach to simulated annealing and a new hierarchical algorithm for row-based placement were described. This implementation obtained the best results ever reported for the set of MCNC benchmark circuits. Chip area reductions up to 15% were achieved compared with TimberWolfSC v6.0. In addition, chip area reductions up to 21% were achieved while consuming up to 7. [...]
  1.  randomly select cell a
  2.  randomly select row r and location x in r
  3.  /* x in r is within the range limiter window span for a [24] */
  4.  if (adding a to r doesn't exceed length limit for r) then
  5.      [...]
  6.  else
  7.      [...]
  8.      [...]
  9.      [...]
  10.     [...]
  11. else
  12.     [...]
Figure 1 shows the algorithm used by TimberWolfSC 7.0 for generating new placement configurations. First, cell a is randomly selected. A single cell move is attempted if the target row's length limit is not exceeded. Otherwise, a cell b which covers the target location is noted and an interchange of a and b is attempted if no row length limits are violated. The length limit is set to be the smaller of one percent of the average row length or one average standard cell width. If a limit was violated, the new state generator begins anew. If a single cell or interchange move is feasible, then the change in cost is computed. The probability of accepting the new configuration is one if $\Delta C \le 0$; otherwise, it is equal to $e^{-\Delta C / T}$, where T is the temperature. If the new configuration is accepted, the cells in the affected rows are shifted to avoid any cell overlapping. Every new configuration thus generated is legal and physically feasible.
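The move generation and acceptance rule described above can be sketched roughly as follows. This is a minimal illustration, not the TimberWolfSC 7.0 code: the data layout (`cells`, `rows`), the function names, and the treatment of the length limit as a single maximum row length `row_limit` are all assumptions made for the example, and the range limiter window is omitted.

```python
import math
import random


def accept(delta_c, temperature, rng=random):
    """Accept with probability 1 if delta_c <= 0, else exp(-delta_c / T)."""
    if delta_c <= 0:
        return True
    return rng.random() < math.exp(-delta_c / temperature)


def propose_move(cells, rows, row_limit, rng=random):
    """Propose a single-cell move or an interchange (hypothetical layout).

    cells: dict mapping cell name -> (row index, x position, width)
    rows:  list holding the current total cell width of each row
    Returns ("single", a, r, x), ("swap", a, b), or None when a row length
    limit would be violated and the generator has to start over.
    """
    a = rng.choice(sorted(cells))          # randomly select cell a
    r = rng.randrange(len(rows))           # randomly select row r
    x = rng.uniform(0.0, row_limit)        # and a location x in r
    row_a, _, width_a = cells[a]
    # Length of row r after adding a (unchanged if a already sits in r).
    new_len = rows[r] + (0.0 if row_a == r else width_a)
    if new_len <= row_limit:
        return ("single", a, r, x)
    # Otherwise look for a cell b in row r that covers location x.
    for b, (row_b, x_b, width_b) in cells.items():
        if b != a and row_b == r and x_b <= x < x_b + width_b:
            fits_r = rows[r] - width_b + width_a <= row_limit
            fits_a = rows[row_a] - width_a + width_b <= row_limit
            if fits_r and fits_a:
                return ("swap", a, b)
    return None
```

A caller would compute the change in cost for the proposed move and pass it to `accept` together with the current temperature; a `None` result corresponds to the case where the new state generator begins anew.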
The incremental cost function used by TimberWolfSC 7.0 is given in (1). $\Delta W$ is the change in the net lengths for those nets connected to the cell (or two cells) participating in the single cell (or exchange) move. $\Delta W_s$ represents the change in the net lengths for those nets connected to the cells in the affected rows which must be [...] overlapping. When a new cell is [...] this row generally need to be [...] cell. An effective and efficient [...] cells are given in [26] [27]. [...] which has both x and y components, to generally be several [...] larger than for an intra-row move, which has only an x component.
Figure 3 Annealing Schedule in Hierarchical Mode
As stated above, the first level netlist is placed in the higher temperature regime. This first stage comprises the first 50% of the total annealing schedule as shown in figure 3. The second placement stage starts at 50% and ends at 70% of the total annealing schedule as shown in figure 3. The constituent cells belonging to a cluster are randomly placed within the confines of the location of the bounding rectangle of the cluster as determined in the previous stage. The third stage starts at 70% of the total annealing schedule. The restarting temperature T of the second (and third) placement stage is given by (2), where A W is the average net length and U is the target acceptance rate.
The total number of moves (the x axis in figure 3) is divided into 150 iterations. Each iteration consists of a number of moves equal to the total number divided by 150. The total number of moves, m, is given by (3), where n is the number of cells in the circuit.
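As a small worked illustration of this bookkeeping, the sketch below maps an iteration number onto the three placement stages and computes the moves per iteration. The function names are hypothetical, the total move count is taken as an input because equation (3) is not reproduced here, and assigning the boundary iterations (75 and 105) to the earlier stage is an assumption.

```python
ITERATIONS = 150  # the annealing schedule is divided into 150 iterations


def moves_per_iteration(total_moves):
    """Each iteration performs the total number of moves divided by 150."""
    return total_moves // ITERATIONS


def stage_of_iteration(k):
    """Return the placement stage (1, 2 or 3) for iteration k in 1..150.

    Stage 1 covers the first 50% of the schedule, stage 2 runs from 50%
    to 70%, and stage 3 covers the remainder. Integer arithmetic is used
    so that the boundaries are exact.
    """
    if not 1 <= k <= ITERATIONS:
        raise ValueError("iteration out of range")
    if 10 * k <= 5 * ITERATIONS:
        return 1
    if 10 * k <= 7 * ITERATIONS:
        return 2
    return 3
```

With these assumptions, iterations 1 to 75 fall in the first stage, 76 to 105 in the second, and 106 to 150 in the third.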
The New Parallel Placement Algorithm
Introduction
Given that our parallel processing hardware consists of a network of workstations, our objective was to achieve near linear speedup with the number of processors (workstations). There are two main obstacles: 1) how to absolutely minimize interprocessor communication, since the local area network is quite slow, and 2) given that moves will be generated and evaluated in parallel, how to execute (generate and evaluate) these moves effectively in the presence of erroneous information on the location of some portion of the cells.
Communication between the processors can be reduced to zero by dividing the chip into n non-overlapping regions and assigning each of the n processors to a unique region. Each processor then optimizes the locations of the cells which were initially placed in its region. There are several critical problems with this approach. First, it is virtually impossible to initially assign all of the cells to their proper regions. Partitioning algorithms which have been proposed to date have not been shown to be effective for this problem. Second, given that some cells will therefore not be in their optimal regions, there is no way for the cells to move between regions and thereby improve their region assignment. Third, when optimizing the placement of the cells within its region, the processor has almost no idea of where the cells are in other regions. Given that there can be more than 10,000 cells in a region and that there can be a similar number of inter-region nets connecting these cells, this will be a very serious source of error when computing total wire lengths (the basic cost function in annealing-based placement algorithms).
In summary, a parallel implementation based on no interprocessor communication can be characterized as having virtually linear speedup with the number of processors, but will not yield state-of-the-art results because of the confinement of the cells to the original regions and because of the uncertainty in the positions of the cells outside of a given processor's region. In our new parallel algorithm, we retain the main advantage of the no-interprocessor-communication technique (virtually linear speedup) while avoiding its key disadvantages: 1) by permitting a small (almost negligible) amount of interprocessor communication and 2) by using a new dynamic region generation scheme.
Interprocessor Communication
In our approach, each processor maintains its own copy of the complete data structures for the entire placement problem. At the start of each iteration, the entire chip area is divided into a union of non-overlapping regions. Each processor is assigned to a unique region and this region will be termed the active region for that processor. All other regions are termed inactive with respect to that processor. A processor will only move those cells that are initially (i.e. at the start of the iteration) in its active region. It will never move a cell outside of its active region. Furthermore, a processor assumes that the positions of the cells in its inactive regions remain as they were at the beginning of the iteration. In the next section we will address the impact of this error.
Interprocessor communication only takes place at the end of each iteration (that is, 150 times during a placement run). Each processor broadcasts the positions of only those cells in its active region which have changed since the end of the previous iteration. Each processor also receives new position information for only those cells in its inactive regions which have moved since the last iteration. 
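The end-of-iteration exchange can be sketched as follows. This is a single-process simulation written only for illustration: the names `changed_cells` and `synchronize` and the dictionary-based data layout are assumptions, and a real implementation would send the updates over the network rather than merge them in memory.

```python
def changed_cells(start_positions, end_positions, active_cells):
    """Positions of the active cells that moved during the iteration."""
    return {
        cell: end_positions[cell]
        for cell in active_cells
        if end_positions[cell] != start_positions[cell]
    }


def synchronize(views, active_sets, start_positions):
    """Merge the per-processor updates into every processor's copy.

    views:           one complete position dict per processor
    active_sets:     the cells each processor was allowed to move
    start_positions: positions shared by all processors at iteration start
    Returns the merged positions and the number of cell updates sent.
    """
    updates = [
        changed_cells(start_positions, view, active)
        for view, active in zip(views, active_sets)
    ]
    merged = dict(start_positions)
    for update in updates:
        merged.update(update)  # active sets are disjoint, so no conflicts
    for view in views:
        view.clear()
        view.update(merged)
    return merged, sum(len(update) for update in updates)
```

Because each processor sends only the cells that actually moved, the message size shrinks as the annealing proceeds and fewer moves are accepted.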
Dynamic Region Generation
During the course of an iteration, the cells in a processor's active region are not permitted to move outside that region. Therefore, badly misplaced cells (cells assigned to non-optimal regions) can seriously degrade the quality of the final placement. We found that how the chip area is divided into regions has a profound effect on the placement quality.
We discovered that the most effective approach is to dynamically generate the regions. In addition, we found that in order to get good placement results, it is very important that each processor be capable of generating and evaluating both long distance and short distance moves. We were able to satisfy these criteria with the following scheme. At the beginning of each odd numbered iteration, the chip area is divided into equal-size regions which consist of n vertical slices, as illustrated on the left-hand side of figure 4. At the start of each even numbered iteration, the chip area is divided into non-overlapping equal-size regions which now consist of horizontal slices, as shown on the right-hand side of figure 4. In either case, region i is the active region for processor i.
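The alternating slicing scheme can be illustrated with the short sketch below. It assumes a rectangular chip with its origin at the lower left, zero-based processor indices, and iterations counted from 1; the function names are hypothetical.

```python
def active_region(iteration, processor, n_processors, chip_width, chip_height):
    """Return (x_min, y_min, x_max, y_max) of a processor's active region.

    Odd iterations use n vertical slices, even iterations use n
    horizontal slices.
    """
    if iteration % 2 == 1:
        width = chip_width / n_processors
        return (processor * width, 0.0, (processor + 1) * width, chip_height)
    height = chip_height / n_processors
    return (0.0, processor * height, chip_width, (processor + 1) * height)


def owner(iteration, x, y, n_processors, chip_width, chip_height):
    """Index of the processor whose active region contains point (x, y)."""
    if iteration % 2 == 1:
        index = int(x / (chip_width / n_processors))
    else:
        index = int(y / (chip_height / n_processors))
    return min(index, n_processors - 1)  # clamp points on the far edge
```

Because the slicing direction alternates, a cell near the right edge of the chip is owned by the last processor in an odd iteration but may be owned by a different processor in the next even iteration, which is what lets cells migrate across the chip.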
Note that this scheme permits a cell to move from its current position to anywhere else on the chip in at most two iterations. This is illustrated in figure 5, where cell A is currently in the upper left corner of the chip but should move to the lower right corner to maximally reduce the total wire length. During the current iteration (left-hand side of figure 5) A can move to the bottom of its region and then during the next iteration (right-hand side of figure 5) it can move to the extreme right of its region.
  [...]
  for each processor in its active region do
      generate a move
      determine whether to accept/reject   /* see figure 1 for details */
      periodically adjust the temperature
          /* for each processor, force the actual acceptance */
          /* rate to track the target rate shown in figure 3 */
  until the iteration is complete
      /* an iteration consists of a fixed no. of moves */
      /* an iteration is terminated for all processors as */
      /* soon as one processor completes the iteration */
  each processor broadcasts its cell position changes
  reduce range limiter window dimensions
  set temperature to average temperature over all regions
Figure 6 shows the overall pseudo code for our new parallel standard cell placement algorithm. Line 14 implies that each processor manipulates its own value of the temperature so that its actual acceptance rate follows the target rate shown in figure 3. The temperature is updated about every ten moves; it can be raised or lowered as necessary. Since each processor maintains its own temperature, the temperature will be a little different on the various processors. Because the regions change aspect ratios, each processor will generally get a largely different set of cells in its active region from one iteration to the next. Therefore at line 23 we set the initial temperature value for the next iteration equal to the average at the end of the previous iteration.
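The per-processor temperature control and the averaging step can be sketched as below. The text states only that the temperature is raised or lowered so that the actual acceptance rate tracks the target rate; the multiplicative update and the `gain` parameter used here are assumptions chosen for illustration, not the authors' rule.

```python
def adjust_temperature(temperature, accepted, attempted, target_rate, gain=0.05):
    """Nudge the temperature toward the target acceptance rate.

    Hypothetical rule: raise T by a fixed fraction when too few moves are
    accepted, lower it when too many are accepted.
    """
    actual_rate = accepted / attempted
    if actual_rate < target_rate:
        return temperature * (1.0 + gain)
    if actual_rate > target_rate:
        return temperature * (1.0 - gain)
    return temperature


def next_iteration_temperature(processor_temperatures):
    """Starting temperature for the next iteration: the average over all
    processors at the end of the previous iteration."""
    return sum(processor_temperatures) / len(processor_temperatures)
```

In this sketch `adjust_temperature` would be called roughly every ten moves on each processor, and `next_iteration_temperature` once per iteration after the position broadcast.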
The range limiter [23] [28] in line 22 is used to generate moves more [...] early iterations, the window dimensions are very [...] cells to move any distance across the chip. As the [...] the window size is reduced so that we preferentially [...] that have a non-negligible chance of being [...] (4) [...] an x will be generated. The window size r is initially [...] dimension of the entire chip and it is then decreased [...] with respect to the iteration number). It reaches its [...] prevented by the parallel algorithm. If this happens, the cell mobility will be affected. On the other hand, [...] r than the size of the region, we contend that the cell [...] not be affected.
Figure 8 Move Distance in y Direction
As mentioned earlier, r is initially set to the chip dimensions and it is exponentially decreased (as a function of iteration) down to a minimum size of about two average cell widths just over half way through the annealing schedule. Two average cell widths is a very small distance, relative to the chip dimension, given the size of standard cell circuits today, which include up to 100,000 cells.
Since r is exponentially decreased, this implies that w > r for most of the iterations. This provides a rich opportunity to explore parallelism. 
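To make this concrete, the sketch below models an exponentially shrinking window and counts the iterations in which a region is wider than the window. Several things here are assumptions: the exact decay formula, the iteration `k_min` at which the minimum is reached (taken as 80, i.e. just over half of 150), and the reading of w as the width of one of n equal slices.

```python
def window_size(k, chip_dim, min_size, k_min=80):
    """Range limiter window at iteration k (1-based).

    Decays geometrically from chip_dim at k = 1 to min_size at k = k_min,
    then stays at min_size. The functional form is an assumed one.
    """
    if k >= k_min:
        return min_size
    ratio = min_size / chip_dim
    return chip_dim * ratio ** ((k - 1) / (k_min - 1))


def iterations_with_region_larger(chip_dim, min_size, n_processors,
                                  iterations=150, k_min=80):
    """Count iterations for which the slice width w exceeds the window r."""
    w = chip_dim / n_processors
    return sum(
        1
        for k in range(1, iterations + 1)
        if w > window_size(k, chip_dim, min_size, k_min)
    )
```

Under these assumptions the window drops below the slice width early in the schedule, so the condition w > r holds for the large majority of the 150 iterations.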
Error Analysis
In the parallel algorithm, each processor evaluates cell moves within its active region assuming that the cells in the other regions do not change during the course of the iteration. This leads to error when a processor computes the change in wire length due to a move in its region. The errors are not eliminated until the global synchronization step of line 21 in figure 6, where each processor broadcasts (to all other processors) the positions of cells that have changed since the last update (iteration). Figure 11 shows a scenario a processor might face in our parallel algorithm. The shaded area is the active region assigned to processor i. The processor can only move cells within its active region. Three nets are shown: net a connecting 4 cells, net b connecting 2 cells, and net c connecting 3 cells. We classify all of the nets into one of two categories: internal nets or crossover nets. A net whose span is entirely within one region is defined as an internal net; otherwise, it is a crossover net. In figure 11, nets a and b are internal nets while net c is a crossover net.
Figure 11 Internal and Crossover Nets
For all internal nets, a processor either knows exactly the span of the net, or it does not care about the net. In either case, no error will result. For example, net a in figure 11 is an internal net. All of its four cells are in the active region of processor i. This processor knows the exact locations of these four cells at all times because only it is allowed to move them. Therefore there will never be any error when processor i computes the change in length of net a. On the other hand, processor i does not care about net b since this net has no connections to cells in region i. Thus processor i will never be called upon to evaluate the change in b's length and therefore no error will ever result.
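The internal/crossover distinction can be expressed in a few lines. The sketch assumes the vertical-slice partition of an odd iteration, so only the x coordinates of a net's pins matter; the function names and the pin representation are hypothetical.

```python
def region_index(x, n_regions, chip_width):
    """Index of the vertical slice containing coordinate x."""
    return min(int(x / (chip_width / n_regions)), n_regions - 1)


def classify_net(pin_xs, n_regions, chip_width):
    """Return "internal" if all pins lie in one region, else "crossover"."""
    regions = {region_index(x, n_regions, chip_width) for x in pin_xs}
    return "internal" if len(regions) == 1 else "crossover"
```

For example, on a chip of width 100 split into four slices, a net with pins at x = 1, 5, 9 and 20 is internal, while a net with pins at x = 10, 30 and 40 spans two slices and is a crossover net.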
Crossover nets, however, may lead to errors when a processor is computing the changes in wire length resulting from a cell move in its region. For example, net c in figure 11 is a crossover net. Cell p on net c is not in the active region of processor i and hence this processor does not know the exact location of this cell. Processor i assumes that cell p remains at the position it had at the beginning of the iteration, even if it was moved upward by some other processor as indicated in figure 11. This source of inaccurate cell information causes error when processor i evaluates its moves. It is straightforward to experimentally ascertain the magnitude of this error. At the end of iteration k, just before each of the processors broadcasts the updated cell positions to all of the other processors, we note Ck, a processor's view of the total length of all of the crossover nets, and from this we subtract ck, the (exact) value after the broadcast. The percentage error Ek is then determined by dividing by Wk, the overall total wire length (i.e. the sum of the lengths of the internal and crossover nets after the broadcast), as shown in (5).
$$E_k = \frac{C_k - c_k}{W_k} \times 100\% \qquad (5)$$
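The error measure just described can be computed as in the sketch below. The half-perimeter bounding box is used as the net length purely as an assumption for the example, and the function names and data layout are hypothetical.

```python
def half_perimeter(pins):
    """Half-perimeter of the bounding box of a list of (x, y) pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def percentage_error(crossover_nets, internal_nets, view, exact):
    """Percentage error E_k for one processor at the end of an iteration.

    crossover_nets, internal_nets: lists of nets, each a list of cell names
    view:  the processor's (possibly stale) cell positions before broadcast
    exact: the true cell positions after the broadcast
    """
    c_view = sum(half_perimeter([view[c] for c in net]) for net in crossover_nets)
    c_exact = sum(half_perimeter([exact[c] for c in net]) for net in crossover_nets)
    w_internal = sum(half_perimeter([exact[c] for c in net]) for net in internal_nets)
    w_total = c_exact + w_internal
    return 100.0 * (c_view - c_exact) / w_total
```

A negative value means the processor underestimated the crossover net length, which happens when cells it could not see moved away from its region during the iteration.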
For the MCNC benchmark circuit Primary1, in figure 12 we plot Ek as a function of k for two different runs: one using two processors and the other using four processors. In each case, the total number of moves was approximately one million. Two observations are apparent. First, Ek for four processors has greater absolute value than for two processors. This is not surprising since dividing the chip area into four regions as opposed to two is certain to yield more crossover nets. Second, in the very beginning when virtually all moves are accepted, it is equally likely that a processor will overestimate as underestimate a crossover net's length. Thus the net percentage error is around zero. Since the number of long nets is large, leading to many crossover nets, the percentage error increases as the acceptance rate falls. It reaches its maximum value when the acceptance rate falls to around 50 percent (or about iteration 30). From then on the error decreases toward zero as the annealing schedule proceeds.
This decreasing trend is due to several factors: 1) Net lengths continually get smaller and therefore fewer and fewer crossover nets will exist as the iterations go by. 2) As the range limiter window dimensions decrease, the distance that cells move during an iteration continually decreases implying that a processor's view of the positions of cells in its inactive regions will be closer to reality as the iterations go by.
In any case, the percentage error at any iteration is under about two percent. Even this maximum error occurs only when the acceptance of uphill moves is prevalent. When the temperature gets low enough that primarily only downhill moves are accepted, the percentage error is virtually zero. Hence we would expect the parallel algorithm to yield results comparable to those of the serial algorithm.
Figure 13 shows Ek versus k for circuit Biomed using 4 processors. Figures 14 and 15 show Ek versus k for circuits Biomed and Avqsmall, respectively, using 6 processors. The results consistently show that there is some error in the first half of the annealing process and that the error decreases toward zero in the second half. The maximum absolute value of Ek is less than 2%.
Tables 5, 6, and 7 show the run time speedup gained by our parallel algorithm. All reported times are elapsed times (including CPU times and interprocessor communication times) in seconds on DEC Alpha 3000/400 workstations. The results show that our [...]
Figure 16 plots the speedup achieved by our parallel algorithm. Although our target parallel environment (a network of workstations) provides only limited communication bandwidth, our parallel algorithm still achieves near linear speedup, especially for larger circuits. It is exactly those large problems for which one would want to have a parallel implementation available.
Table 8 shows the processor utilization in our parallel environment, obtained by dividing the CPU time by the elapsed time. The utilization is on average 98% when 2 processors are used and 87% when 6 processors are used.
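The two performance measures used above reduce to simple ratios, sketched here. The function names are hypothetical and the numbers in the usage note are illustrative placeholders, not values from the tables.

```python
def speedup(serial_elapsed, parallel_elapsed):
    """Ratio of serial elapsed time to parallel elapsed time."""
    return serial_elapsed / parallel_elapsed


def efficiency(serial_elapsed, parallel_elapsed, n_processors):
    """Speedup divided by processor count; 1.0 means linear speedup."""
    return speedup(serial_elapsed, parallel_elapsed) / n_processors


def utilization(cpu_time, elapsed_time):
    """Fraction of elapsed time a processor spent computing."""
    return cpu_time / elapsed_time
```

For instance, a hypothetical run that takes 1200 s serially and 300 s on 4 processors has a speedup of 4 and an efficiency of 1, and a processor that accumulates 261 s of CPU time over 300 s of elapsed time has a utilization of 87%.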
Processor Utilization
Conclusion
We presented a loosely coupled parallel algorithm for the placement of standard cell integrated circuits. Our algorithm is a derivative of simulated annealing. The implementation of our algorithm is targeted toward networks of UNIX workstations. This is the very first reported parallel algorithm for standard cell placement which yields as good or better placement results than its serial version. In addition, it is the first parallel placement algorithm reported which offers nearly linear speedup, in terms of the number of processors (workstations) used, over the serial version. Despite using the rather slow local area network as the only means of interprocessor communication, the processor utilization is quite high, up to 98% for 2 processors and 90% for 6 processors. The new parallel algorithm has yielded the best overall results ever reported for the set of MCNC standard cell benchmark circuits.
