Simulated annealing, a methodology for solving combinatorial optimization problems, is computationally very expensive, and numerous researchers have therefore undertaken efforts to parallelize it. In this paper, we investigate three of these parallel simulated annealing strategies when applied to standard cell placement, specifically the TimberWolfSC placement tool. We have examined a parallel moves strategy, as well as two approaches that are new to parallel cell placement: multiple Markov chains and speculative computation. These algorithms have been implemented in ProperPLACE, our parallel cell placement application, as part of the ProperCAD II project. We have constructed ProperPLACE so that it is portable across a wide range of parallel architectures. Our parallel moves algorithm uses novel approaches to dynamic message sizing, message prioritization, and error control. We show that parallel moves and multiple Markov chains are effective approaches to parallel simulated annealing when applied to TimberWolfSC, whereas speculative computation is wholly inadequate.
Introduction
Simulated annealing [1] has long been acknowledged as a powerful combinatorial optimization tool, but its drawback has always been its appetite for computational resources. In light of this, several researchers have investigated parallel implementations of simulated annealing. We are interested in applying these parallel simulated annealing algorithms to cell placement, particularly TimberWolfSC [2], one of the more popular standard cell placement tools that use simulated annealing. Of the several generalized algorithms proposed for parallelizing simulated annealing, only a few have been applied to cell placement. Parallel moves has been the most popular strategy [3, 4, 5, 6], and in this paper we present a new implementation of this approach. We also investigate two parallel simulated annealing algorithms that have not previously been used for cell placement, namely, multiple Markov chains [7, 8] and speculative computation [9].
The three parallel simulated annealing algorithms have been implemented in ProperPLACE, our standard cell placement tool, as part of the ProperCAD II project. By building around an existing sequential placement algorithm, TimberWolfSC, we keep the overheads of parallelization low and allow the parallel algorithm to benefit automatically from future improvements in the uniprocessor algorithm. As an indication of this effort, we have been able to reuse nearly 95% of the sequential source code from TimberWolfSC 6.0. Another important feature of the ProperPLACE application is its portability: the code can run on a wide variety of MIMD architectures with no change to its source code. In the first of the parallel algorithms we investigate, parallel moves, we address a significant problem with current approaches to parallel placement. In all the asynchronous schemes proposed to date, maintaining the accuracy of local databases has been the major hurdle. This inaccuracy in the database arises because moves accepted by one processor are not relayed to the other processors in time. To reduce this inaccuracy, we propose a prioritized message passing technique along with a dynamic message sizing control mechanism.
For all three parallel strategies, we present results on several MCNC and ISCAS benchmark circuits for both shared memory machines (Sun 4/690MP and Sun SparcServer 1000) and distributed memory environments (Intel Paragon and a cluster of Sun 4 workstations). The various strategies have differing impacts on both solution quality and runtime, and these are examined in further detail.
The rest of this paper is organized as follows. In the following section, we present a brief overview of the ProperCAD II project. The next section introduces the placement problem and offers a synopsis of the TimberWolfSC algorithm. Section 4 reviews some of the algorithms proposed for parallel simulated annealing as well as some of the previous work done in parallel placement. The next three sections describe the three parallel simulated annealing algorithms that we are evaluating in this paper. In each of these sections, we present a detailed analysis of the algorithm as well as experimental results.
An overview of the ProperCAD II project
The rapid improvement in VLSI technology over recent years has made circuit design an extremely complex process and in turn has placed increasing demands on CAD tools. Parallel processing is fast becoming an attractive solution to reduce the inordinate amount of time spent in VLSI circuit design. This has been recognized by several researchers in VLSI CAD, as is evident in recent literature on cell placement [6, 3, 10], floor planning [11], circuit extraction [12, 13], and test generation and fault simulation [14, 15, 16], among other areas.
However, much of the work in parallel CAD reported to date suffers from a major limitation. The parallel algorithms proposed for these applications are designed with a specific underlying architecture in mind. As a result, these programs perform poorly on architectures other than the one for which they were designed. Even more importantly, incompatibilities in programming environments also make it difficult to port these programs across different parallel architectures.
This limitation has serious consequences, since a parallel algorithm needs to be developed afresh for every target MIMD architecture. This is compounded by the length of the software development cycle, which, for parallel applications, is considerably longer than for sequential programs. Consequently, parallel programs are significantly costlier to develop than sequential programs.
One of the primary concerns of the ProperCAD project [17] is to address this portability problem by designing algorithms to run on a range of parallel machines including shared memory multiprocessors, distributed memory multicomputers, and networks of workstations. The ProperCAD approach to the design of parallel CAD algorithms is illustrated in Figure 1. A parallel algorithm is designed around an existing uniprocessor algorithm by identifying modules in the uniprocessor code and designing a well-defined interface between the parallel and sequential code.
The project has undergone two distinct phases. The first phase, ProperCAD I, involved the use of the C-based Charm language and runtime system [18, 19]. As part of ProperCAD I, a suite of parallel applications was developed that address the most significant tasks in VLSI design automation, including circuit extraction [20], test generation [21], and logic synthesis [22]. An earlier version of our placement tool, ProperPLACE, was also developed using ProperCAD I [23]. The second phase, ProperCAD II [24, 25], entailed the creation of a C++ library which provided an object-oriented parallel interface based on the actor model of concurrent object-oriented computing. This library-based approach, as opposed to Charm's language-based system, offered us several advantages that are described in greater detail in [24].
The library contains two distinct interfaces, the actor interface (AIF) and the abstract parallel architecture (APA).
The AIF provides the mechanisms necessary for parallel execution in the ProperCAD II environment. Concurrency is achieved through the use of a fundamental object called an actor [26]. An actor object consists of a thread of control that communicates with other actors by sending messages, and all actor actions are in response to these messages. Specific actor methods are invoked to process each type of message, and actors are not allowed to block or explicitly make receive requests from other processors. The runtime system on each processor picks the next available actor thread according to its priority, and that thread is then allowed to run to completion without interruption. In addition, concurrent abstractions known as aggregates [27] are available in ProperCAD II to support a multi-access interface to groups of actors.
The implementation of the actor interface is defined in terms of the APA, a low-level interface and implementation that can be used to describe and utilize the resources needed by any parallel application across a variety of architectures. The APA is used by the AIF but may also be used by applications directly. The AIF has been carefully integrated with the APA so that common types of architectural tuning and incremental parallelization can be expressed in a systematic way, reducing the need for ad hoc combinations of AIF and APA references.
Applications created using ProperCAD II include test generation [28], fault simulation [29], logic synthesis [30], VHDL simulation [31], and placement. The last of these is the focus of this paper.
The placement problem
The VLSI cell placement problem involves placing a set of cells on a VLSI layout, given a netlist which provides the connectivity between each cell and a library containing layout information for each type of cell. This layout information includes the width and height of the cell, the location of each pin, the presence of equivalent pins, and the possible presence of feed through paths within the cell. The primary goal of cell placement is to determine the best location of each cell so as to minimize the total area of the layout and the length of the nets connecting the cells together. With standard cell design, the layout is organized into equal height rows, and the desired placement should have equal length rows.
TimberWolfSC
One of the more popular sequential applications for placement has been the TimberWolfSC set of cell placement tools [2, 32]. TimberWolfSC's core algorithm, simulated annealing, is a suitable approach to problems like VLSI cell placement since such problems lack good heuristic algorithms. Briefly, simulated annealing is an iterative improvement strategy that starts with a system in a disordered state and, through perturbations of the state, brings the system gradually to a low energy, and thus optimal, state. A significant characteristic of simulated annealing is that, unlike greedy algorithms, perturbations that increase the energy of the system are sometimes accepted with a probability related to the temperature of the system [1, 33]. In the context of cell placement, and TimberWolfSC in particular, perturbations are simply moves of the cells to different locations on the layout, and the energy is an approximated layout cost function, consisting of three parts:
Estimate of wirelength of all nets as the half perimeter of the bounding box.
Penalty for area overlap between cells in the same row.
Penalty for difference between actual and desired row length.
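As a rough illustration of the three cost components listed above, the following Python sketch evaluates them for a toy placement. The data layout, the function names, the linear penalty forms, and the weights `w_overlap` and `w_row` are assumptions made for this example; they are not taken from TimberWolfSC.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (x, y) location of a pin
Cell = Tuple[float, float]    # (left edge, width) of a cell within one row


def half_perimeter_wirelength(nets: Dict[str, List[Point]]) -> float:
    """Sum over all nets of the half perimeter of the pins' bounding box."""
    total = 0.0
    for pins in nets.values():
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def row_overlap(cells: List[Cell]) -> float:
    """Total pairwise horizontal overlap between the cells of one row."""
    total = 0.0
    for i, (left_a, width_a) in enumerate(cells):
        for left_b, width_b in cells[i + 1:]:
            shared = min(left_a + width_a, left_b + width_b) - max(left_a, left_b)
            total += max(0.0, shared)
    return total


def row_length_penalty(cells: List[Cell], desired_length: float) -> float:
    """Absolute difference between the summed cell widths and the desired length."""
    return abs(sum(width for _, width in cells) - desired_length)


def placement_cost(nets: Dict[str, List[Point]], rows: List[List[Cell]],
                   desired_length: float,
                   w_overlap: float = 1.0, w_row: float = 1.0) -> float:
    """Weighted sum of the three components; the weights are hypothetical."""
    overlap = sum(row_overlap(cells) for cells in rows)
    length = sum(row_length_penalty(cells, desired_length) for cells in rows)
    return half_perimeter_wirelength(nets) + w_overlap * overlap + w_row * length
```

The quadratic pairwise overlap loop is chosen for readability only; the row bins mentioned for TimberWolfSC 6.0 exist precisely to avoid this kind of exhaustive comparison.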
Moves are generated by choosing a random cell and then displacing it to a random location on the layout. If a cell is already present at the new location, the two cells are exchanged. A temperature dependent range limiter is used to limit the distance over which a cell can move. Initially, the span of the range limiter is set such that a cell can move anywhere on the layout. Subsequently, the span is decreased logarithmically with temperature. These range limiter updates are made at the end of each of the 160 iterations into which TimberWolfSC segments the simulated annealing procedure. As the algorithm progresses, the temperature is gradually decreased by forcing the acceptance rate to follow a theoretically derived schedule that attempts to keep the acceptance rate close to 44% during the middle region of annealing [34] . TimberWolfSC 6.0 also uses row bins to aid in the computation of overlap and row penalties, and early rejection methods are used to speed up the decision process [35] .
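The acceptance decision at the heart of this procedure can be sketched as follows. This is the standard Metropolis criterion written in Python; the function name and signature are assumptions for illustration, and neither the range limiter nor the TimberWolfSC temperature schedule is modeled.

```python
import math
import random


def accept_move(delta_cost: float, temperature: float, rng: random.Random) -> bool:
    """Accept improving moves always, and worsening moves with probability e^(-dC/T)."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)
```

At a fixed temperature, a move that raises the cost by `T * ln 2` is accepted about half of the time, which is why lowering the temperature gradually makes the search increasingly greedy.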
Parallel annealing strategies
Because of the inherent computational costs associated with simulated annealing, several methods have been proposed for the parallelization of the procedure. We briefly describe four major approaches that researchers have proposed to apply parallelism to simulated annealing.
C. Multiple Markov chains [36, 7]. This algorithm is particularly promising since it has the potential to use parallelism to increase the quality of the solution.
Generalized
D. Speculative computation. Speculative computation attempts to predict the execution behavior of the simulated annealing schedule by speculatively executing future moves on parallel nodes. The speedup is limited to the inverse of the acceptance rate, but it does have the advantage of retaining the exact execution profile of the sequential algorithm, and thus the convergence characteristics are maintained.
Parallel algorithms for placement
Many researchers have investigated the parallelization of placement algorithms, but only methods A and B have been used. Kravitz and Rutenbar [3] tried approaches A and B on a shared memory multiprocessor and obtained a speedup of 2 on 4 processors for the first approach and 3.5 on 4 processors in the second approach. Banerjee, Jones and Sargent [4] implemented a parallel placement algorithm using the parallel move approach on an iPSC/2 hypercube multiprocessor and proposed several geographical partitioning strategies for the problem specific to the hypercube topology. Speedups of 12 on 16 processors were reported. Using approach B, Casotto et al. [37] worked on speeding up simulated annealing for the placement of macrocells, and achieved speedups of 6 using 8 processors on a shared memory multiprocessor.
Rose et al. [5] proposed a parallel algorithm on an experimental distributed memory multiprocessor. In that algorithm, they replaced the high temperature portion of the parallel simulated annealing placer with a min-cut based algorithm and used a parallel moves strategy for the lower temperatures. Speedups of 4 on 5 processors were reported. Jayaraman and Rutenbar [11] proposed for the Intel iPSC hypercube multiprocessor a parallel floor-planning algorithm that uses parallel moves along with periodic synchronization to control the error. Both Casotto and Sangiovanni-Vincentelli [38] and Wong and Fiebrich [39] have presented results on parallel moves implementations of simulated annealing placement on the Connection Machine. Sun and Sechen have shown results achieving near linear speedup on a network of workstations using a parallel moves approach [40] .
This and all other previous work on parallel placement algorithms shares several drawbacks. First, each algorithm is proposed for a specific parallel architecture. Second, most of these parallel algorithms were developed from scratch by rewriting the sequential algorithm; hence, their performance on a single processor was not as good as that of the best sequential algorithms. Finally, the error in the cost function evaluation due to parallel move evaluation was not controlled, leading to significant quality degradation as more processors are added.
In this paper, we will investigate three of the simulated annealing algorithms presented above: parallel moves, multiple Markov chains, and speculative computation. We decided not to evaluate move acceleration since the potential speedup reported by [3] was not significant, and also it is only practical on shared memory architectures. Our implementation of parallel moves is particularly notable as it is the first portable implementation of such a strategy. This and the other two methods have been implemented as part of our parallel cell placement tool, ProperPLACE, and they are described further in the following sections.
ProperPLACE-PM Parallel moves
ProperPLACE-PM exploits parallelism by using parallel moves and allowing errors in the cost function. The application begins with a random input placement that is replicated on each available physical processor. Using the aggregate feature of the ProperCAD II library, an aggregate named Circuit is constructed to manage access to the circuit structure and maintain a coherent state of the current placement. Each processor will have one representative of the aggregate responsible for its local copy of the circuit. In addition, an Anneal actor is created per physical processor to perform the annealing steps, i.e., move, evaluate, and decide. Figure 2 shows the relationship between the aggregate and its dependent actors.
After the creation of the actors, the placement is divided up topographically by rows, with each row and its cells assigned to a separate Anneal actor. For example, the placement in Figure 3 has 4 rows and is replicated on each of 4 processors. Each actor is responsible for one row as shown in Figure 3, and thus is only allowed to attempt moves on cells in that row. If a cell is moved to a region owned by another actor, the ownership of the cell is transferred to the new actor and the original actor is no longer responsible for moving that cell. Because an entire row, not a sub part, is owned by an actor, there will be no error in the calculation of cell overlaps and row lengths (second and third cost function components in Section 3.1) during the simultaneous evaluation of multiple moves. Note that this approach assumes that the number of rows is greater than or equal to the number of processors. If not, the rows must be split into a number of subrows, in which case some overlap penalties may be calculated erroneously. Though this row-based partitioning is rather naive, it is sufficient since the early high temperature regions of the simulated annealing algorithm will cause the cells to be randomly spread across the circuit.
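The row-based partitioning and the transfer of cell ownership described above can be sketched in Python as follows. The class name `RowPartition`, its methods, and the choice of contiguous row blocks when there are more rows than actors are assumptions of this sketch, not details of ProperPLACE-PM.

```python
from typing import Dict, List


class RowPartition:
    """Tracks which actor owns each row and, through the row, each cell."""

    def __init__(self, num_rows: int, num_actors: int) -> None:
        if num_rows < num_actors:
            raise ValueError("row-based partitioning needs at least one row per actor")
        # Contiguous blocks of rows per actor (an assumption of this sketch).
        self.row_owner: List[int] = [r * num_actors // num_rows for r in range(num_rows)]
        self.cell_row: Dict[str, int] = {}

    def place(self, cell: str, row: int) -> None:
        """Record the row in which a cell currently sits."""
        self.cell_row[cell] = row

    def owner(self, cell: str) -> int:
        """The actor allowed to propose moves for this cell."""
        return self.row_owner[self.cell_row[cell]]

    def move(self, cell: str, new_row: int) -> bool:
        """Move a cell; return True if ownership passed to another actor."""
        old_owner = self.owner(cell)
        self.cell_row[cell] = new_row
        return self.owner(cell) != old_owner
```

With 4 rows and 4 actors, as in the example of Figure 3, each actor owns exactly one row, and a cell that lands in another row is owned by that row's actor from then on.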
After partitioning, each Anneal actor can proceed with the annealing algorithm outlined in Figure 4. A valid cell is selected for perturbation, and then a displacement or exchange is performed on that cell. As detailed below, there are two sub-classes of moves for both displacement and exchange, or four move types in total. The move type is determined by the intended location of the selected cell A.
M1. Intra-actor Cell Displacement. Cell A moves to a new location owned by the same actor.
M2. Intra-actor Cell Exchange. Two cells A and B owned by the same actor exchange their locations.
M3. Inter-actor Cell Displacement. Cell A moves to a new location owned by a different actor.
M4. Inter-actor Cell Exchange. Two cells A and B owned by different actors are exchanged.
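The four move types follow from two questions: whether the target location belongs to the proposing actor, and whether a cell already sits there. A minimal Python sketch of that classification is given below; the enum and function names are hypothetical and chosen for this example only.

```python
from enum import Enum


class MoveType(Enum):
    M1 = "intra-actor cell displacement"
    M2 = "intra-actor cell exchange"
    M3 = "inter-actor cell displacement"
    M4 = "inter-actor cell exchange"


def classify_move(proposing_actor: int, target_owner: int,
                  target_occupied: bool) -> MoveType:
    """Classify a proposed move of cell A by its intended location."""
    same_actor = proposing_actor == target_owner
    if same_actor:
        return MoveType.M2 if target_occupied else MoveType.M1
    return MoveType.M4 if target_occupied else MoveType.M3
```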
An example of each type of move is shown in Figure 5. In the figure, assume that each row is owned by a different actor.
If a move is accepted, then the accepting actor must send the move to the Circuit aggregate so a consistent cell position database can be maintained. In order to amortize the startup cost of sending a message, position update messages are held until a number of moves have been accepted. Although this reduces the total number of Update messages sent among processors, there is a drawback in this approach. As the frequency of Update messages is reduced, the local cell position database on each Circuit representative becomes increasingly inaccurate, thereby causing the cost function calculation error to increase as well. This error, if too large, may prevent the algorithm from converging to an optimal solution. Previous researchers [4, 11, 38] have shown that simulated annealing is tolerant to some error in cost function calculations. In ProperPLACE-PM, the frequency of sending position Update messages is determined adaptively such that the error in the cost function is kept small at all times. This will be discussed in detail in Section 5.2.
Since actor methods are non-blocking, the actor's annealing process must give up control every so often to allow the aggregate to gain computation time to perform the updates. Therefore, a limit, M, is placed on the number of moves that may be performed in succession without interruption. The Circuit aggregate can then process any waiting Update messages to keep the local database up-to-date. Anneal will have rescheduled itself by sending itself a Continue message, which enables control to come back to the Anneal actor so that the next set of moves can be proposed and evaluated.
After broadcasting its set of moves through the aggregate, an Anneal actor does not wait idly until all the Update messages sent by other actors have been processed, but goes ahead with the next sequence of simulated annealing moves. The advantage of this asynchronous approach is illustrated in Figure 6. Figure 6b shows a synchronous approach to parallelization in which actors finishing a block of moves must wait for slower actors to finish. The time to evaluate different moves is not the same, leading to some actors remaining idle (shown by the dark rectangles), and thus reducing the speedup. In the asynchronous approach of Figure 6a, actors become idle only at the end of the entire simulated annealing procedure. The overall idle time is therefore reduced, leading to greater speedup than the synchronous method. However, synchronization does offer an advantage in that the error in the cost function calculation becomes zero at each synchronization barrier, thereby making error control much easier. In the asynchronous approach an effective error control scheme (which is discussed in Section 5.2) is necessary to bound the accumulated error in the system.
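The idle-time argument can be made concrete with a small calculation. The Python sketch below compares total run time with and without a barrier after every block of moves; the function names and the example timings are invented for illustration, and communication costs and cost function error are deliberately ignored.

```python
from typing import List


def synchronous_makespan(block_times: List[List[float]]) -> float:
    """Barrier after every block: each block lasts as long as its slowest actor."""
    num_blocks = len(block_times[0])
    return sum(max(actor[b] for actor in block_times) for b in range(num_blocks))


def asynchronous_makespan(block_times: List[List[float]]) -> float:
    """No barriers: the run ends when the most heavily loaded actor finishes."""
    return max(sum(actor) for actor in block_times)


def total_idle_time(block_times: List[List[float]], makespan: float) -> float:
    """Time, summed over actors, during which an actor has no work to do."""
    return sum(makespan - sum(actor) for actor in block_times)


# block_times[a][b] is the time actor a needs for its b-th block of moves.
example = [[1.0, 3.0, 1.0],
           [3.0, 1.0, 3.0]]
```

For `example`, the synchronous schedule takes 9 time units with 6 units of idle time, while the asynchronous schedule takes 7 units with 2 units of idle time.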
Prioritized messages
The ProperCAD II library provides prioritized messages whereby the programmer can influence the order in which messages in the work pool are picked for processing by assigning priorities to them. Prioritized execution is instrumental in delivering the performance presented in Section 5.4 for our placement algorithm. In this section, we describe the use of priorities both to reduce the runtime and to improve the quality of solution. The reduction in runtime is achieved by reducing the time taken by the inter-actor exchange move (M4), which is considerably longer than that of all the other move types. We give the highest priority to the AskPermission and ReturnAnswer messages. This reduces, first, the probability that cell B, involved in the inter-actor exchange move, is moved to another location before the AskPermission message is received by the owner of cell B and, second, the time that cells remain frozen, i.e., prevented from moving.
Messages of type Update are given the next highest priority. Finally, messages of type Continue are given the lowest priority. By giving a higher priority to Update than Continue, all Update messages received by a processor are picked up from the message queue before the Continue message. Therefore, the most up-to-date cell location information is always used at the beginning of each block of M moves.
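The effect of this priority ordering can be sketched with an ordinary priority queue. The Python class below is a stand-in written for this example: `WorkPool`, `send`, and `next_message` are hypothetical names and do not reproduce the ProperCAD II interface. Only the three priority levels come from the text.

```python
import heapq
import itertools
from typing import Any, List, Tuple

# Lower number means the message is picked up earlier.
PRIORITY = {"AskPermission": 0, "ReturnAnswer": 0, "Update": 1, "Continue": 2}


class WorkPool:
    """Per-processor pool that hands out pending messages by priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, Any]] = []
        self._arrival = itertools.count()  # keeps equal priorities in arrival order

    def send(self, kind: str, payload: Any = None) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._arrival), kind, payload))

    def next_message(self) -> Tuple[str, Any]:
        _, _, kind, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self) -> int:
        return len(self._heap)
```

Because Continue sits at the lowest level, every Update already in the pool is processed before the Anneal actor resumes its next block of moves.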
Dynamic message sizing and error control
To reduce the communication overhead, Update messages are broadcast periodically after accumulating a number of accepted moves, not after each move. The problem is then to determine the frequency of this Update message, or equivalently, the number of attempted moves between updates ($M$). We are concerned with the number of accepted moves between updates, in other words $|U|$, the size of the Update message. If $|U|$ is too large, moves accepted by one actor do not appear in another processor's database in time. Consequently, the error in the cost function calculation increases and degrades the solution quality. Such a cumulative error in the cost function has also been recognized as a severe problem by several researchers [4, 6, 10]. On the other hand, if the message size is too small, the number of messages sent increases, resulting in large communication overheads.
A naive approach to determining $|U|$ is to use a statically preset value. The problem with this static approach is that there is no good way to determine an optimal $|U|$ a priori. In a dynamic approach, $|U|$ is determined during the annealing process by monitoring the size of the error in the cost function in the system. If the size of the error during the annealing process becomes too large, $|U|$ will be reduced to decrease the error at the cost of an increased number of messages. Likewise, if the error becomes very small, $|U|$ is increased to reduce the number of messages. As long as the size of the error is bounded, this dynamic approach will produce a solution of quality equivalent to that of a sequential simulated annealing algorithm.
The error in the cost function is defined to be the difference between the real change in cost from the initial to final configurations, and the estimated change in cost equal to the sum of locally perceived changes in cost at each processor.
If $C_i$ is the exact initial cost, $\Delta C_j$ is the change in cost computed locally at processor $j$ of the $n$ processors, and $C_f$ is the exact cost of the new configuration after a series of moves, then
$$C_f = C_i + \sum_{j=1}^{n} \Delta C_j + E_{all}$$
where $E_{all}$ is the total accumulated error. Because no synchronization takes place to exchange cell position information, $C_f$ is available only at the end of the annealing process, at which time each processor has an identical copy of the entire circuit. Consequently, $E_{all}$ cannot be obtained during the placement process. Therefore, in an asynchronous approach, one can only approximate what $E_{all}$ will be during the placement process. In our algorithm, instead of estimating $E_{all}$, we estimate $E$, the error that each processor contributes by moving its own cells. The advantage of obtaining the error this way is that it can be calculated by each processor independently without any synchronization.
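A small worked example may help make the definition of the total accumulated error concrete. In the Python sketch below, two processors each move one end of a single two-pin net on a line, and each evaluates its move against a stale position for the other end. The one-dimensional net, the coordinates, and the function names are all invented for this illustration.

```python
from typing import List


def net_length(x_a: float, x_b: float) -> float:
    """Length of a two-pin net on a line."""
    return abs(x_a - x_b)


def accumulated_error(c_initial: float, c_final: float,
                      local_deltas: List[float]) -> float:
    """Real change in cost minus the sum of locally perceived changes."""
    return (c_final - c_initial) - sum(local_deltas)


# Initial positions known to both processors.
a_old, b_old = 0.0, 10.0
c_initial = net_length(a_old, b_old)

# Processor 1 moves cell a; processor 2 moves cell b. Neither has seen
# the other's move, so each evaluates against the old position.
a_new, b_new = 4.0, 2.0
delta_1 = net_length(a_new, b_old) - c_initial
delta_2 = net_length(a_old, b_new) - c_initial

c_final = net_length(a_new, b_new)
error = accumulated_error(c_initial, c_final, [delta_1, delta_2])
```

Here the processors believe they have saved 4 and 8 units of wirelength, 12 in total, while the real saving is only 8, so the accumulated error is 4.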
In Figure 7 [...] $(1 - P(j))$,
where $P(j)$ is the probability that cell $j$ is moved during the time between accepting move $i$ and sending the next Update message. This error, accumulated over $M$ moves, is used to control $|U|$, the Update message size. In ProperPLACE-PM, we put a bound on the error in the cost function as originally reported in [4]. The probability of accepting a move is:
$$P = e^{-\Delta C / T}\,\mathrm{Prob}(\Delta C > 0) + \mathrm{Prob}(\Delta C < 0)$$
In the presence of error, the composite acceptance rate changes slightly; however, the probability of generating good or bad moves is invariant with respect to error:
$$P_E = e^{-(\Delta C \pm E)/T}\,\mathrm{Prob}(\Delta C > 0) + \mathrm{Prob}(\Delta C < 0)$$
To bound the acceptance rate with error, $P_E$, to within 5% of normal, i.e.,
$$\frac{|P - P_E|}{P} \le 0.05,$$
we find a bound on the magnitude of the error:
$$|E| \le T \ln(1.05) \approx \frac{T}{21}$$
We decrease the message size by $\frac{\text{current size}}{2}\,(1 - e^{-E/k})$ whenever the computed error $E$ is higher than $T/21$. The variable $k$ is fixed at $0.0687\,T$ in order to ensure that at the boundary ($E = T/21$), the message size is decreased by $\frac{\text{current size}}{4}$. If the error is very low, then the message size is increased similarly.
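The decrease rule can be checked numerically with the short Python sketch below. The threshold T/21, the constant 0.0687 T, and the reduction formula come from the text; the function names, the floating-point message size, and the lower limit `minimum` are assumptions of this sketch. The corresponding increase rule is not specified in enough detail to reproduce, so it is left out.

```python
import math


def error_bound(temperature: float) -> float:
    """Error threshold T/21, approximating T * ln(1.05)."""
    return temperature / 21.0


def size_reduction(current_size: float, error: float, temperature: float) -> float:
    """Amount by which the Update message size shrinks: (size / 2) * (1 - e^(-E/k))."""
    k = 0.0687 * temperature
    return (current_size / 2.0) * (1.0 - math.exp(-error / k))


def next_message_size(current_size: float, error: float, temperature: float,
                      minimum: float = 1.0) -> float:
    """Shrink the message only when the error exceeds the bound."""
    if error <= error_bound(temperature):
        return current_size
    reduced = current_size - size_reduction(current_size, error, temperature)
    return max(minimum, reduced)  # the lower limit is hypothetical
```

With k = 0.0687 T, the ratio E/k at E = T/21 is very close to ln 2, so the exponential term is one half and the reduction at the boundary is one quarter of the current size, as stated above.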
Load balancing by inter-actor move suppression
In order to maintain the same total number of moves, the number of moves attempted in the inner iteration by each actor was reduced as shown below:
$$M_{par} = \frac{M_{uni}}{\text{number of processors}}$$
where $M_{uni}$ is the number of moves attempted in TimberWolfSC 6.0.
Since the time to evaluate each move differs, some Anneal actors may perform annealing much faster than others.
Also, the number of cells owned by each actor may vary considerably because cells are allowed to move to other actors.
To maintain approximately the same number of cells owned by each actor, we need to rebalance the cells among actors.
In our algorithm, we achieve this balance in the number of cells by varying the type of moves proposed and accepted.
For example, when the number of cells falls below two thirds of the original assignment, we cut down the number of inter-actor moves. Because this reduces the probability that a cell moves out of this actor's region, the number of cells moving into the region becomes greater than the number moving away. Therefore, the balance is maintained by this simple technique. Similarly, if an actor owns many more cells than the average, more inter-actor moves are proposed to increase the probability of cells moving out. This change in move types does not affect the placement solution.
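One way to express this rule is as an adjustment to the probability of proposing an inter-actor move, as in the Python sketch below. Only the two-thirds threshold comes from the text; the base probability, the scaling factor, the four-thirds upper threshold, and the function name are assumptions chosen to make the example concrete.

```python
def inter_actor_move_probability(owned: int, original: int, average: float,
                                 base: float = 0.4, factor: float = 2.0) -> float:
    """Probability that the next proposed move targets another actor's region."""
    if owned < (2.0 / 3.0) * original:
        # Too few cells: suppress moves that send cells away.
        return base / factor
    if owned > (4.0 / 3.0) * average:
        # Far more cells than average (threshold is hypothetical): push cells out.
        return min(1.0, base * factor)
    return base
```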
Experimental Evaluation
In this section, we describe results obtained by using ProperPLACE-PM on various circuits in the ISCAS and MCNC benchmark suites as well as one industry circuit (see Table 1). Whereas much of the previous work in parallel placement has used relatively small circuits for evaluation, we use reasonably complex circuits that are accepted in the placement community. As noted earlier, the need for parallel processing is only apparent for larger circuits.
Results with prioritized messages and dynamic message sizing
In Table 2, we compare the quality of placement obtained by TimberWolfSC 6.0 with that of ProperPLACE-PM in a uniprocessor (Sun 4/690MP) environment. Because the decision process for ProperPLACE-PM is identical to that of TimberWolfSC, there is no loss in placement quality (the differences are due to the stochastic behavior of simulated annealing). The last columns compare the runtimes for TimberWolfSC and for ProperPLACE-PM, and again the times are comparable. Most previous work on parallel placement has had to reimplement the sequential algorithm in a simplified way, and as a result the performance of those algorithms on a single processor was much inferior to that of the best sequential algorithm available.
We ran ProperPLACE-PM on a Sun SparcServer 1000, an Intel Paragon mesh, and a cluster of SparcStation 5 workstations. We would like to emphasize that the ProperPLACE-PM placement program ran unchanged on all these machines. It should be noted that all previous parallel algorithms for placement were targeted to specific parallel architectures. Table 3 [...] From the tables it is clear that ProperPLACE-PM produces acceptable speedups while maintaining quality comparable to that of TimberWolfSC 6.0. Please note that the times presented are all wall clock times, not CPU times, and thus are subject to interference from other loads on the system, particularly for the Sun machines, since they are available for general use within the group. Also, we were unable to run the largest circuits on the Paragon because of memory limitations. We are aware of limitations in the ProperCAD II library with respect to workstation cluster environments that prevent us from running large circuits, and we are in the process of investigating these issues.
Effect of synchronizing barriers on quality and speedup
The results we have just presented are for an asynchronous parallel moves algorithm with no synchronization barriers.
In this section, we demonstrate the effect of synchronization barriers on the speedup and the quality of the placement produced.
In this experiment, we force the annealing actors to synchronize after every 40 moves (see Figure 6b). As discussed earlier, such synchronization barriers cause actors to wait for the slowest actor to finish its task, but they may provide better control of the error. Table 6 [...]
ProperPLACE-MMC Multiple Markov chains
The ProperPLACE-PM algorithm we investigated above has been shown to be effective at producing cell placement solutions with speedups at moderate losses of quality. However, there are situations where no loss in quality can be afforded, and the following two sections present parallel algorithms intended to address that problem.
ProperPLACE-MMC uses the concept of multiple Markov chains, first presented as parallel clustered statistical cooling by Aarts et al. [36, 41]. It was further refined by Lee and Lee, who introduced an asynchronous approach to this methodology and, in particular, applied the algorithm to the graph partitioning problem for shared memory multiprocessors [7, 42]. The algorithm can be understood if the sequential simulated annealing procedure is considered as a search path where moves are proposed and either accepted or rejected depending on particular cost evaluations and also a starting random seed. The search path is essentially a Markov chain, and parallelization is accomplished by initiating different chains (using different seeds) on each processor. Each chain then explores the entire search space by independently performing the annealing perturbation, evaluation, and decision steps. After each processor has completed the annealing schedule, the solutions are compared and the best is selected. Rose et al. used a similar approach with the min-cut algorithm in the high-temperature region of the simulated annealing schedule [5]. Note that this differs from parallel moves in that each chain is allowed to perform moves on the entire set of cells and not just a subset.
Of course, there is no speedup in this approach, since each processor individually performs the same amount of work as the sequential algorithm. To achieve speedup, we must reduce the number of moves evaluated in each chain to $1/N$ of the sequential total, where $N$ is the number of processors. Since the number of moves determines the run time of the program, this reduction yields a speedup of $N$. Obviously, such a reduction alone is not appropriate, since the quality will likely decrease accordingly. To take advantage of the fact that multiple processors are being used, some means of interaction between the various chains is necessary.
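The independent-chain scheme described above, before any interaction between chains is added, can be sketched as follows. This is a minimal illustration and not the ProperPLACE implementation: the toy cost function `wirelength`, the swap move, the cooling parameters `temp` and `alpha`, and the use of the chain index as the seed are all assumptions made for the example.

```python
import math
import random

def wirelength(order, nets):
    """Toy cost (assumed): sum of |pos(a) - pos(b)| over two-pin nets."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

def anneal_chain(order, nets, moves, seed, temp=5.0, alpha=0.999):
    """One Markov chain: propose a swap, evaluate it, then accept or reject."""
    rng = random.Random(seed)              # a different seed gives a different chain
    order = list(order)                    # work on a private copy of the state
    cost = wirelength(order, nets)
    for _ in range(moves):
        i, j = rng.sample(range(len(order)), 2)       # perturbation
        order[i], order[j] = order[j], order[i]
        new_cost = wirelength(order, nets)            # evaluation
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):  # decision
            cost = new_cost
        else:
            order[i], order[j] = order[j], order[i]   # undo the rejected swap
        temp *= alpha
    return cost, order

def independent_mmc(order, nets, total_moves, n_chains):
    """Run n_chains chains of total_moves // n_chains moves each; keep the best."""
    per_chain = total_moves // n_chains
    results = [anneal_chain(order, nets, per_chain, seed)
               for seed in range(n_chains)]
    return min(results, key=lambda r: r[0])
```

The chains run one after another here for simplicity; in a parallel setting each call to `anneal_chain` would run on its own processor. Because each chain sees only a fraction of the moves, this version alone is expected to lose quality, which is what motivates the exchange schemes that follow.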
Synchronous multiple Markov chains
One possible interaction scheme, called synchronous MMC with periodic exchange by Lee and Lee, is to compare solutions at fixed intervals. This method allows each Markov chain to update its local database with the best solution and then continue. The exchange point marks the end of a segment of computation and behaves as a barrier synchronization point. In the algorithm proposed by Aarts et al., the exchange point occurs after every move. At the barrier, various application-specific metrics can be used to determine the best solution. When applied to the TimberWolfSC placement tool, a natural point for solution exchanges is the end of each TimberWolfSC iteration.
In an actor framework such as ProperCAD II, each chain or search path is represented by a separate actor or thread of control. Since the actor model cannot assume a shared memory architecture, solution updates must be done with message sends. The barrier at the end of each segment is implicitly achieved through the use of these messages, as shown in Figure 9. When an actor has reached the end of its segment, it propagates a solution metric up to a master actor through a reduction tree. This metric is only a cost measurement of the solution and is not the entire global state. The master actor determines the best solution, and then directs the actor with the best solution to broadcast its state to all other actors. In the example in Figure 9, actor 3 is determined to have the best solution.
The barrier could be implemented in a single phase manner if each actor propagates its entire state to the master rather than just the cost metric. The master could then broadcast the state itself rather than request the winning actor to broadcast it. However, because of the size of the state in cell placement, the two phase method is more efficient. The implementation proposed by Lee and Lee can use a single phase by transferring the entire state through shared memory. Table 7 shows the message size needed to transfer the relevant circuit state for a sampling of circuits.
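The two-phase exchange can be sketched as below, with the actors simulated as entries of a list rather than as real message-passing actors. The function names and the message accounting in `words_sent` are assumptions for illustration only: the count treats each metric and the request to the winner as one word, and ignores how the reduction tree or broadcast is routed.

```python
def exchange_two_phase(states, cost_fn):
    """Two-phase barrier: reduce on cost metrics, then broadcast the winner's state."""
    # Phase 1: each actor sends only its scalar cost metric to the master.
    metrics = [cost_fn(s) for s in states]
    winner = min(range(len(states)), key=metrics.__getitem__)
    # Phase 2: the winning actor broadcasts its full state to the other actors.
    best = states[winner]
    return winner, [list(best) for _ in states]

def words_sent(n_actors, state_words, two_phase):
    """Rough payload count (assumed accounting, in words) for one exchange."""
    if two_phase:
        # n metrics up, one request to the winner, n - 1 copies of the state.
        return n_actors + 1 + (n_actors - 1) * state_words
    # Single phase: n full states up to the master, n - 1 copies back down.
    return n_actors * state_words + (n_actors - 1) * state_words
```

Under this accounting the two-phase scheme sends roughly half the data of the single-phase scheme once the state is much larger than a cost metric, which is the situation described for cell placement.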
Asynchronous multiple Markov chains
Figure 9 makes it clear that barriers can be costly operations, so an asynchronous approach is preferred. Figure 10 describes the actor interface for such an implementation. Notice that very few modifications have to be made to support the asynchronous method. In this approach, the master actor does not perform any computation. Instead, it serves as a repository for the best available solution at any particular time. When an actor has completed an iteration, it sends its solution metric to the master actor and requests the best solution available. On receipt of this request, the master actor determines whether the received solution is better than the local "best" solution. If it is better, the master asks the requestor to send its state back; remember that the requestor had initially sent only the metric. The requestor then sends its local state to the master and continues with the next iteration using its own local state. If the master determines that the received solution is worse than the best solution, it simply sends the current best state to the requestor. At the cost of dedicating an extra processor to the master, this asynchronous approach can eliminate much of the idle time that was present earlier.
Graphically, this algorithm is illustrated in Figure 11. For example, Actor 2 has finished its segment and sends its solution metric to the master, which determines that its solution is the best and sends a message back requesting Actor 2's local state. While waiting for Actor 2's state, the master has in the meantime received a solution metric from Actor 1.
Since it has not yet received Actor 2's state, the master must compare Actor 1's metric with the previous state. It then determines that Actor 1 has an inferior solution, and thus sends the previous state back to Actor 1. Note that, because of the asynchronous behavior, Actor 1 was not able to receive the best solution available at that point. This type of erroneous update is acceptable, since the actors can correct themselves at future iterations, and it also provides an opportunity to escape from local minima.
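The master's side of this asynchronous protocol can be sketched as a small state machine. The class and method names and the reply tags `"send_state"` and `"adopt"` are hypothetical; the sketch assumes lower cost is better and that the stored best is updated only when a full state arrives, which matches the Figure 11 scenario in which Actor 1 is compared against the previous state.

```python
class Master:
    """Holds the best known solution; performs no annealing itself."""

    def __init__(self, cost, state):
        self.best_cost = cost
        self.best_state = list(state)

    def on_metric(self, cost):
        """An actor finished a segment and reports only its cost metric."""
        if cost < self.best_cost:
            return ("send_state", None)             # ask the requestor for its state
        return ("adopt", list(self.best_state))     # reply with the stored best state

    def on_state(self, cost, state):
        """Follow-up message from an actor that was asked for its full state."""
        if cost < self.best_cost:                   # re-check: a better one may have arrived
            self.best_cost = cost
            self.best_state = list(state)


# Replaying the interleaving described for Figure 11 (the costs are made up):
master = Master(100, ["previous"])
reply_to_actor2 = master.on_metric(80)     # better than 100: state is requested
reply_to_actor1 = master.on_metric(110)    # Actor 2's state has not arrived yet
master.on_state(80, ["actor2"])            # only now does the stored best change
```

In this replay Actor 1 is handed the previous state rather than Actor 2's better one, which is the tolerated "erroneous update" discussed above.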
The execution time, $t_{ser}$, of a serial run of TimberWolfSC can be stated as $I t_m$, where $t_m$ is the time to perform a move and $I$ is the number of moves attempted. In an asynchronous MMC implementation, the run time becomes [equation missing in the source]. On an Intel Paragon, $t_m$ is on the order of 50 to 100 s, while $t_b$ is only 0.018 s. Thus it can be seen that $t_c$ is negligible, and the expected speedup should approach $N$.
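As a hedged reading of this argument, and not the paper's own run-time expression, suppose each of $N$ chains performs $I/N$ moves and sees a total communication overhead $t_c$. Both the form of $t_{par}$ and the interpretation of $t_c$ are assumptions made here to show why a negligible $t_c$ implies a speedup near $N$.

```latex
% Assumed model: N chains, I/N moves per chain, t_c = communication overhead per chain.
\begin{align*}
t_{ser} &= I\,t_m \\
t_{par} &\approx \frac{I}{N}\,t_m + t_c \\
S = \frac{t_{ser}}{t_{par}} &\approx \frac{I\,t_m}{\frac{I}{N}\,t_m + t_c}
  \;\longrightarrow\; N \quad \text{as } \frac{t_c}{I\,t_m} \to 0
\end{align*}
```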
Experimental observations
We examined the effectiveness of the asynchronous MMC algorithm on the Sun SparcServer 1000E as well as on the Intel Paragon, as shown in Tables 8 and 9. We start at 2 processors because of the need for a master processor. The quality of the solutions shows no degradation as the number of processors increases; in fact, the solutions sometimes improve because of the periodic exchange of solutions. As expected, communication time is not significant, since few messages are sent. Note that the algorithm scales much better than the parallel moves approach, and that the speedup and quality are much better than can be achieved with a pure parallel moves strategy as presented in the previous section.
If speed improvement is not the goal, ProperPLACE-MMC can be used in another mode to provide better solutions for the same run time. Table 10 compares TimberWolfSC on one processor with ProperPLACE-MMC on four processors. The solution quality is improved by an average of 9% at a cost of only 3% in runtime. To achieve that type of quality improvement in TimberWolfSC would require significantly more computation time.
Sun and Sechen have used a modified parallel moves approach to achieve similar results [40]. Though they use parallel moves and partitioned circuits, solutions are exchanged among processors at each iteration, as with multiple Markov chains. They do not need to process frequent update messages.
Speculative computation
Another approach recently suggested for generalized parallel simulated annealing is speculative computation [9]. Witte et al. applied this approach to the task assignment problem and found speedups approaching $\log_2 P$, where $P$ is the number of processors. In this section, we apply the concept of speculative computation to cell placement and determine the applicability of such an approach. We first give a brief description of the algorithm.
Generalized speculative computation
A sequential simulated annealing schedule is simply a series of move proposals intended to reduce some cost function related to the particular problem. Each move consists of three parts: the proposal or perturbation, the evaluation, and the decision. Only after these three parts are completed is the next move started. Since the decision made by the next move depends on the current state as determined by prior moves, simulated annealing is almost inherently serial in nature.
Consider the decision tree of moves in Figure 12a. The top node represents a move attempted in a simulated annealing process. There are two possible decisions as a result of this move: acceptance or rejection. Speculative computation assigns two different processors to work speculatively on the two possibilities before the parent move has completed.
The rejection child can start at the same time as the parent, since it will assume that the state has not changed. After the parent has completed the move proposal, it can then relay the new state to the accept processor.
As the acceptance characteristics of the procedure vary, the shape of the tree can also change. For example, if the acceptance rate is high, it would make sense to generate a linear tree of only acceptance nodes; on the other hand, a very low acceptance rate would imply the creation of only rejection nodes (Figures 12b and 12c). This latter mode is essentially the mode in which the algorithms proposed by [3] and [43] operate.
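How many moves a speculative tree commits per round depends on its shape. The following sketch, under simplifying assumptions, counts committed moves for a linear chain that always guesses the same outcome and for a full binary tree; the function names are hypothetical, and the model ignores all communication and replay costs.

```python
def committed_in_chain(decisions, predicted):
    """Moves committed by a linear speculative chain that always guesses `predicted`.

    decisions[k] is the real outcome (True = accept) of the k-th speculated move.
    Every node up to and including the first misprediction worked on a valid
    state; the nodes after it speculated on the wrong branch and are discarded.
    """
    for k, outcome in enumerate(decisions):
        if outcome != predicted:
            return k + 1
    return len(decisions)

def committed_in_balanced_tree(n_processors):
    """A full binary tree over n_processors nodes commits one move per level."""
    return (n_processors + 1).bit_length() - 1   # log2(n_processors + 1) for 2^d - 1 nodes
```

With seven processors arranged as a full binary tree, three moves are committed per round whatever the outcomes, in line with the $\log_2 P$ behavior reported for the general scheme; a linear chain does better only while its guess keeps matching the real decisions.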
Speculative computation for placement
Since speculative computation seems to be a promising avenue to achieve at least some speedup in the high temperature region, we decided to investigate the applicability of such an approach to the cell placement problem. The problem fits naturally into an actor based framework, in that each speculated move can be represented by an actor. One of the major changes we made to the algorithm was to add some asynchronous behavior by removing the need for a centralized root processor that was required to start off each tree. Eliminating this synchronization point allows multiple speculative trees to be active at once. Indexing was used to properly order the execution of trees.
Another major modification made to Witte's algorithm was to transfer just the move proposal to the accept child rather than the entire state after the move. That is, if the root node were to propose moving a cell to a new location, it would convey the cell number and new location to the accept child, and the child would be responsible for duplicating the move.
As with multiple Markov chains, this decision was made because of the potentially large size of the cell state. Once a speculative move has been determined to be false, the actor responsible for the move must then abort its move as well as all its parent's moves. It must then update its cell database with the moves from the correct path.
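Sending the move proposal instead of the full state means the accept child must replay the move locally and be able to roll it back on mis-speculation. A minimal sketch, assuming a toy placement stored as a dictionary from cell number to location; the `MoveProposal` type and the helper names are hypothetical, and real TimberWolfSC moves carry more information than a single cell and location.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoveProposal:
    cell: int        # cell number
    new_slot: int    # destination location

def apply_move(placement, move):
    """Apply a proposal in place and return the slot the cell came from."""
    old_slot = placement[move.cell]
    placement[move.cell] = move.new_slot
    return old_slot

def replay(placement, moves):
    """Accept child: duplicate the parent's moves locally, logging how to undo them."""
    undo_log = []
    for m in moves:
        undo_log.append(MoveProposal(m.cell, apply_move(placement, m)))
    return undo_log

def abort(placement, undo_log):
    """Mis-speculation: roll back the replayed moves in reverse order."""
    for m in reversed(undo_log):
        placement[m.cell] = m.new_slot
```

Each message then carries two integers rather than the whole cell database, at the price of the child spending time to redo the parent's work.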
After the modifications were made, the algorithm was run on a variety of circuits, as shown in Table 11. As can be seen from the table, the wirelengths are identical, as expected. The speedups, however, are disappointingly poor, as indicated by the drastic slowdowns. The primary reason for this behavior is that Witte's model does not hold when applied to cell placement, and to TimberWolfSC in particular. Witte makes two main assumptions: first, that the time to propose and perform a move is small compared to the evaluation of the move, and second, that the outcome of that move, or the resultant state, is easily communicable. Consider the following optimistic analysis. For parallel speculative computation, the execution time is $t_{par} = \frac{I}{L}\left((L-1)(t_c + t_{sm}) + t_m\right)$. The term $(L-1)(t_c + t_{sm})$ is the cost due to performing the parents' moves. If this term is small, then the speedup is simply $L$, which in the extreme temperature regions is $N$, and at worst is $\log N$. However, our measurements have shown that this cost is not negligible. They show that the speculative move time ($t_{sm}$), the time that a child node needs to perform its parent's move, is comparable with that of a non-speculative move. The other component, $t_c$, is also very high due to the relatively high acceptance rate of TimberWolfSC. The combination of these factors leads to a very high overhead for parallel speculative computation. In light of these problems, it is clear that speculative computation, though presented as a generalized parallel simulated annealing algorithm, is not a feasible approach to parallelizing cell placement.
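The observed slowdowns follow from the execution-time expression $t_{par} = \frac{I}{L}\left((L-1)(t_c + t_{sm}) + t_m\right)$ once the measured relation between $t_{sm}$ and $t_m$ is substituted. The short derivation below is a sketch that takes the serial time as $t_{ser} = I t_m$ and treats "comparable" as approximate equality, which is an assumption.

```latex
\begin{align*}
S = \frac{t_{ser}}{t_{par}}
  &= \frac{I\,t_m}{\frac{I}{L}\bigl((L-1)(t_c + t_{sm}) + t_m\bigr)}
   = \frac{L\,t_m}{(L-1)(t_c + t_{sm}) + t_m} \\
% Substituting t_{sm} \approx t_m (assumed reading of "comparable"):
S &\approx \frac{L\,t_m}{L\,t_m + (L-1)\,t_c} \;\le\; 1
  \qquad (L \ge 1,\; t_c \ge 0)
\end{align*}
```

Under this substitution the speedup cannot exceed 1 for any tree depth $L$, and any nonzero communication cost $t_c$ with $L > 1$ pushes it below 1.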
Conclusions and future work
We have investigated the applicability of three different parallel simulated annealing strategies to the problem of standard cell placement. The first strategy, parallel moves, uses priorities and dynamic message sizing to deliver good, consistent speedups with little degradation in wire length. Multiple Markov chains appear promising as a means of achieving moderate speedup without losing quality, and in some cases even improve quality.
Speculative computation, however, is shown to be inadequate as a means of parallelization of cell placement. A combination of the parallel moves approach with intermediate exchanges as in multiple Markov chains may offer benefits in terms of reducing the error present in the parallel moves approach alone.
With all the parallel strategies, we have demonstrated that it is possible to design and implement a portable parallel placement tool that is cleanly and efficiently interfaced with a simulated annealing-based uniprocessor algorithm using the ProperCAD II environment. The parallel placement tool runs unchanged on a range of MIMD machines, including shared memory machines, distributed memory machines, and networks of workstations. We believe that this is the first parallel placement application to effectively exploit shared memory machines, non-shared memory machines, and networks of workstations in this manner. Another important feature of ProperPLACE is that future improvements to the uniprocessor algorithm will result in improvements to the parallel performance of ProperPLACE with minimal effort. It is only necessary to keep the interface between the uniprocessor and parallel placement algorithms unchanged.
Acknowledgements
Our sincere thanks to Dr. Carl Sechen for providing us with the source code to TimberWolfSC 6.0. We would also like to acknowledge the support of the San Diego Supercomputing Center for granting us access to their and Intel Paragon.
