Abstract: Simulated Annealing (SA) is a popular iterative heuristic used to solve a wide variety of combinatorial optimization problems. However, depending on the size of the problem, it may have large run-time requirements. One practical approach to speed up its execution is to parallelize it. In this paper we develop parallel SA schemes based on the Asynchronous Multiple-Markov Chain model (AMMC) described in [1] and applied to standard-cell placement in [2]. The schemes are applied to solve the multi-objective standard cell placement problem using an inexpensive cluster-of-workstations environment. This problem requires the optimization of conflicting objectives (interconnect wire-length, power dissipation, and timing performance), and fuzzy logic is used to integrate the costs of these objectives [3], [4]. Experiments are performed on ISCAS-85/89 benchmark circuits. Our goal is to develop parallel SA schemes that provide significantly improved runtime/solution quality characteristics for this key CAD problem, by making the best possible use of an inexpensive parallel environment.

0-7803-9390-2/06/$20.00 ©2006 IEEE

I. INTRODUCTION

With VLSI technologies moving towards ever more complex submicron-scale circuit fabrication, there is a demanding need for iterative heuristics that can deliver quality solutions in feasible runtimes. This is especially true with the often conflicting, multiple objectives that have to be addressed. Stochastic [...]

One way to adapt iterative techniques to solve large problems and traverse larger search spaces in reasonable time is to parallelize them [5], [6]. The eventual goal is to achieve either much lower run-times for same-quality solutions, or higher-quality solutions in a fixed amount of time. This, however, is easier said than done: an effective parallelization strategy must consider issues such as proper partitioning of the problem to facilitate uniform distribution of computationally intensive tasks, and enabling a more thorough traversal of the complex search space, all the while respecting the benefits and limitations of the selected parallel environment.

Parallel Simulated Annealing has been the subject of thorough exploration since it was first proposed. Virtually all the known methods of parallelization for Simulated Annealing can be classified into one of two groups: Single Markov-chain and Multiple Markov-chain methods [1]. Most Single Markov chain approaches attempt to exploit parallelism between the three phases. They include move-acceleration, parallel-moves, and speculative annealing. These techniques are generally more suitable for shared-memory environments. Approaches based on Multiple Markov chains call for the concurrent execution of separate simulated annealing chains with periodic exchange of solutions [1], [2], and are ideally suited for distributed memory systems, considering that the need for communication between nodes is considerably reduced.

In our work, we experiment with different versions of the Asynchronous Multiple-Markov Chain Parallel SA (or AMMC PSA) approach described in [1], as this scheme is most suitable for a basic VLSI cell placement problem being solved using a cluster-of-workstations environment [2].

II. PARALLELIZATION SCHEMES AND RESULTS

A. Experimental Setup and Placement Cost-Functions

The experimental setup consists of a homogeneous cluster of 8 Pentium-4 2 GHz machines with 256 MB of memory. These machines are connected by a 1 Gbit/s Ethernet switch. The operating system used is Redhat Linux 7.3 (kernel 2.4.7-10). The algorithms were implemented in C/C++, using MPICH ver. [...]

[...] nature with three design objectives, namely interconnect wire-length, power consumption, and timing performance (delay). The layout width is taken as a constraint. Fuzzy logic was used for designing an aggregating cost function, allowing us to describe the objectives in terms of linguistic variables. Then, fuzzy rules are used to find the overall cost, or 'quality', of a placement solution. This quality measure is a value between 0 and 1, with 1 representing an optimal solution. A detailed description of our combinatorial optimization problem and cost functions is given in [3].

B. Attempted Parallelization Strategies

Recent work has shown that the most promising scheme for the parallelization of SA for the VLSI placement problem in a cluster-of-workstations environment is the Asynchronous MMC model [1], [2]. The primary goals of our experiments were to explore the potential for improvements in runtime and achievable solution quality, by making the most effective utilization of the parallel environment. Successive parallel strategies attempt to incrementally build upon the knowledge gathered from the previous schemes in order to achieve the goals of parallelization more effectively.

The basic structure of our AMMC PSA implementation is similar to the scheme described in [2]. On each available processing element, an SA operation is initiated with the same starting solution, but with different seeds for pseudo-randomization. The specifications of our AMMC parallel search implementation of SA are given below:

1) The Information Exchanged: The entire recent best solution is communicated to slave processes.

[...]

Table I shows the results obtained from experiments with Strategy 1 for the benchmark circuits listed in column 1. The third column lists the highest quality achieved by the serial version of the algorithm. The remaining columns list the time taken to achieve the specified quality, with the given number of processors. Using Strategy 1, we were always able to exceed the quality achieved by the serial version. Figure 1(a) shows the speedups achieved by Strategy 1, for the same quality, with different numbers of processors and for different circuits. Here we see that the speedup achieved using Strategy 1 is sub-linear. Even with 8 processors, we are unable to achieve a speedup of 3.

2) Asynchronous MMC Parallel SA Strategy 2: While Strategy 1 is able to meet and even surpass the qualities achieved by the serial algorithm, its runtime characteristics leave something to be desired. Strategy 2 is an attempt to provide near-linear speedup over the serial version. This is accomplished by dividing the amount of work done at each of the individual processes by the total number of processes. Specifically, the number of Metropolis iterations at each process is divided by the total number of processes.

[...] shortened Markov-chains on achievable quality by making intelligent use of the interaction between chains that occurs after every Metropolis loop.

This is different from the Temperature Parallel Simulated Annealing (TPSA) approach described in [7], which maintains all the parallel processes at constant but different temperatures. In Strategy 3, by contrast, the value of alpha is different on different processors, so the rate of temperature change varies across processors. This is because our intended goals [...] required. To this end, we ran several experiments on both the serial and parallel (7 processors) versions, keeping all things constant except M, which was divided by 9, 17, 25, and 57 respectively for each new run. The Quality vs. Runtime results for the runs on 7 processors are given in Figure 2.

[...]

2) Initially, the value of 'M' is set to a very small value: the value used in the basic algorithm is divided by 25 to provide the initial M in the adaptive version.
3) After the initial average accumulation iterations, adaptivity is initiated. If the rate of improvement drops below a certain threshold, increase M incrementally, since not
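
The work split described for Strategy 2 (the number of Metropolis iterations at each process divided by the total number of processes) can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the implementation reported here is in C/C++ with MPICH. It emulates the chains serially instead of exchanging solutions asynchronously, and it uses a toy integer state and a toy cost to be minimized in place of the fuzzy placement quality. The names `iterations_per_process`, `metropolis_loop`, and `one_round` are hypothetical, as is the choice of seeds.

```python
import math
import random


def iterations_per_process(serial_iterations, num_processes):
    """Strategy 2 work split: each chain runs M / p Metropolis iterations."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return max(1, serial_iterations // num_processes)


def metropolis_loop(cost, start, temperature, iterations, rng):
    """Run one Metropolis loop at a fixed temperature; return the best state seen."""
    current = best = start
    for _ in range(iterations):
        candidate = current + rng.choice((-1, 1))  # toy neighbourhood move (assumption)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if cost(current) < cost(best):
                best = current
    return best


def one_round(cost, start, temperature, serial_iterations, num_processes):
    """Serially emulate one round: p chains, same start, different seeds."""
    m = iterations_per_process(serial_iterations, num_processes)
    results = [
        metropolis_loop(cost, start, temperature, m, random.Random(seed))
        for seed in range(num_processes)  # one seed per emulated process (assumption)
    ]
    # Stand-in for the exchange step: keep the best solution found by any chain.
    return min(results, key=cost)
```

With a serial loop length of 1000 and 8 processes, `iterations_per_process(1000, 8)` gives 125 iterations per chain, so the total number of moves evaluated across all chains stays close to that of the serial run while each chain finishes its loop sooner.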
