Asynchronous MMC based Parallel SA Schemes for Multiobjective Standard Cell Placement by Sait, Sadiq M. et al.
Sadiq M. Sait, Ali Mustafa Zaidi, Mustafa Imran Ali
College of Computer Sciences & Engineering
King Fahd University of Petroleum & Minerals
Dhahran 31261, Saudi Arabia
E-mail: {sadiq, alizaidi, mustafa}@ccse.kfupm.edu.sa
Abstract - Simulated Annealing (SA) is a popular iterative heuristic used to solve a wide variety of combinatorial optimization problems. However, depending on the size of the problem, it may have large run-time requirements. One practical approach to speed up its execution is to parallelize it. In this paper we develop parallel SA schemes based on the Asynchronous Multiple-Markov Chain model (AMMC) described in [1] and applied to standard-cell placement in [2]. The schemes are applied to solve the multi-objective standard cell placement problem using an inexpensive cluster-of-workstations environment. This problem requires the optimization of conflicting objectives (interconnect wire-length, power dissipation, and timing performance), and fuzzy logic is used to integrate the costs of these objectives [3], [4]. Experiments are performed on ISCAS-85/89 benchmark circuits. Our goal is to develop parallel SA schemes that provide significantly improved runtime/solution quality characteristics for this key CAD problem, by making the best possible use of an inexpensive parallel environment.
I. INTRODUCTION
With VLSI technologies moving towards ever more complex submicron-scale circuit fabrication, there is a demanding need for iterative heuristics that can deliver quality solutions in feasible runtimes. This is especially true with the often conflicting, multiple objectives that have to be addressed. Stochastic heuristics such as Simulated Annealing are very useful for obtaining near-optimal solutions for this class of problems. However, there is the pressing issue of huge runtime requirements that can grow exponentially with circuit size.
One way to adapt iterative techniques to solve large problems and traverse larger search spaces in reasonable time is to parallelize them [5], [6]. The eventual goal is to achieve either much lower run-times for same-quality solutions, or higher-quality solutions in a fixed amount of time. This, however, is easier said than done: an effective parallelization strategy must consider issues such as proper partitioning of the problem to facilitate uniform distribution of computationally intensive tasks, and enabling a more thorough traversal of the complex search space, all the while respecting the benefits and limitations of the selected parallel environment.
Parallel Simulated Annealing has been the subject of thorough exploration since it was first proposed. Virtually all known methods of parallelization for Simulated Annealing can be classified into one of two groups: Single Markov-chain and Multiple Markov-chain methods [1]. Most Single Markov chain approaches attempt to exploit parallelism between the three phases. They include move-acceleration, parallel-moves, and speculative annealing. These techniques are generally more suitable for shared-memory environments. Approaches based on Multiple Markov chains call for the concurrent execution of separate simulated annealing chains with periodic exchange of solutions [1], [2], and are ideally suited for distributed memory systems, considering that the need for communication between nodes is considerably reduced.
In our work, we experiment with different versions of the Asynchronous Multiple-Markov Chain Parallel SA (or AMMC PSA) approach described in [1], as this scheme is most suitable for a basic VLSI cell placement problem being solved using a cluster-of-workstations environment [2].
II. PARALLELIZATION SCHEMES AND RESULTS
A. Experimental Setup and Placement Cost-Functions
The experimental setup consists of a homogeneous cluster of 8 Pentium-4 2 GHz machines with 256 MB of memory. These machines are connected by a 1 Gbit/s Ethernet switch. The operating system used is Redhat Linux 7.3 (kernel 2.4.7-10). The algorithms were implemented in C/C++, using MPICH ver. 1.2.4. ISCAS-89 circuits are used as performance benchmarks for evaluating the parallel metaheuristics.
Our placement optimization problem is of a multiobjective nature with three design objectives, namely interconnect wire-length, power consumption, and timing performance (delay). The layout width is taken as a constraint. Fuzzy logic was used for designing an aggregating cost function, allowing us to describe the objectives in terms of linguistic variables. Then, fuzzy rules are used to find the overall cost, or 'quality', of a placement solution. This quality measure is a value between 0 and 1, with 1 representing an optimal solution. A detailed description of our combinatorial optimization problem and cost functions is given in [3].
B. Attempted Parallelization Strategies
Recent work has shown that the most promising scheme for the parallelization of SA for the VLSI Placement problem in a cluster-of-workstations environment was the Asynchronous MMC model [1], [2]. The primary goals of our experiments were to explore the potential for improvements in runtime
0-7803-9390-2/06/$20.00 ©2006 IEEE 4615 ISCAS 2006
and achievable solution quality, by making the most effective utilization of the parallel environment. Successive parallel strategies attempt to incrementally build upon the knowledge gathered from the previous schemes in order to achieve the goals of parallelization more effectively.
The basic structure of our AMMC PSA implementation is similar to the scheme described in [2]. On each available processing element, an SA operation is initiated with the same starting solution, but with different seeds for pseudo-randomization. The specifications of our AMMC parallel search implementation of SA are given below:
1) The Information Exchanged: The entire recent best solution is communicated to slave processes.
2) Connection Topology: The parallel processes communicate via a central solution storage area, where the best solution found so far is kept. The master process is reserved for this purpose.
3) Communication Mode: Communication is asynchronous. Thus communication time is minimized since there are no synchronization barriers. Each process communicates with the master independently and compares its own best solution with the solution residing at the master. If the master owns the better solution, the slave starts its next Metropolis loop with this solution, while the master's copy remains unchanged. Conversely, if the slave has the better solution, it continues its work after the master has received this latest best solution, which is then available for comparison by the other slave processes.
4) Time to Exchange Information: Each process works on a recent best solution retrieved from the central store for the duration of its Metropolis loop.
We implement four distinct versions of the Asynchronous Multiple Markov Chains approach.
1) Asynchronous MMC Parallel SA Strategy 1: For Strategy 1, aside from the above points, there is no difference between the serial version and each of the parallel search processes. Such an approach has been found to improve solution qualities in a fixed amount of time [1], and our results corroborate this fact.
TABLE I
RESULTS FOR STRATEGY 1
Circuit  Number    μ(s)       Serial SA  Time for Parallel SA Strategy 1
Name     of Cells  Serial SA  Time       p=3     p=4     p=5     p=6     p=7
s1196    561       0.675340   190        145.98  130.95  110.31  96.98   98.24
s1238    540       0.699469   212        183.91  130.32  127.55  117.12  114.66
s1488    667       0.650381   275        151.46  118.44  112.59  98.94   92.26
Table I shows the results obtained from experiments with Strategy 1 for the benchmark circuits listed in column 1. The third column lists the highest quality achieved by the serial version of the algorithm. The remaining columns list the time taken to achieve the specified quality with the given number of processors. Using Strategy 1, we were always able to exceed the quality achieved by the serial version. Figure 1(a) shows the speedups achieved by Strategy 1, for the same quality, with different numbers of processors and for different circuits. Here we see that the speedup achieved using Strategy 1 is sub-linear. Even with 8 processors, we are unable to achieve a speedup of 3.
2) Asynchronous MMC Parallel SA Strategy 2: While Strategy 1 is able to meet and even surpass the qualities achieved by the serial algorithm, its runtime characteristics leave something to be desired. Strategy 2 is an attempt to provide near-linear speedup over the serial version. This is accomplished by dividing the amount of work done at each of the individual processes by the total number of processes. Specifically, the number of Metropolis iterations at each process is divided by the total number of processes.
Table II shows the results obtained from experiments with Strategy 2. Unlike the previous table, the third column here shows the highest common quality that could be achieved by multiple runs of Strategy 2 for every number of processors. Comparing with column 3 of Table I, we can easily note that there is a roughly 10 to 15% drop in achievable quality with this scheme. Figure 1(b) shows the speedups achieved by Strategy 2 as the number of processors is varied. We see that in this case, speedup is almost linear.
TABLE II
RESULTS FOR STRATEGY 2
Circuit  Number    μ(s)       Serial SA  Time for Parallel SA Strategy 2
Name     of Cells  Serial SA  Time       p=3     p=4     p=5     p=6     p=7
s1196    561       0.630367   103        44.67   31.32   22.81   18.47   16.46
s1238    540       0.630573   117        58.03   39.21   26.31   22.31   19.73
s1488    667       0.582884   101        42.67   25.59   18.77   16.61   15.85
s1494    661       0.591114   75         51.11   30.79   22.32   15.82   14.9
3) Asynchronous MMC Parallel SA Strategy 3: The loss in achievable quality in Strategy 2 can be understood by looking at how the intelligence of the algorithm is affected by division of the factor 'M'. All of the parameters of the cooling schedule were originally optimized for the serial Simulated Annealing. Since SA convergence is highly sensitive to the cooling schedule, it is understandable that such a drastic change to one of its parameters would result in lower quality solutions. Division of 'M' reduces the amount of time each processor spends searching for a better solution in the vicinity of a previous good solution, resulting in a less thorough parallel search of the neighboring solution space.
In Strategy 3, we attempted to offset the negative impact on algorithmic intelligence by introducing other enhancements to the parallel algorithm. This was done by implementing different cooling schedules on each processor in such a way that some of the processors are searching for new solutions in a greedy manner, while others are still in the high temperature region. We essentially aim to counterbalance the impact of shortened Markov chains on achievable quality by making intelligent use of the interaction between chains that occurs after every Metropolis loop.
This is different from the Temperature Parallel Simulated Annealing (TPSA) approach described in [7], which maintains all the parallel processes at constant but different temperatures. In Strategy 3, by contrast, the value of alpha differs across processors, so the rate of temperature change varies from processor to processor. This is because our intended goals
[Figure 1: three panels, (a) Strategy 1 Speedup vs. Number of Processors, (b) Strategy 2 Speedup vs. Number of Processors, (c) Strategy 3 Speedup vs. Number of Processors; horizontal axis: Circuits (s1196, s1238, s1488, s1494); one series per processor count p.]
Fig. 1. Speedup versus number of machines for (a) Parallel SA AMMC Strategy 1; (b) Parallel SA AMMC Strategy 2; (c) Parallel SA AMMC Strategy 3.
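As a worked check of the speedups summarized in Fig. 1, the ratio of serial to parallel run time can be computed directly from the tabulated times. The sketch below is illustrative only: it assumes the usual definition, speedup = serial time / parallel time at matched solution quality (the text does not spell the formula out), it uses the s1196 rows of Tables I and II, and the names `speedup` and `speedup_table` are hypothetical helpers rather than part of the authors' C/C++ implementation.

```python
# Run times for circuit s1196, copied from Tables I and II.
# Assumption: speedup = serial time / parallel time at the same solution quality.
SERIAL_TIME = {"strategy1": 190.0, "strategy2": 103.0}
PARALLEL_TIME = {
    "strategy1": {3: 145.98, 4: 130.95, 5: 110.31, 6: 96.98, 7: 98.24},
    "strategy2": {3: 44.67, 4: 31.32, 5: 22.81, 6: 18.47, 7: 16.46},
}


def speedup(serial_time, parallel_time):
    """Return the ratio of serial to parallel run time."""
    if parallel_time <= 0:
        raise ValueError("parallel_time must be positive")
    return serial_time / parallel_time


def speedup_table(strategy):
    """Map each processor count p to its speedup for the given strategy."""
    serial = SERIAL_TIME[strategy]
    return {p: speedup(serial, t) for p, t in PARALLEL_TIME[strategy].items()}


if __name__ == "__main__":
    for name in ("strategy1", "strategy2"):
        row = ", ".join(f"p={p}: {s:.2f}" for p, s in speedup_table(name).items())
        print(f"{name}: {row}")
```

For this circuit the Strategy 1 ratio stays below 2 for every processor count, while the Strategy 2 ratio exceeds 6 at p=7, which matches the sub-linear versus near-linear behaviour described for the two strategies.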
are different from those of TPSA. Whereas our primary aim is to achieve serial-equivalent qualities with near-linear speedup, the aim of TPSA was primarily to enhance the robustness of Parallel SA, and minimize the amount of effort required in parameter setting.
However, we find that even this proposed enhancement of varying alpha is insufficient to counteract the impact of divided 'M'. Our results for Strategy 3, shown in Table III and Figure 1(c), show no improvement over the results obtained for Strategy 2 - for some circuits (e.g. s1196), there is even a drop in achievable speedup and quality.
TABLE III
RESULTS FOR STRATEGY 3
Circuit  Number    μ(s)       Serial SA  Time for Parallel SA Strategy 3
Name     of Cells  Serial SA  Time       p=3     p=4     p=5     p=6     p=7
s1196    561       0.606818   64         38.85   20.40   18.68   15.41   13.55
s1238    540       0.630573   117        65.36   26.65   22.65   19.39   18.04
s1488    667       0.582884   101        43.71   18.46   15.96   14.49   13.29
s1494    661       0.591114   75         42.89   20.05   17.92   13.86   13.67
4) Asynchronous MMC Parallel SA Strategy 4 - Adaptive Cooling Schedule: From the results of the previous three strategies, it became evident that for parallel SA, if any progress is to be made towards achieving our goals of near-linear speedup with sustained quality, an in-depth study of the impact of parameter M on achievable solution quality is required. To this end, we ran several experiments on both the serial and parallel (7 processors) versions, keeping all things constant except M, which was divided by 9, 17, 25, and 57 respectively for each new run. The Quality vs. Runtime results for the runs on 7 processors are given in Figure 2.
[Figure 2: parallel run characteristics (quality versus time) for different division factors of M, 7 processors.]
... quality. Intuitively this would suggest that the M factor should start at a small value, and then should increase as solution quality rises. However, a balance is necessary: if M increases too fast, runtime is compromised; if M increases too slowly, achievable solution quality is affected. The key to this dilemma of approximating the appropriate value of M comes from an interesting observation made during these runs: during the steep improvement phase the rate of improvement in solution quality is constant per Metropolis call - meaning that during the initial phase, the high rate of climb is primarily due to the short time spent in each Metropolis call.
Based on what we have learned from these experiments, we proposed certain modifications to the cooling schedule of our basic, serial Simulated Annealing algorithm. This adaptive cooling schedule, when implemented for the parallel AMMC scheme, yielded our 4th parallel search SA strategy. A brief description of the adaptive cooling schedule is given below:
1) For the first 100 or so Annealing iterations, accumulate an average of the quality improvement per Metropolis function call. This average rate of improvement will serve as a threshold that needs to be maintained per Metropolis function call.
2) Initially, the value of 'M' is set to a very small value - the value used in the basic algorithm is divided by 25 to provide the initial M in the adaptive version.
3) After the initial average accumulation iterations, adaptivity is initiated. If the rate of improvement drops below a certain threshold, increase M incrementally, since not enough time is being spent at each temperature level.
4) If the rate of improvement is constantly more than the threshold value, decrease M, since an unnecessary amount of time is being spent at the given quality level.
5) The value of the M parameter is not allowed to exceed twice the value used in the original basic version, until significant stagnation is detected (e.g. no improvement in solution quality for the past 25 Metropolis calls).
The application of the last condition was empirically found to dramatically improve algorithm run times, without sacrific-
... schemes for achieving the quality targets set by Strategy 2. For all runs and all circuits on any number of processors, Strategy 4 manages to achieve significantly higher qualities than either Strategy 1 or Strategy 2.
TABLE IV
RESULTS FOR ADAPTIVE STRATEGY 4 (STRATEGY 1 QUALITIES)
Circuit  Number    μ(s)       Serial SA  Time for Parallel SA Strategy 4
Name     of Cells  Serial SA  Time       p=3     p=4     p=5     p=6     p=7
s1196    561       0.675340   75.4       60.31   47.87   47.34   46.25   42.44
s1238    540       0.699469   115.9      96.45   84.21   67.59   63.05   53.79
s1488    667       0.650381   106.6      77.84   70.62   59.92   51.80   43.38
s1494    661       0.647920   139.7      101.1   77.38   76.68   59.68   50.12
TABLE V
RESULTS FOR ADAPTIVE STRATEGY 4 (STRATEGY 2 QUALITIES)
Circuit  Number    μ(s)       Serial SA  Time for Parallel SA Strategy 4
Name     of Cells  Serial SA  Time       p=3     p=4     p=5     p=6     p=7
s1196    561       0.630367   37.35      23.71   23.24   21.74   20.57   17.95
s1238    540       0.630573   45.85      33.76   24.52   19.65   23.53   15.03
s1488    667       0.582884   29.59      21.35   18.26   13.36   13.46   12.84
s1494    661       0.591114   46.92      27.78   20.09   20.14   17.68   18.16
As can be seen, both the serial and parallel run times have improved dramatically over Strategy 1, while the runtimes are largely equivalent to those of Strategy 2.
Note however, that the speedup characteristics of Strategy 4 are very similar to those of Strategy 1: for the given quality values, speedup never exceeds 3 (Figure 3).
III. DISCUSSION AND ANALYSIS
The intelligence of SA lies in its cooling schedule. In AMMC PSA, each independent parallel search chain periodically starts its search from the best available solution at the time. This allows the parallel search to be focused around a recent best solution, which would be the logical place to look for an even better solution. Thus not only does the algorithmic intelligence remain undivided, it is further enhanced using the AMMC approach, allowing the achievement of better solutions in the same or lesser amount of time, as is the case for Strategies 1 and 4.
[Figure 3: Strategy 4 (adaptive) speedup for Strategy 1 qualities; horizontal axis: Circuits (s1196, s1238, s1488, s1494); one series per processor count p.]
Fig. 3. Speedup Characteristics of Parallel Adaptive SA (Strategy 4) for solution qualities of Strategy 1
As for Strategies 2 and 3, however, we see that although a division of the workload has a positive impact on runtime, there is an adverse impact on achievable quality. This can be understood by looking at how the intelligence of the algorithm is affected by the division of M: dividing M reduces the amount of time each processor spends searching for a better solution in the vicinity of a previous good solution, resulting in a less thorough parallel search of the neighboring solution space. Even the proposed enhancement of varying other parameters across processors, as done in Strategy 3, is insufficient to counteract the impact of divided M.
The Strategy 4 parallel SA approach was implemented after a careful study of the impact of varying M on achievable solution quality. The adaptive nature of the cooling schedule allows this technique to achieve high quality results in significantly reduced runtimes, when compared with the originally implemented algorithms. However, compared to the serial SA version with a similar adaptive cooling schedule, the speedup benefits are less significant, in fact more comparable to the difference between Strategy 1 and the original serial SA - achieving the same quality solution in slightly less time.
IV. CONCLUSION
In this paper, we have presented 4 distinct implementations of AMMC PSA. Strategy 1 provides significantly better solution qualities than the serial algorithm, but only modest speedup. Strategies 2 and 3 suffer a quality loss of 10 to 15%, but provide near-linear speedups for the achieved qualities. Our best parallel implementation in terms of both achievable solution quality and run time was Strategy 4 - a new implementation of Simulated Annealing utilizing an adaptive cooling schedule. Both the serial and parallel versions of this approach have shown dramatic improvements in achievable quality and runtime over the basic Simulated Annealing implementation, as well as over the first three Strategies. We are currently exploring further methods of enhancing this scheme in an attempt to achieve better speedup characteristics.
ACKNOWLEDGMENT
The authors thank King Fahd University of Petroleum & Minerals (KFUPM), Dhahran, Saudi Arabia, for support under Project Code COE/CELLPLACE/263.
REFERENCES
[1] S.-Y. Lee and K. G. Lee, "Synchronous and asynchronous parallel simulated annealing with multiple-Markov chains," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 10, pp. 993-1008, October 1996.
[2] J. Chandy, S. Kim, B. Ramkumar, S. Parkes, and P. Banerjee, "An evaluation of parallel simulated annealing strategies with application to standard cell placement," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 16, no. 4, pp. 398-410, April 1997.
[3] J. A. Khan, S. M. Sait, and M. R. Minhas, "Fuzzy Biasless Simulated Evolution for Multiobjective VLSI Placement," in Proceedings of the IEEE Congress on Evolutionary Computation, CEC'2002, vol. 2, pp. 1642-1647, 2002.
[4] S. M. Sait and H. Youssef, VLSI Physical Design Automation: Theory and Practice. World Scientific, Singapore, 2001.
[5] P. Banerjee, Parallel Algorithms for VLSI Computer-Aided Design. Prentice Hall International, 1994.
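The adaptive cooling schedule of Strategy 4 (steps 1 to 5 in Section II-B) can be illustrated with a small controller for the Metropolis loop length M. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `AdaptiveM`, the increment `step`, the reading of "constantly more than the threshold" as `above_streak` consecutive calls, the lower bound of 1 on M, and the choice to lift the cap while stagnation persists are all hypothetical choices made for illustration. Only the warm-up length (about 100 calls), the initial division by 25, the cap at twice the basic M, and the 25-call stagnation window come from the text.

```python
class AdaptiveM:
    """Hypothetical controller for the Metropolis loop length M (steps 1-5)."""

    def __init__(self, base_m, warmup_calls=100, divisor=25, step=1,
                 above_streak=5, stagnation_calls=25):
        self.base_m = base_m
        self.m = max(1, base_m // divisor)        # step 2: start from base M / 25
        self.warmup_calls = warmup_calls          # step 1: "first 100 or so" calls
        self.step = step                          # assumed increment size
        self.above_streak = above_streak          # assumed meaning of "constantly"
        self.stagnation_calls = stagnation_calls  # step 5: 25 calls without gain
        self.calls = 0
        self.total_gain = 0.0
        self.threshold = None
        self.streak = 0
        self.stagnant = 0

    def update(self, gain):
        """Record the quality gain of one Metropolis call; return M for the next."""
        self.calls += 1
        self.stagnant = 0 if gain > 0 else self.stagnant + 1
        if self.calls <= self.warmup_calls:
            # Step 1: accumulate the average improvement per Metropolis call.
            self.total_gain += gain
            if self.calls == self.warmup_calls:
                self.threshold = self.total_gain / self.warmup_calls
            return self.m
        if gain < self.threshold:
            # Step 3: too little progress, spend more time per temperature.
            self.streak = 0
            self.m += self.step
        elif gain > self.threshold:
            # Step 4: sustained progress above the threshold, shorten the loop.
            self.streak += 1
            if self.streak >= self.above_streak:
                self.m = max(1, self.m - self.step)
                self.streak = 0
        else:
            self.streak = 0
        # Step 5: cap M at twice the basic value until stagnation is detected.
        if self.stagnant < self.stagnation_calls:
            self.m = min(self.m, 2 * self.base_m)
        return self.m
```

In an annealing loop, `update` would be called once after each Metropolis call with the change in the fuzzy quality measure, and the returned value used as the number of Metropolis iterations at the next temperature level.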
