Parallel mapping and circuit partitioning heuristics based on mean field annealing by Bultan, Tevfik
PARALLEL MAPPING AND CIRCUIT 
PARTITIONING HEURISTICS BASED ON MEAN
FIELD ANNEALING
A THESIS
SUBMITTED TO THE DEPARTMENT OF COMPUTER
ENGINEERING AND INFORMATION SCIENCE
AND THE INSTITUTE OF ENGINEERING AND SCIENCE 
OF BILKENT UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS





I certify that I have read this thesis and that in my opinion
it is fully adequate, in scope and in quality, as a thesis
for the degree of Master of Science.
Assoc. Prof. Cevdet Aykanat (Principal Advisor)
I certify that I have read this thesis and that in my opinion
it is fully adequate, in scope and in quality, as a thesis
for the degree of Master of Science.
Assoc. Prof. Kemal Oflazer
I certify that I have read this thesis and that in my opinion
it is fully adequate, in scope and in quality, as a thesis
for the degree of Master of Science.
Asst. Prof. Ihsan Sabuncuoglu
Approved by the Institute of Engineering and Science:
Prof. Mehmet Baray, Director of the Institute of Engineering and Science
ABSTRACT
PARALLEL MAPPING AND CIRCUIT PARTITIONING 
HEURISTICS BASED ON MEAN FIELD ANNEALING
Tevfik Bultan
M.S. in Computer Engineering and Information Science
Supervisor: Assoc. Prof. Cevdet Aykanat
January 1992
The Mean Field Annealing (MFA) algorithm, recently proposed for solving
combinatorial optimization problems, combines the characteristics of neural
networks and simulated annealing. In this thesis, MFA is formulated for the
mapping problem and the circuit partitioning problem. Efficient implementation
schemes, which decrease the complexity of the proposed algorithms by
asymptotic factors, are also given. The performance of the proposed MFA
algorithms is evaluated in comparison with two well-known heuristics: simulated
annealing and Kernighan-Lin. Results of the experiments indicate that MFA
can be used as an alternative heuristic for the mapping problem and the
circuit partitioning problem. The inherent parallelism of MFA is exploited by
designing efficient parallel algorithms for the proposed MFA heuristics. The
parallel MFA algorithms proposed for solving the circuit partitioning problem
are implemented on an iPSC/2 hypercube multicomputer. Experimental results
show that the proposed heuristics can be efficiently parallelized, which is crucial
for algorithms that solve such computationally hard problems.
iPSC/2 is a registered trademark of Intel Corporation.
Keywords: Mean Field Annealing, Neural Networks, Simulated Annealing,
Combinatorial Optimization, Mapping Problem, Circuit Partitioning Problem,
Parallel Processing, Multicomputers.
ÖZET
PARALLEL MAPPING AND CIRCUIT PARTITIONING ALGORITHMS
BASED ON MEAN FIELD ANNEALING
Tevfik Bultan
Department of Computer Engineering and Information Science
Master of Science
Thesis Supervisor: Assoc. Prof. Cevdet Aykanat
January 1992
The Mean Field Annealing (MFA) algorithm, proposed for solving combinatorial
optimization problems, carries the characteristics of neural networks and
simulated annealing. In this study, the MFA algorithm is adapted to the mapping
and circuit partitioning problems. Efficient implementation methods that
asymptotically reduce the complexity of the proposed algorithms are also
developed. The performance of the proposed algorithms is evaluated by
comparison with the simulated annealing and Kernighan-Lin algorithms. The
results obtained show that MFA can be used as an alternative algorithm for
solving the mapping and circuit partitioning problems. The proposed MFA
algorithms are parallelized efficiently. The parallel MFA algorithms proposed
for the circuit partitioning problem are implemented on an iPSC/2 hypercube
multicomputer. Experimental results show that the proposed algorithms can be
parallelized efficiently.
Keywords: Mean Field Annealing, Neural Networks, Simulated Annealing,
Combinatorial Optimization, Mapping Problem, Circuit Partitioning Problem,
Parallel Processing, Multicomputers.
ACKNOWLEDGEMENT
I am very grateful to my supervisor Assoc. Prof. Cevdet Aykanat, as he
taught me what research is and always provided motivating support during
this study.
I would also like to express my gratitude to Assoc. Prof. Kemal Oflazer
and Asst. Prof. Ihsan Sabuncuoglu for their remarks and comments on this
thesis.






2.1 Hopfield Neural Networks
2.1.1 Combinatorial Optimization Using Hopfield Neural Networks
2.1.2 Problems of Hopfield Neural Networks
2.2 Simulated Annealing
2.3 Mean Field Annealing
3 MFA FOR THE MAPPING PROBLEM
3.1 The Mapping Problem
3.2 Modeling the Mapping Problem
3.3 Solving the Mapping Problem Using MFA
3.3.1 Formulation
3.3.2 An Efficient Implementation Scheme
3.4 Performance of Mean Field Annealing Algorithm
3.4.1 MFA Implementation
3.4.2 Kernighan-Lin Implementation
3.4.3 Simulated Annealing Implementation
3.4.4 Experimental Results
3.5 Parallelization of Mean Field Annealing Algorithm
4 MFA FOR THE CIRCUIT PARTITIONING PROBLEM
4.1 The Circuit Partitioning Problem
4.2 Modeling the Circuit Partitioning Problem
4.3 Solving the Circuit Partitioning Problem Using MFA
4.3.1 Graph Model
4.3.2 Network Model
4.4 Parallelization of Mean Field Annealing Algorithm
4.4.1 Graph Model
4.4.2 Network Model
5 CONCLUSIONS
List of Figures
2.1 Simulated annealing algorithm.
2.2 Mean field annealing algorithm.
3.1 A mapping problem instance, with (a) TIG, (b) POG (which
represents a 2-dimensional hypercube) and (c) PCG.
3.2 MFA algorithm for the mapping problem.
3.3 Node program for one iteration of the parallel MFA algorithm
for the mapping problem.
4.1 Modeling of a given circuit partitioning problem instance with
(a) graph and (b) network models. Dashed lines indicate an
example partition.
4.2 Two possible solutions for the given circuit partitioning problem
instance.
4.3 Node program for one iteration of the parallel MFA algorithm
for the graph partitioning problem.
4.4 Speed-up (a) and efficiency (b) curves for the graph partitioning
problem.
4.5 Node program for one iteration of the parallel MFA algorithm
for the network partitioning problem.
4.6 Speed-up (a) and efficiency (b) curves for the network partitioning
problem.
List of Tables
3.1 Averages of the total communication costs of the solutions found
by KL-RB, KL-PM, SA and MFA heuristics, for randomly generated
mapping problem instances.
3.2 Averages of the computational loads of the minimum and maximum
loaded processors for the solutions found by KL-RB, KL-PM, SA and
MFA heuristics, for randomly generated mapping problem instances.
3.3 Average execution times (in seconds) of KL-RB, KL-PM, SA
and MFA heuristics, for randomly generated mapping problem
instances.
4.1 Mean cut sizes of the solutions found by MFA, KL, and SA 




1. INTRODUCTION
Some cognitive tasks, such as pattern recognition, associative recall, and
guiding a mechanical hand, are easily handled by biological neural networks,
whereas they remain time-consuming tasks for digital computers. This fact
motivated scientists and opened a research area called Artificial Neural
Networks (ANN). The scope of ANN includes understanding and modeling
biological neural networks, and designing artificial devices that have similar
properties. Research in this area started with the early work of McCulloch and
Pitts (1943), and has continued with varying levels of popularity until today.
From the 1980s onwards, neural network models became the center of extensive
study, and interest in their properties has grown extraordinarily. The reasons
for this increase in popularity are: the better understanding gained of
information processing in nature; increasing computer power, which enables
scientists to make better simulations and analyses of the models; and growing
interest in parallel computation and analog VLSI.
Research on ANN can be divided into two streams: the first deals with
understanding and modeling biological neural networks, and the second
exploits the information gained on biological neural networks to design
artificial devices or algorithms that perform tasks which are difficult for
conventional computers. Until the last few years, work in the second area
concentrated on the learning and classification capability and the associative
memory operation of neural networks. Recent works by Hopfield and Tank
[11, 12, 13, 31] show that solving NP-hard combinatorial optimization problems
is another promising area for ANN. Hopfield and Tank proposed that the
Hopfield-type continuous and deterministic ANN model can be used for solving
combinatorial optimization problems [11]. However, simulations of this model
reveal that it is hard to obtain feasible solutions for large problem sizes.
Many variants of the Hopfield Neural Network (HNN) have been designed
[3, 30, 34] in order to improve the model for obtaining feasible and good
solutions.
Combinatorial optimization problems constitute a large class, which is
encountered in various disciplines. Optimization problems, in general, are
characterized by searching for the best values of given variables to achieve a
goal. In technical terms, the objective is the minimization or maximization of
a function, subject to some other constraint functions. A typical example is
the problem

$$
\begin{array}{lll}
\text{minimize}   & f(x)          & \\
\text{subject to} & g_i(x) \ge 0, & i = 1, \ldots, m \\
                  & h_j(x) = 0,   & j = 1, \ldots, p
\end{array}
\qquad (1.1)
$$

where $f$, $g_i$, $h_j$ are general functions which map $\Re^n$ to $\Re$. The
function $f$ is called the cost function, and the functions $g_i$ and $h_j$ are
called constraint functions. Problems for which the variables of the cost and
constraint functions are discrete are called combinatorial optimization
problems. Some problems in this class cannot be solved in polynomial time with
the known methods. As the problem size increases, the computing time needed to
solve such problems increases exponentially, resulting in intractable
instances. This class of problems, called NP-hard optimization problems, is
solved using heuristics. Heuristics are generally problem-specific,
computationally efficient algorithms. They do not guarantee finding an optimal
solution, but require much less computing time. The drawback of heuristics is
that they usually get stuck in local minima.
In the last decade a powerful method, called Simulated Annealing (SA),
has been developed for solving combinatorial optimization problems [18]. This
method applies a successful statistical method, which is used to estimate the
results of the annealing process in statistical mechanics, to combinatorial
optimization problems. SA is a general method (i.e., it is not problem
specific) which guarantees finding the optimum solution if time is not limited.
However, the time needed for this guarantee is prohibitive, so exact solutions
of NP-hard problems remain intractable. A nice property of simulated annealing
is that it can be used as a heuristic to obtain near-optimal solutions in
limited time, and as the time limit is increased, the quality of the obtained
solutions also increases. SA has the capability of escaping from local minima
if sufficient time is given. This method has been successfully applied to
various NP-hard optimization problems [18, 20, 23].
The subject of this thesis is a recently proposed algorithm, called Mean
Field Annealing (MFA) [22, 33, 34, 35]. MFA was originally proposed for
solving the traveling salesperson problem [33, 34]. It combines the collective
computation property of HNN with the annealing notion of SA. MFA is a general
strategy and can be applied to various problems with suitable formulations.
Work on MFA [4, 5, 21, 22, 34, 35] showed that it can be successfully applied
to combinatorial optimization problems. In this thesis, MFA is formulated for
two well-known, NP-hard, combinatorial optimization problems: the mapping
problem and the circuit partitioning problem.
The mapping problem arises while developing parallel programs for
distributed-memory, message-passing parallel computers (multicomputers). In
order to develop a parallel program for a multicomputer, first the problem is
decomposed into a set of interacting sequential sub-problems (or tasks) that
can be executed in parallel. Then, each one of these tasks is mapped to a
processor of the parallel architecture, in such a way that the total execution
time is minimized. This mapping phase is called the mapping problem [2], and
is known to be NP-hard. In this thesis, MFA is formulated for solving the
mapping problem, and its performance is compared with the performances of
other well-known heuristics.
Partitioning of VLSI circuits is needed in various phases of VLSI design.
Partitioning means dividing the components of a circuit into two or more
evenly weighted partitions, such that the number of signal nets interconnecting
them is minimized. This problem, called the circuit partitioning problem, is
also an NP-hard combinatorial optimization problem. In this work, MFA is also
formulated for solving the circuit partitioning problem, and the performance of
the proposed algorithm is compared with the performances of other well-known
heuristics.
Heuristics used for solving NP-hard combinatorial optimization problems such
as the mapping problem and the circuit partitioning problem are time-consuming
processes, and their parallelization is crucial. There is a large volume of
research on the parallelization of such algorithms. One of the motivations in
this work is to exploit the inherent parallelism in neural networks in order
to obtain efficient parallel algorithms. MFA is a good candidate for efficient
parallelization as it uses the collective computation property of HNN.
In order to develop a parallelization scheme, first the parallel computer
that will be used must be classified. Parallel architectures can be classified
according to their memory organization, the number of instruction streams
supported, and the interconnection topology. Memory organization in parallel
architectures can be divided into two main groups: shared-memory and
distributed-memory architectures. In shared-memory architectures, which are
called multiprocessors, a common memory or a common address space is used by
all processors. On the other hand, in distributed-memory architectures,
processors cannot access a common memory space. Each processor has a local,
isolated memory. Synchronization and coordination among processors, and data
exchange, are achieved by message passing among processors. In this type of
architecture, each processor may be viewed as an individual computer; hence
they are called multicomputers.
Classification according to the interconnection topology determines how to
handle communications among processors. The most commonly used topologies are
mesh, hypercube and ring.
According to the number of instruction streams supported, parallel
architectures can be divided into two groups: SIMD (Single Instruction stream,
Multiple Data stream) and MIMD (Multiple Instruction stream, Multiple Data
stream) architectures. In a SIMD architecture, a central control processor
broadcasts the instruction that will be executed to all processors. Each
processor executes the same instruction using the data in its local memory. In
MIMD architectures, each processor is able to fetch, decode and execute an
instruction by itself, which can be different from the instructions executed by
other processors.
In this work, MFA is parallelized for distributed-memory MIMD multicomputers,
and implemented on a 3-dimensional iPSC/2 hypercube multicomputer.
A $d$-dimensional hypercube consists of $P = 2^d$ processors, with each
processor being directly connected to $d$ other (neighbor) processors [28]. The
processors of the hypercube are labeled with $d$-bit binary numbers, and the
binary label of each processor differs from that of each of its neighbors in
exactly one bit. The parallelization schemes proposed in this work can also be
used for SIMD multicomputers and other interconnection topologies with slight
modifications.
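
The labeling rule just described is easy to make concrete. The following is a minimal sketch, not taken from the thesis: the function names `hypercube_neighbors` and `hamming_distance` are hypothetical, and the code assumes only the stated property that neighboring labels differ in exactly one bit.

```python
def hypercube_neighbors(node: int, d: int) -> list[int]:
    """Return the labels of the d neighbors of `node` in a d-dimensional hypercube."""
    if not 0 <= node < (1 << d):
        raise ValueError("node label must fit in d bits")
    # Flipping bit i of the label gives the neighbor along dimension i.
    return [node ^ (1 << i) for i in range(d)]


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the labels a and b differ."""
    return bin(a ^ b).count("1")
```

For the 3-dimensional case used in this work ($P = 8$), `hypercube_neighbors(5, 3)` lists the three processors whose labels differ from `101` in one bit. The Hamming distance between two labels gives the number of links on a shortest route between the corresponding processors.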
In Chapter 2, HNN and SA are reviewed and a general formulation of MFA
is given. Chapter 3 presents the proposed formulation of MFA for the mapping
problem. Efficient implementation and parallelization of the proposed MFA
algorithm are also addressed in this chapter. In Chapter 4, MFA is formulated
for solving the circuit partitioning problem. Chapter 4 also presents efficient
implementation and parallelization of the proposed algorithm. In Chapters 3
and 4, the performance of the proposed MFA algorithms is evaluated in
comparison with two well-known heuristics: simulated annealing and
Kernighan-Lin. In Chapter 5, conclusions are stated.
2. THEORY
This chapter reviews previous work on Hopfield Neural Networks (HNN) and
Simulated Annealing (SA) to give a better understanding of Mean Field
Annealing (MFA). In Section 2.1, the neural network models proposed by Hopfield
are briefly discussed, and the application of HNN to combinatorial optimization
problems is described. A summary of later work on HNN is also presented
at the end of Section 2.1. Section 2.2 gives the general properties of
simulated annealing and describes its application to combinatorial optimization
problems. In Section 2.3, the MFA algorithm is described, noting its
similarities with the two previously mentioned methods.
2.1 Hopfield Neural Networks
One of the main reasons for the growing interest in neural networks in the
last decade is the Artificial Neural Network (ANN) model proposed by
Hopfield [9]. Many ideas used in this model have precursors spread over the
fifty years of research on neural networks. The importance of the work by
Hopfield is that it brings them all together, using a physical analogy and a
clear mathematical analysis, and gives a good view of the possible capabilities
of the proposed model. Later, Hopfield proposed another model [10] that has
the same properties as the original model, and looks very promising for VLSI
implementations.
The original model [9] is a discrete, stochastic model, which uses two-state
neurons with a stochastic updating algorithm. The continuous and deterministic
model, which was proposed later [10], uses neurons with graded response, and
the time evolution of the state of the system (change in the states of the
neurons) is described by a differential equation. In these two models, an
energy function, which always decreases as the system iterates, is defined. In
his two consecutive papers [9, 10], Hopfield presented his ANN models as a
Content Addressable Memory (CAM) in order to explain their properties. In the
CAM model, minima of the energy function correspond to the stored words.
Starting from a given initial state, the system is expected to reach one of
these minima, which means to output one of the stored words in the CAM. The CAM
model of Hopfield can be regarded as an optimizing network: given an input,
find the stored item which is closest to the given input. In his later works
with Tank [11, 31] it is shown that well-known combinatorial optimization
problems, such as the traveling salesperson problem, can also be solved by HNN.
2.1.1 Combinatorial Optimization Using Hopfield Neural Networks
Hopfield and Tank showed that the continuous and deterministic HNN has
collective computational properties [11, 12, 13]. In collective computation,
the decisions taken to solve the problem are not determined by a single unit;
instead, responsibility is distributed over a large number of simple, massively
connected units. The nature of collective computation suggests that it might be
particularly effective for problems that involve global interaction among
different parts of the problem. NP-hard optimization problems are such
problems. HNN can be used for solving a combinatorial optimization problem by
choosing a representation scheme in which the output states of neurons can be
decoded as a solution to the target problem. Then, HNN is constructed
accordingly by choosing an energy function whose global minimum value
corresponds to the best solution of the problem to be solved [11]. Hence, the
constructed HNN is expected to compute the best solution to the target problem
starting from a randomly chosen initial state by minimizing its energy
function. The general form of such an energy function (also called the
Hamiltonian of the system) is

$$ H = \text{cost} + \text{global constraint} \qquad (2.1) $$

Here, the cost term represents the cost function of the optimization problem to
be solved, and the global constraint term represents the constraint functions
introduced to obtain feasible solutions. The exact solution of the problem
corresponds to the global minimum of this energy function.
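
To make the "cost plus global constraint" form of Eq. (2.1) concrete, here is a minimal sketch for a hypothetical two-way graph partitioning instance. It is an illustration only and not the formulation developed later in this thesis: the function name `bisection_energy`, the squared-imbalance constraint term and the `penalty` weight are all assumptions chosen for simplicity.

```python
def bisection_energy(s, edges, penalty):
    """Energy = cost + global constraint for a two-way graph partition.

    s       : list of 0/1 values, s[i] is the part assigned to vertex i
    edges   : list of (i, j, weight) tuples
    penalty : weight of the balance (constraint) term; an assumed parameter
    """
    # Cost term: total weight of edges whose endpoints lie in different parts.
    cost = sum(w for i, j, w in edges if s[i] != s[j])
    # Constraint term: zero only when both parts contain the same number of vertices.
    ones = sum(s)
    imbalance = ones - (len(s) - ones)
    return cost + penalty * imbalance ** 2
```

With `penalty = 0` the trivial assignment that puts every vertex in one part has zero energy, which is why a constraint term is needed. Choosing the relative weight of the two terms is exactly the parameter-balancing difficulty discussed in Section 2.1.2.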
The motivation behind the work of Hopfield and Tank is to use hardware
implementations of HNN to solve large optimization problems. It is common
practice to simulate a model on computers before implementing it in hardware,
in order to observe and solve possible problems. In order to simulate HNN on
a computer, first the equations of motion for the neural network are written
from the state equations of the neurons. Then, these equations are solved for
each neuron iteratively using a numerical method (usually Euler's method is
used to solve the resulting differential equations). The state of each neuron
is computed at discrete time intervals until a stable state is found.
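
The simulation procedure described above can be sketched as follows. The equations of motion are not written out in this section, so the sketch assumes the commonly cited continuous Hopfield dynamics $du_i/dt = -u_i/\tau + \sum_j T_{ij} V_j + I_i$ with $V_i = \frac{1}{2}(1 + \tanh(u_i/u_0))$. The function name `simulate_hnn`, the parameter names and the stopping rule (all derivatives below a tolerance) are assumptions made for illustration.

```python
import math


def simulate_hnn(T, I, u, tau=1.0, u0=0.5, dt=0.01, tol=1e-6, max_steps=100_000):
    """Integrate the assumed Hopfield dynamics with Euler's method.

    T : connection weights (list of lists), I : external inputs,
    u : initial internal states. Returns the neuron outputs V.
    """
    n = len(u)
    u = list(u)
    for _ in range(max_steps):
        # Graded response of each neuron, between 0 and 1.
        V = [0.5 * (1.0 + math.tanh(ui / u0)) for ui in u]
        # Right-hand side of the equations of motion.
        du = [-u[i] / tau + sum(T[i][j] * V[j] for j in range(n)) + I[i]
              for i in range(n)]
        # One Euler step of size dt.
        u = [u[i] + dt * du[i] for i in range(n)]
        # Treat the state as stable once all derivatives are negligible.
        if max(abs(x) for x in du) < tol:
            break
    return [0.5 * (1.0 + math.tanh(ui / u0)) for ui in u]
```

As a usage example, two mutually inhibiting neurons (`T = [[0, -2], [-2, 0]]`, `I = [1, 1]`) started from `u = [0.1, -0.1]` settle into a state where the first output is close to 1 and the second close to 0.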
2.1.2 Problems of Hopfield Neural Networks
HNN have been applied to various optimization problems, and reasonable
results have been obtained for small problems. However, simulations of this
network reveal that it is hard to obtain feasible solutions for large
problem sizes. Wilson and Pawley report that most of the simulation results
give infeasible tours even for a 10-city traveling salesperson problem [36]. In
fact, it is possible to obtain feasible tours by adjusting the parameters of
the energy function (i.e., increasing the weights of the terms regarding
feasibility), but the quality of the solutions deteriorates with such attempts.
As is also indicated in [14], the problem of finding a balance among the
parameters of the energy function, in order to obtain feasible and good
solutions, becomes harder as the problem size increases. Hence, the algorithm
does not have a good scaling property, which is a very important performance
criterion for heuristic optimization algorithms. Many attempts have been made
to improve the performance of the Hopfield neural network for obtaining
feasible and good solutions. In one of them [3], the number of terms in the
energy function is decreased to increase the scalability of the algorithm. But
also for that model, an increase in the size of the problem causes the costs of
the solutions to increase significantly. Works by Szu [30] and Toomarian [32]
are also modifications to HNN in which different energy functions are proposed.
Recently, MFA has been proposed as a successful alternative to HNN [22, 33, 34].
The MFA algorithm combines the collective computation property of HNN and the
annealing notion of SA.
2.2 Simulated Annealing
SA is a powerful method which is used for solving hard optimization problems.
In SA, an energy function that corresponds to the cost function of the problem
to be solved is defined, similar to the energy function defined for HNN. SA is
a probabilistic hill-climbing method, which accepts uphill moves with some
probability in order to escape from local minima. SA is derived by analogy to a
successful statistical model of thermodynamic processes for growing crystals.
The configuration of a solid-state material at a global energy minimum is a
perfectly homogeneous crystal lattice. It is known from experience that such
configurations can be achieved using the process of annealing [20]. The
solid-state material is heated to a high temperature until it reaches an
amorphous liquid state. Then it is cooled slowly, according to a specific
annealing schedule. If the initial temperature is sufficiently high to ensure a
random state, and if the cooling schedule is sufficiently slow to guarantee
that equilibrium is reached at each temperature, the final configuration of the
material will be close to the perfect crystal with global energy minimum [20].
In thermodynamics, it is stated that, when thermal equilibrium at temperature
$T$ is reached, a state with energy $E$ is attained with the Boltzmann
probability

$$ \frac{1}{Z(T)} \, e^{-E / (k_B T)} \qquad (2.2) $$

where $Z(T)$ is a normalization factor and $k_B$ is the Boltzmann constant [20].
There is a fine theoretical model which explains this physical phenomenon.
During the annealing process the states of the atoms are perturbed by small
random changes. If the change in state lowers the energy of the system, it is
always accepted. If not, the change in configuration is accepted with
probability $e^{-\Delta E / (k_B T)}$. The probability of accepting
perturbations that increase the energy decreases with decreasing temperature,
and only minor modifications occur at lower temperatures. Experiments show that
this model gives results similar to the physical annealing process [20].

1. Get an initial configuration $C$
2. Get initial temperature $T_0$, and set $T = T_0$
3. While not yet frozen DO
   3.1 While equilibrium at $T$ is not yet reached DO
       3.1.1 Generate a random neighbor $C'$ of $C$
       3.1.2 Let $\Delta E = E(C') - E(C)$
       3.1.3 If $\Delta E \le 0$ (downhill move), set $C = C'$
       3.1.4 If $\Delta E > 0$ (uphill move), set $C = C'$ with
             probability $e^{-\Delta E / T}$
   3.2 Update $T$ according to the cooling schedule
Figure 2.1. Simulated annealing algorithm.
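
The steps of the simulated annealing algorithm in Figure 2.1 can be sketched in a few lines. This is a minimal illustration rather than the implementation evaluated later in this thesis. The "frozen" and "equilibrium" tests are left open by the figure, so the sketch assumes a geometric cooling schedule (`alpha`), a minimum temperature (`t_min`) and a fixed number of moves per temperature (`moves_per_temp`); these names and default values are hypothetical.

```python
import math
import random


def simulated_annealing(initial, energy, neighbor, t0=10.0, t_min=1e-3,
                        alpha=0.9, moves_per_temp=100, seed=0):
    """Return (best configuration, best energy) found by simulated annealing."""
    rng = random.Random(seed)
    current, e_current = initial, energy(initial)        # step 1
    best, e_best = current, e_current
    t = t0                                               # step 2
    while t > t_min:                                     # step 3: not yet frozen
        for _ in range(moves_per_temp):                  # 3.1: stand-in for equilibrium
            candidate = neighbor(current, rng)           # 3.1.1
            e_candidate = energy(candidate)
            delta = e_candidate - e_current              # 3.1.2
            # 3.1.3 / 3.1.4: always accept downhill moves,
            # accept uphill moves with probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, e_current = candidate, e_candidate
                if e_current < e_best:
                    best, e_best = current, e_current
        t *= alpha                                       # 3.2: geometric cooling
    return best, e_best
```

A toy usage example: minimizing $(x - 3)^2$ over the integers, where a neighbor of $x$ is $x \pm 1$, can be run as `simulated_annealing(20, lambda x: (x - 3) ** 2, lambda x, rng: x + rng.choice((-1, 1)))`. Keeping track of the best configuration seen is a common practical addition and is not part of Figure 2.1.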
Kirkpatrick applied this model to optimization problems and called the
resulting method SA. In transforming the physical model to a computational
model, the energy function is replaced with the cost function of the
optimization problem to be solved (note the similarity with HNN), and the
states of the matter are replaced with the legal configurations of the problem
instance. The annealing schedule is controlled with a simulated temperature.
Figure 2.1 illustrates the SA algorithm.
Although SA is a powerful method, it has some problems. It requires a large
amount of computing power because of the need to generate a large number
of configurations, and to cool very slowly in order to reach equilibrium at
each temperature. The performance of the algorithm is closely related to the
generation of neighboring configurations. It is an inherently sequential
algorithm which does not give good performance on parallel computers. It is
hard to obtain cooling schedules that yield good solutions in a small amount of
computer time.
2.3 Mean Field Annealing
MFA merges the collective computation and annealing properties of the two
methods described above, to obtain a general algorithm for solving
combinatorial optimization problems. A problem is mapped to MFA in the same way
as to HNN: a neuron matrix is formed such that, when the neurons take their
final values, they represent a configuration in the solution space of the
problem.
The mathematical analysis of MFA is done by analogy to the Ising spin model,
which is used to estimate the state of a system of particles or spins in
thermal equilibrium. Spins in the MFA algorithm are analogous to the neurons of
HNN. This method was first proposed for solving the traveling salesperson
problem [33], and was then applied to the graph partitioning problem
[4, 5, 21, 35]. Here, the general formulation of the MFA algorithm [35] is
given for the sake of completeness. In the Ising spin model, the energy of a
system with $S$ spins has the following form:

$$ H(\mathbf{s}) = \frac{1}{2} \sum_{k=1}^{S} \sum_{l \ne k} \beta_{kl} s_k s_l + \sum_{k=1}^{S} h_k s_k \qquad (2.3) $$

Here, $\beta_{kl}$ indicates the level of interaction between spins $k$ and
$l$, and $s_k \in \{0, 1\}$ is the value of spin $k$. It is assumed that
$\beta_{kl} = \beta_{lk}$ and $\beta_{kk} = 0$ for $1 \le k, l \le S$.
At thermal equilibrium, the spin average $\langle s_k \rangle$ of spin $k$ can
be calculated using the Boltzmann distribution as follows:

$$ \langle s_k \rangle = \frac{1}{1 + e^{-\phi_k / T}} \qquad (2.4) $$

Here, $\phi_k$ represents the mean field acting on spin $k$, which can be
computed using

$$ \phi_k = -\frac{\partial \langle H(\mathbf{s}) \rangle}{\partial \langle s_k \rangle} \qquad (2.5) $$

where the energy average $\langle H(\mathbf{s}) \rangle$ of the system is

$$ \langle H(\mathbf{s}) \rangle = \frac{1}{2} \sum_{k=1}^{S} \sum_{l \ne k} \beta_{kl} \langle s_k s_l \rangle + \sum_{k=1}^{S} h_k \langle s_k \rangle \qquad (2.6) $$
1. Get initial temperature $T_0$, and set $T = T_0$
2. Initialize the spin averages
   $\langle \mathbf{s} \rangle = [\langle s_1 \rangle, \ldots, \langle s_k \rangle, \ldots, \langle s_S \rangle]$
3. While temperature $T$ is in the cooling range DO
   3.1 While system is not stabilized for current temperature DO
       3.1.1 Select a spin $k$ at random.
       3.1.2 Compute $\phi_k$ using
             $\phi_k = -\sum_{l \ne k} \beta_{kl} \langle s_l \rangle - h_k$
       3.1.3 Update $\langle s_k \rangle$ using
             $\langle s_k \rangle = [1 + e^{-\phi_k / T}]^{-1}$
   3.2 Update $T$ according to the cooling schedule
Figure 2.2. Mean field annealing algorithm.
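The complexity of computing $\phi_k$ using Eq. (2.5) and Eq. (2.6) is
exponential [35]. However, for a large number of spins, the mean field
approximation can be used to compute the energy average as

$$ \langle H(\mathbf{s}) \rangle = \frac{1}{2} \sum_{k=1}^{S} \sum_{l \ne k} \beta_{kl} \langle s_k \rangle \langle s_l \rangle + \sum_{k=1}^{S} h_k \langle s_k \rangle \qquad (2.7) $$

Since $\langle H(\mathbf{s}) \rangle$ is linear in $\langle s_k \rangle$, the
mean field $\phi_k$ can be computed using the following equation:

$$ \phi_k = -\frac{\partial \langle H(\mathbf{s}) \rangle}{\partial \langle s_k \rangle} = -\Big( \sum_{l \ne k} \beta_{kl} \langle s_l \rangle + h_k \Big) \qquad (2.8) $$

Thus, the complexity of computing $\phi_k$ reduces to $O(S)$.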
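
The step from the mean field approximation of the energy average to the closed form of $\phi_k$ can be checked directly. The short derivation below is a worked sketch that uses only the assumptions stated above, namely $\beta_{kl} = \beta_{lk}$ and $\beta_{kk} = 0$; the summation indices $i$ and $j$ are introduced here only to keep them distinct from the fixed index $k$.

```latex
\begin{align*}
\frac{\partial \langle H(\mathbf{s}) \rangle}{\partial \langle s_k \rangle}
  &= \frac{\partial}{\partial \langle s_k \rangle}
     \left[ \frac{1}{2} \sum_{i=1}^{S} \sum_{j \ne i}
            \beta_{ij} \langle s_i \rangle \langle s_j \rangle
          + \sum_{i=1}^{S} h_i \langle s_i \rangle \right] \\
  &= \frac{1}{2} \sum_{l \ne k} \beta_{kl} \langle s_l \rangle
   + \frac{1}{2} \sum_{l \ne k} \beta_{lk} \langle s_l \rangle + h_k \\
  &= \sum_{l \ne k} \beta_{kl} \langle s_l \rangle + h_k
\end{align*}
```

The two half-sums come from the terms with $i = k$ and with $j = k$, and they merge because $\beta$ is symmetric. Negating the result gives $\phi_k$, which is a single sum over the other $S - 1$ spins, hence the $O(S)$ cost per update.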
At each temperature, starting with the initial spin averages, the mean field
acting on a randomly selected spin is found using Eq. (2.8). Then, the spin
average is updated using Eq. (2.4). This process is repeated for a random
sequence of spins until the system is stabilized for the current temperature.
The general form of the Mean Field Annealing algorithm derived from this
iterative relaxation scheme is shown in Figure 2.2. The MFA algorithm tries to
find the equilibrium point of a system of $S$ spins using an annealing process
similar to SA.
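
The relaxation scheme of Figure 2.2 can be sketched as follows. This is a minimal illustration and not the implementation used in the later chapters. The figure leaves the cooling range and the stabilization test open, so the sketch assumes geometric cooling (`alpha`, `t_min`) and declares the system stabilized after `2 * S` consecutive updates that each change a spin average by less than `eps`; these names, the default values and the initialization near 0.5 are hypothetical choices.

```python
import math
import random


def sigmoid(x):
    """Numerically safe logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mean_field_annealing(beta, h, t0=5.0, t_min=0.05, alpha=0.9,
                         eps=1e-4, max_updates=10_000, seed=0):
    """Return the final spin averages for interaction matrix beta and fields h."""
    rng = random.Random(seed)
    S = len(h)
    # Step 2: start near 0.5 with a small random perturbation (assumed choice).
    s = [0.5 + 0.01 * (rng.random() - 0.5) for _ in range(S)]
    t = t0                                               # step 1
    while t > t_min:                                     # step 3: cooling range
        quiet = 0  # consecutive updates with a change smaller than eps
        for _ in range(max_updates):                     # 3.1: until stabilized
            k = rng.randrange(S)                         # 3.1.1
            # 3.1.2: mean field acting on spin k, an O(S) sum.
            phi = -sum(beta[k][l] * s[l] for l in range(S) if l != k) - h[k]
            new = sigmoid(phi / t)                       # 3.1.3
            quiet = quiet + 1 if abs(new - s[k]) < eps else 0
            s[k] = new
            if quiet >= 2 * S:
                break
        t *= alpha                                       # 3.2: geometric cooling
    return s
```

As a small usage example, take two spins with `beta = [[0, 2], [2, 0]]` and `h = [-1.2, -0.8]`, so that the energy is $2 s_0 s_1 - 1.2 s_0 - 0.8 s_1$ and the minimum-energy state is $(s_0, s_1) = (1, 0)$. The spin averages returned by `mean_field_annealing(beta, h)` approach 1 and 0 respectively, and rounding them decodes that state.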
The state equations used in MFA are isomorphic to the state equations of
the neurons in the HNN. A synchronous version of MFA, different from the
algorithm given in Figure 2.2, can be derived by solving $N$ difference
equations for $N$ spin values simultaneously. This technique is identical to
the simulations of HNN done using numerical methods. Thus, the evolution of a
solution in an HNN is equivalent to the relaxation toward an equilibrium state
effected by the MFA algorithm at a fixed temperature [35]. Hence MFA can be
viewed as an annealed neural network derived from HNN.
HNN and SA methods have a major difference: SA is an algorithm implemented
in software, whereas HNN is derived with a possible hardware implementation
in mind. MFA is somewhere in between: it is an algorithm implemented in
software, having potential for hardware realization [34, 35]. In this work,
MFA is treated as a software algorithm, as is SA. The results obtained are
comparable to those of other software algorithms, confirming this point of view.
3. MFA FOR THE MAPPING PROBLEM
In this chapter, Mean Field Annealing (MFA) is formulated for the mapping
problem. In Section 3.1, the mapping problem is described and previous
approaches used for solving the mapping problem are summarized. Section 3.2
presents a formal definition of the mapping problem by modeling the parallel
program design process. Section 3.3 presents the proposed formulation
of the MFA algorithm for the mapping problem. An efficient implementation
scheme for the proposed algorithm is also described in Section 3.3.2. Section 3.4
presents the performance evaluation of the MFA algorithm for the mapping
problem in comparison with two well-known mapping heuristics: simulated
annealing and Kernighan-Lin. Finally, efficient parallelization of the MFA
algorithm for the mapping problem is proposed in Section 3.5.
3.1 The Mapping Problem
Today, with the aid of VLSI technology, parallel computers not only exist in
research laboratories, but are also available on the market as powerful,
general purpose computers. The use of parallel computers in various applications
makes the problem of mapping parallel programs to parallel computers more
crucial. The mapping problem arises while developing parallel programs for
distributed-memory, message-passing parallel computers (multicomputers). In
multicomputers, processors have neither shared memory nor a shared address
space. Each processor can only access its local memory. Synchronization
and coordination among processors are achieved through explicit message
passing. Processors of a multicomputer are usually connected by utilizing one of
the well-known direct interconnection network topologies such as ring, mesh, 
hypercube, etc. These architectures have the nice scalability feature due to the 
lack of shared resources and the increasing bandwidth with increasing number 
of processors.
However, designing efficient parallel algorithms for such architectures is not
straightforward. An efficient parallel algorithm should exploit the full potential
power of the architecture. Processor idle time and the interprocessor
communication overhead may lead to poor utilization of the architecture and hence
poor overall system performance. Processor idle time arises due to the uneven
load balance in the distribution of the computational load among processors
of the multicomputer. Parallel algorithm design for multicomputers can be
divided into two phases. The first phase is the decomposition of the problem into a
set of interacting sequential sub-problems (or tasks) which can be executed in
parallel. The second phase is mapping each one of these tasks to a processor of the
parallel architecture in such a way that the total execution time is minimized.
This mapping phase, named the mapping problem [2], is crucial in
designing efficient parallel programs.
For a class of regular problems with regular interaction patterns, the mapping
problem can be efficiently resolved by the judicious choice of the
decomposition scheme. In such problems, the chosen decomposition scheme yields
an interaction topology that can be directly embedded into the interconnection
network topology of the multicomputer. Such approaches can be referred to as
intuitive approaches. However, intuitive mapping approaches yield good results
only for a restricted class of problems, under simplifying assumptions. The
mapping problem is known to be NP-hard [15, 16]. Hence, heuristics giving
sub-optimal solutions are used to solve the problem [1, 2, 6, 15, 16, 26]. Two
distinct approaches have been considered in the context of mapping heuristics:
one phase approaches and two phase approaches [6]. One phase approaches,
referred to as many-to-one mapping, try to map tasks of the parallel program
directly onto the processors of the multicomputer. In two phase approaches,
a clustering phase is followed by a one-to-one mapping phase. In the clustering
phase, tasks of the parallel program are partitioned into as many equal weighted
clusters as the number of processors of the multicomputer, while minimizing
the total weight of the inter-cluster interactions [26]. In the one-to-one mapping
phase, each cluster is assigned to an individual processor of the multicomputer
such that total inter-processor communication is minimized [26].
In two phase approaches, the problem solved in the clustering phase is
identical to the multi-way graph partitioning problem. Graph partitioning is
the balanced partitioning of the vertices of a graph into a number of bins, such
that the total cost of the edges in the edge cut set is minimized. The
Kernighan-Lin (KL) heuristic [7, 17] is an efficient heuristic, originally proposed
for the graph bipartitioning problem, which can also be used for clustering [6, 26].
The KL heuristic is a non-greedy, iterative improvement technique that can escape
from local minima by testing the gains of a sequence of moves in the search
space before performing them. A variant of the KL heuristic can be used for
solving the one-to-one mapping problem encountered in the second phase [6].
Simulated Annealing (SA) can also be used as a one phase heuristic for
solving the many-to-one mapping problem [23, 29]. Successful applications of SA
to the mapping problem have been reported in various works [23, 29]. It has been
observed that the quality of the solutions obtained using SA is superior to
the results of the other heuristics.
Heuristics proposed to solve the mapping problem are compute intensive
algorithms. Solving the mapping problem can be thought of as a preprocessing
step done before the execution of the parallel program on the parallel computer.
If the mapping heuristic is executed sequentially, the execution time of this
preprocessing can be included in the serial portion of the parallel program,
which limits the efficiency that can be attained. In some cases, the sequential
overhead caused by this preprocessing is not acceptable, and the need for the
parallelization of the preprocessing arises. Efficient parallel mapping heuristics
are needed in such cases. KL and SA heuristics are inherently sequential, hence
hard to parallelize. Efficient parallelization of these algorithms remains an
important issue in parallel processing research.
In this chapter, Mean Field Annealing (MFA) is formulated for the
many-to-one mapping problem. MFA has the inherent parallelism that exists in most
of the neural network algorithms, which makes this algorithm a good candidate
for parallel mapping heuristics.
3.2 Modeling the Mapping Problem
Parallel program design phases are elaborated in this section in order to present
a formal definition of the mapping problem. In the first phase of parallel
algorithm design, the problem is decomposed into a set of atomic tasks, such that
the overall problem is modeled as a set of interacting tasks. Each atomic task
is a sequential process to be executed by an individual processor of the parallel
architecture. Selection of the decomposition scheme depends on the problem,
the algorithm to be used for the solution, and the architectural features of the
target multicomputer.
In various classes of problems, the interaction pattern among the tasks is static.
Hence, the decomposition of the algorithm can be represented by a static task
graph. Vertices of this graph represent the atomic tasks and the edge set
represents the interaction pattern among the tasks. Relative computational
costs of atomic tasks can be known or estimated prior to the execution of the
parallel program. Hence, weights can be associated with the vertices to denote
the computational costs of the corresponding tasks.
There are two different models used for modeling static inter-task
communication patterns. These two models are referred to as the Task Precedence
Graph (TPG) model and the Task Interaction Graph (TIG) model [16, 25]. TPG is a
directed graph where directed edges represent execution dependencies. In this
model, a pair of tasks connected by an edge cannot be executed independently.
Each edge denotes a pair of tasks: a source task and a destination task. The
destination task can only be executed after the completion of the execution of the
source task. Hence, in general, only the subsets of tasks which are unreachable
from each other in the TPG can be executed independently.
In TIG, the set of interaction patterns are represented by undirected edges
among vertices. In this model, each atomic task can be executed simultaneously
and independently. Each edge denotes the need for the bidirectional interaction
between the corresponding pair of tasks at the completion of the execution of
these tasks. Edges may be associated with weights which denote the amount
of bidirectional information exchange involved between pairs of tasks. TIG
usually represents the repeated execution of the tasks with intervening
inter-task interactions denoted by the edges.
The TIG model may seem to be unrealistic for general applications since it
does not consider the temporal interaction dependencies among the tasks [25].
However, there are various classes of problems which can be successfully
modeled with the TIG model. For example, iterative solution of systems of
equations, and problems arising in image processing and computer graphics
applications can be represented by TIG. In this work, mapping of problems which
can be represented by the TIG model is addressed.
The second phase of the parallel algorithm design is the assignment of the
individual tasks to the processors of the parallel architecture, so that the execution
time of the parallel program is minimized. This problem is referred to as the
mapping problem. In order to solve the mapping problem, the parallel architecture
must also be modeled in a way that represents its architectural features.
Parallel architectures can easily be represented by a Processor Organization
Graph (POG), where nodes represent the processors and edges represent the
communication links. In fact, POG is a graphical representation of the
interconnection topology utilized for the organization of the processors of the
parallel architecture. In general, nodes and edges of a POG are not associated
with weights, since most of the commercially available multicomputer
architectures are homogeneous with identical processors and communication links.
In a multicomputer architecture, each adjacent pair of processors communicate
with each other over the communication link connecting them. Such
communications are referred to as single-hop communications. However, each
non-adjacent pair of processors can also communicate with each other via software
or hardware routing. Such communications are referred to as multi-hop
communications. Multi-hop communications are usually routed in a static manner
over the shortest path of links between the communicating pairs of processors.
Communications between non-adjacent pairs of processors can be associated
with relative unit communication costs. Unit communication cost is defined
as the communication cost per unit of information. Unit communication cost
between a pair of processors will be a function of the shortest path between
these processors and the routing scheme used for multi-hop communications.
For example, intermediate processors in the communication path are
interrupted in software routing so that each multi-hop communication is realized as
a sequence of single-hop messages. Hence, in software routing, the unit
communication cost is linearly proportional to the shortest path distance between the
pair of communicating processors. Note that, in this communication model,
unit communication costs between adjacent pairs of processors are taken to be
unity.
Hence, the communication topology of the multicomputer can be modeled
by an undirected complete graph, referred to here as the Processor
Communication Graph (PCG). The nodes of the PCG represent the processors and
the weights associated with the edges represent the unit communication costs
between pairs of processors. As mentioned earlier, PCG can easily be
constructed using the topological properties of the POG and the routing scheme
utilized for inter-processor communication. In the PCG, edges between pairs
of nodes representing the adjacent pairs of processors denote physical links,
whereas edges between pairs of nodes representing non-adjacent pairs of
processors denote virtual communication links (i.e. communication paths)
established for routing multi-hop communications.
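
As an illustration of how a PCG can be derived from a POG under software routing, the sketch below computes hop counts with a breadth-first search from every processor and uses them as unit communication costs. The function name and the adjacency-list input format are assumptions made for this example, processors are indexed from 0 rather than from 1, and the POG is assumed to be connected.

```python
from collections import deque

def pcg_from_pog(pog):
    """Return the K x K matrix of unit communication costs (hop counts)
    for a POG given as {processor: [neighbouring processors]}."""
    k = len(pog)
    d = [[0] * k for _ in range(k)]
    for src in range(k):
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # breadth-first search from src
            u = queue.popleft()
            for v in pog[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1  # one more single-hop message
                    queue.append(v)
        for dst, hops in dist.items():
            d[src][dst] = hops
    return d

# 2-dimensional hypercube: processors 0..3 carry the binary labels
# 00, 01, 10, 11 and are linked when their labels differ in one bit.
hypercube_2d = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
pcg = pcg_from_pog(hypercube_2d)
```

Adjacent processors get cost 1 and the two diagonal pairs of the 2-dimensional hypercube get cost 2, which matches the linear cost model described above for software routing.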
The objective in mapping TIG to PCG is the minimization of the expected
execution time of the parallel program on the target architecture, which are
represented by the TIG and the PCG respectively. Thus, the mapping problem can
be modeled as an optimization problem by associating the following quality
measures with a good mapping:
• Interprocessor communication overhead should be minimized. Tasks
  which have high interaction, i.e., large amount of data exchange, should
  be in the same processor or nearby processors.
• Computational load should be uniformly distributed among processors.
  Computational load assigned to each processor should be as equal as
  possible in order to minimize processor idle time.
The parallel execution time is expected to decrease as these criteria are satis­
fied.
A mapping problem instance can be formally defined as follows. An instance
of the mapping problem includes two undirected graphs, the Task Interaction
Graph (TIG) and the Processor Communication Graph (PCG). The TIG
$G_T(V,E)$ has $|V| = N$ vertices labeled as $(1, 2, \ldots, i, j, \ldots, N)$. Vertices of
the TIG represent the atomic tasks of the parallel program. Vertex weight $w_i$
denotes the computational cost associated with task $i$ for $1 \le i \le N$. Edge
weight $e_{ij}$ denotes the volume of interaction between tasks $i$ and $j$ connected by
edge $(i,j) \in E$. The PCG $G_P(P,D)$ is a complete graph with $|P| = K$ nodes
and $|D| = \binom{K}{2}$ edges. Nodes of the PCG, labeled as $(1, 2, \ldots, p, q, \ldots, K)$,
represent the processors of the target multicomputer. Edge weights $d_{pq}$, for
$1 \le p, q \le K$ and $p \neq q$, denote the unit communication cost between
processors $p$ and $q$.
Given an instance of the mapping problem with TIG $G_T(V,E)$ and PCG
$G_P(P,D)$, the question is to find a many-to-one mapping function $M : V \to P$,
which assigns each vertex of the graph $G_T$ to a unique node of the graph $G_P$ and
minimizes the total interprocessor communication cost ($CC$)
$$ CC = \sum_{(i,j) \in E,\; M(i) \neq M(j)} e_{ij}\, d_{M(i)M(j)} \tag{3.1} $$
while having the computational load ($CL_p$ = computational load of processor $p$)
$$ CL_p = \sum_{i \in V,\; M(i) = p} w_i, \qquad 1 \le p \le K \tag{3.2} $$
of each processor balanced. Here, $M(i) = p$ denotes the label ($p$) of the
processor that task $i$ is mapped to. In Eq. (3.1), each edge $(i,j)$ of the TIG
contributes to the communication cost ($CC$) only if vertices $i$ and $j$ are mapped
to two different nodes of the PCG, i.e., $M(i) \neq M(j)$. The amount of
contribution is equal to the product of the volume of interaction $e_{ij}$ between these
two tasks and the unit communication cost $d_{pq}$ between processors $p$ and $q$,
where $p = M(i)$ and $q = M(j)$. The computational load of a processor is the
summation of the weights of the tasks assigned to that processor. Perfect load
balance is achieved if $CL_p = (\sum_{i=1}^{N} w_i)/K$ for each $p$, $1 \le p \le K$. Balancing of the
computational loads of the processors can be explicitly included in the cost
function using a term which is minimized when the loads of the processors are
equal. Another scheme is to include balancing criteria implicitly in the
algorithm. Figure 3.1 illustrates a sample mapping problem instance with $N = 8$
tasks to be mapped onto $K = 4$ processors. Figure 3.1(a) shows the TIG with
$N = 8$ tasks. Figure 3.1(b) shows the POG for a 2-dimensional hypercube,
and Figure 3.1(c) shows the corresponding PCG. In Figure 3.1, numbers inside
the circles denote the vertex labels, and numbers within the parentheses denote
the vertex or edge weights. Binary labeling of the 2-dimensional hypercube is
also given in Figure 3.1(b). Note that unit communication cost assignment to
edges is performed assuming software routing protocol for multi-hop
communications. A solution to the mapping problem instance shown in Figure 3.1
is
    i       1  2  3  4  5  6  7  8
    M(i)    1  1  4  3  2  4  2  3
Communication cost of this solution can be calculated as
$$ CC = \sum_{(i,j) \in E,\; M(i) \neq M(j)} e_{ij}\, d_{M(i)M(j)} = 8 $$
Computational loads of the processors are $CL_p = 3$ for $1 \le p \le 4$. Hence, perfect
load balance is achieved since $(\sum_{i=1}^{8} w_i)/4 = 3$.
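
The two quality measures of Eq. (3.1) and Eq. (3.2) are straightforward to evaluate for a given mapping. The sketch below does so for a small hypothetical instance; it is not the instance of Figure 3.1, whose edge weights are not reproduced in the text, and the function names and dictionary-based data layout are assumptions made for illustration.

```python
def communication_cost(edges, mapping, d):
    """Eq. (3.1): sum of e_ij * d[M(i)][M(j)] over TIG edges cut by M."""
    return sum(e * d[mapping[i]][mapping[j]]
               for (i, j), e in edges.items()
               if mapping[i] != mapping[j])

def computational_loads(weights, mapping, k):
    """Eq. (3.2): total task weight assigned to each processor 1..K."""
    loads = {p: 0 for p in range(1, k + 1)}
    for task, w in weights.items():
        loads[mapping[task]] += w
    return loads

# Hypothetical instance with N = 4 tasks and K = 2 processors.
weights = {1: 2, 2: 1, 3: 1, 4: 2}                      # w_i
edges = {(1, 2): 3, (2, 3): 1, (3, 4): 2, (1, 4): 1}    # e_ij, one entry per edge
d = {1: {1: 0, 2: 1}, 2: {1: 1, 2: 0}}                  # d_pq
mapping = {1: 1, 2: 1, 3: 2, 4: 2}                      # M(i)

cc = communication_cost(edges, mapping, d)
loads = computational_loads(weights, mapping, 2)
```

Here only the edges (2,3) and (1,4) are cut, and both processors receive a load equal to the total weight divided by K, so the mapping is perfectly balanced.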
3.3 Solving the Mapping Problem Using MFA
In this section, a formulation of the Mean Field Annealing (MFA) algorithm
for the mapping problem is proposed. The TIG and PCG models described
in Section 3.2 are used to represent the mapping problem. The formulation
is first presented for problems modeled by dense TIGs. The modification in
the formulation for mapping problems that can be represented by sparse TIGs
is presented later. In this section, an efficient implementation scheme for the
proposed formulation is also presented.
Figure 3.1. A mapping problem instance, with (a) TIG, (b) POG (which
represents a 2-dimensional hypercube) and (c) PCG.
3.3.1 Formulation
A spin matrix, which consists of $N$ task-rows and $K$ processor-columns, is
used as the representation scheme. Hence, $N \times K$ spins are used to encode
the solution. The output $s_{ip}$ of a spin $(i,p)$ denotes the probability of mapping
task $i$ to processor $p$. Here, $s_{ip}$ is a continuous variable in the range $0 \le s_{ip} \le 1$.
When the MFA algorithm reaches a solution, spin values converge to 1 or 0
indicating the result. If $s_{ip}$ is 1, this means that task $i$ is mapped to processor $p$.
For example, a solution to the mapping instance given in Figure 3.1 can be
represented by the following $N \times K$ spin matrix.
                 K processors
            p=1   p=2   p=3   p=4
    i=1      1     0     0     0
    i=2      1     0     0     0
    i=3      0     0     0     1
    i=4      0     0     1     0
    i=5      0     1     0     0
    i=6      0     0     0     1
    i=7      0     1     0     0
    i=8      0     0     1     0
Note that this solution is identical to the solution given at the end of
Section 3.2.
The following energy (i.e., cost) function is proposed for the mapping problem
$$ H(s) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j \neq i} \sum_{p=1}^{K} \sum_{q \neq p} e_{ij}\, s_{ip}\, s_{jq}\, d_{pq} + \frac{r}{2} \sum_{i=1}^{N} \sum_{j \neq i} \sum_{p=1}^{K} s_{ip}\, s_{jp}\, w_i\, w_j \tag{3.4} $$
Here, $e_{ij}$ denotes the edge weight between the pair of tasks $i$ and $j$, and $w_i$
denotes the weight of task $i$ in TIG. Weight of the edge between processors $p$
and $q$ in the PCG is represented by $d_{pq}$.
Under the mean field approximation, the expression $\langle H(s) \rangle$ for the expected
value of the objective function given in Eq. (3.4) will be similar to the expression
given for $H(s)$ in Eq. (3.4). However, in this case, $s_{ip}$, $s_{jq}$ and $s_{jp}$ should be
replaced with $\langle s_{ip} \rangle$, $\langle s_{jq} \rangle$ and $\langle s_{jp} \rangle$ respectively. For the sake of simplicity, $s_{ip}$
is used to denote the expected value of spin $(i,p)$ (i.e., spin average $\langle s_{ip} \rangle$) in
the following discussions.
In Eq. (3.4), the term $s_{ip} \times s_{jq}$ denotes the probability that task $i$ and task $j$
are mapped to two different processors $p$ and $q$, respectively, under the mean
field approximation. Hence, the term $e_{ij} \times s_{ip} \times s_{jq} \times d_{pq}$ represents the weighted
interprocessor communication overhead introduced due to the mapping of the
tasks $i$ and $j$ to different processors. Note that, in Eq. (3.4), the first quadruple
summation term covers all processor pairs in the PCG for each edge pair
in the TIG. Hence, the first quadruple summation term denotes the total
interprocessor communication cost for a mapping represented by an instance of
the spin matrix. Then, minimization of the first quadruple summation term
corresponds to the minimization of the interprocessor communication overhead
for the given mapping problem instance.
Second triple summation term in Eq. (3.4) computes the summation of the
inner products of the weights of the tasks mapped to individual processors 
for a mapping. Global minimum of the second triple summation term occurs 
when equal amount of task weights are mapped to each processor. If there is 
an imbalance in the mapping, second triple summation term increases with the 
square of the weight of the imbalance, penalizing imbalanced mappings. The 
parameter r in Eq. (3.4) is introduced to maintain a balance between the two 
optimization objectives of the mapping problem.
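
To make the two terms of Eq. (3.4) concrete, the following sketch evaluates the energy function directly for a spin matrix stored as a list of rows. The function name and the small test instance are assumptions made for illustration, and tasks and processors are indexed from 0. The direct evaluation visits every task pair and processor pair, so it is only meant for checking small examples.

```python
def energy(s, e, w, d, r):
    """Direct evaluation of Eq. (3.4) for an N x K spin matrix s."""
    n, k = len(s), len(s[0])
    comm = 0.0      # quadruple summation (communication term)
    balance = 0.0   # triple summation (load balance term)
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            for p in range(k):
                balance += s[i][p] * s[j][p] * w[i] * w[j]
                for q in range(k):
                    if q != p:
                        comm += e[i][j] * s[i][p] * s[j][q] * d[p][q]
    return 0.5 * comm + 0.5 * r * balance

# Hypothetical instance: 4 tasks, 2 processors, symmetric edge weights.
e = [[0, 3, 0, 1],
     [3, 0, 1, 0],
     [0, 1, 0, 2],
     [1, 0, 2, 0]]
w = [2, 1, 1, 2]
d = [[0, 1],
     [1, 0]]
balanced = [[1, 0], [1, 0], [0, 1], [0, 1]]   # tasks 0,1 -> p0; tasks 2,3 -> p1
one_sided = [[1, 0], [1, 0], [1, 0], [1, 0]]  # every task on processor 0

h_balanced = energy(balanced, e, w, d, 1.0)
h_one_sided = energy(one_sided, e, w, d, 1.0)
```

For a 0/1 spin matrix the first term reduces to the communication cost of Eq. (3.1), because each cut edge is counted twice and then halved. In this example the balanced mapping pays a communication cost of 2 but still has a much lower energy than the mapping that places every task on one processor.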
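Using the mean field approximation described in Eq. (2.8), the expression
for the mean field $\phi_{ip}$ experienced by spin $(i,p)$ can be found to be
$$ \phi_{ip} = -\frac{\partial H(s)}{\partial s_{ip}} = -\sum_{j \neq i}^{N} \sum_{q \neq p}^{K} e_{ij}\, s_{jq}\, d_{pq} - r \sum_{j \neq i}^{N} s_{jp}\, w_i\, w_j \tag{3.5} $$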
In a feasible mapping, each task should be mapped exclusively to a single
processor. However, there exists no penalty term in Eq. (3.4) to handle this
feasibility constraint. This feasibility constraint is explicitly handled while
updating the spin values. Note that, from Eq. (2.4), individual spin average
$s_{ip}$ is proportional to $e^{\phi_{ip}/T}$, i.e. $s_{ip} \propto e^{\phi_{ip}/T}$. Then, $s_{ip}$ is normalized as
$$ s_{ip} = \frac{e^{\phi_{ip}/T}}{\sum_{q=1}^{K} e^{\phi_{iq}/T}} \tag{3.6} $$
This normalization enforces the summation of each row of the spin matrix to
be equal to unity. Hence, it is guaranteed that all rows of the spin matrix will
have only one spin with output value 1 when the system is stabilized.
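
The normalized update of Eq. (3.6) can be sketched as follows. Subtracting the largest mean field of the row before exponentiating is an implementation choice assumed here, not something stated in the text: it leaves the ratio unchanged but keeps the exponentials from overflowing at low temperatures. The function name and sample values are hypothetical.

```python
import math

def update_row(phi_row, t):
    """Eq. (3.6): s_ip = exp(phi_ip / T) / sum_q exp(phi_iq / T)."""
    top = max(phi_row)
    # exp((phi - top) / T) <= 1, so no overflow can occur for small T.
    weights = [math.exp((phi - top) / t) for phi in phi_row]
    total = sum(weights)
    return [x / total for x in weights]

phi_row = [-1.0, -3.0, -2.0, -3.0]   # hypothetical mean fields of one task row
warm = update_row(phi_row, 1.0)
cold = update_row(phi_row, 0.01)
```

At the higher temperature the row is still spread over several processors, while at the lower temperature almost all of the probability moves to the processor with the largest mean field, which is how the rows approach 0/1 values as the system is cooled.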
Eq. (3.5) can be interpreted in the context of the mapping problem as
follows. The first double summation represents the rate of increase expected in
the total interprocessor communication cost by mapping task $i$ to processor $p$.
The second summation represents the rate of increase in the computational load
balance cost associated with processor $p$ by mapping task $i$ to processor $p$.
Hence, $-\phi_{ip}$ may be interpreted as the expected rate of decrease in the overall
quality of the mapping by mapping task $i$ to processor $p$. Then, in Eq. (3.6),
$s_{ip}$ is updated such that the probability of task $i$ being mapped to processor $p$
increases with increasing mean field experienced by spin $(i,p)$. Hence, the
MFA heuristic can be considered as a gradient-descent type algorithm in this
context. However, it is also a stochastic algorithm similar to SA due to the
random spin update scheme and the annealing process.
In the general MFA algorithm given in Figure 2.2, a randomly chosen spin
is updated at a time. However, in the proposed formulation of the MFA for
the mapping problem, $K$ spins of a randomly chosen row of the spin matrix
are updated at a time. This update operation is performed as follows. Mean
fields $\phi_{ip}$ ($1 \le p \le K$) experienced by the spins at the $i$-th row of the spin
matrix are computed by using Eq. (3.5) for $p = 1, 2, \ldots, K$. Then, the spin
averages $s_{ip}$, $1 \le p \le K$, are updated using Eq. (3.6) for $p = 1, 2, \ldots, K$. Each
row update of the spin matrix is referred to as a single iteration of the algorithm.
The system is observed after each spin-row update in order to detect the
convergence to an equilibrium state for a given temperature [34]. If the energy
function $H$ is not decreasing after a certain number of consecutive spin-row
updates, this means that the system is stabilized for that temperature [34].
Then, $T$ is decreased according to the cooling schedule, and the iteration process
is re-initiated. Note that the computation of the energy difference $\Delta H$
necessitates the computation of $H$ (Eq. (3.4)) at each iteration. The complexity
of computing $H$ is $O(N^2 \times K^2)$, which drastically increases the complexity of
one iteration of MFA. Here, we propose an efficient scheme which reduces the
complexity of energy difference computation by an asymptotical factor.
The incremental energy change $\delta H_{ip}$ because of the incremental change $\delta s_{ip}$
in the value of an individual spin $(i,p)$ is
$$ \delta H = \delta H_{ip} = \phi_{ip}\, \delta s_{ip} \tag{3.7} $$
due to Eq. (2.5). Since $H(s)$ is linear in $s_{ip}$ (see Eq. (3.4)), the above equation is
valid for any amount of change $\Delta s_{ip}$ in the value of spin $(i,p)$, that is
$$ \Delta H = \Delta H_{ip} = \phi_{ip}\, \Delta s_{ip} \tag{3.8} $$
At each iteration of the MFA algorithm, $K$ spin values are updated in a
synchronous manner. Hence, Eq. (3.8) is valid for all spin updates performed in
a particular iteration (i.e. for $1 \le p \le K$). Thus, the energy difference due to the
spin-row update operation in a particular iteration can be computed as
$$ \Delta H = \Delta H_i = \sum_{p=1}^{K} \phi_{ip}\, \Delta s_{ip} \tag{3.9} $$
where $\Delta s_{ip} = s_{ip}^{new} - s_{ip}^{old}$. The complexity of computing Eq. (3.9) is only $O(K)$
since mean field ($\phi_{ip}$) values are already computed for the spin updates.
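
The O(K) shortcut can be checked numerically against a full evaluation of the energy function. In the sketch below, the helper names and the small instance are assumptions made for illustration. Because the mean field is the negative derivative of the energy (Eq. (2.5) and Eq. (3.5)), the sum of mean field times spin change corresponds to the amount by which the energy drops during the row update, which is what the comparison uses.

```python
def mean_field_row(i, s, e, w, d, r):
    """Mean fields of Eq. (3.5) for every processor p of task row i."""
    n, k = len(s), len(s[0])
    phi = []
    for p in range(k):
        comm = sum(e[i][j] * s[j][q] * d[p][q]
                   for j in range(n) if j != i
                   for q in range(k) if q != p)
        load = sum(s[j][p] * w[i] * w[j] for j in range(n) if j != i)
        phi.append(-comm - r * load)
    return phi

def energy(s, e, w, d, r):
    """Direct O(N^2 K^2) evaluation of Eq. (3.4)."""
    n, k = len(s), len(s[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            if j != i:
                for p in range(k):
                    total += 0.5 * r * s[i][p] * s[j][p] * w[i] * w[j]
                    total += 0.5 * sum(e[i][j] * s[i][p] * s[j][q] * d[p][q]
                                       for q in range(k) if q != p)
    return total

# Hypothetical instance: 3 tasks, 2 processors.
e = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
w = [1, 2, 1]
d = [[0, 1], [1, 0]]
r = 0.5
s_old = [[0.5, 0.5], [0.7, 0.3], [0.2, 0.8]]
new_row = [0.9, 0.1]                      # new spin values for task row 0

phi = mean_field_row(0, s_old, e, w, d, r)
shortcut = sum(phi[p] * (new_row[p] - s_old[0][p]) for p in range(2))  # O(K)
s_new = [new_row] + s_old[1:]
full = energy(s_old, e, w, d, r) - energy(s_new, e, w, d, r)  # energy decrease
```

Both quantities agree up to rounding, so the stabilization test can rely on the already computed mean fields instead of re-evaluating the whole energy function.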
The formulation of the MFA algorithm for the mapping problem instances
with sparse TIGs is done as follows. The expression given for $\phi_{ip}$ (Eq. (3.5))
can be modified for sparse TIGs as
$$ \phi_{ip} = -\sum_{j \in Adj(i)} \sum_{q \neq p}^{K} e_{ij}\, s_{jq}\, d_{pq} - r \sum_{j \neq i}^{N} s_{jp}\, w_i\, w_j \tag{3.10} $$
Here, $Adj(i)$ denotes the set of tasks connected to task $i$ in the given TIG. Note
that the sparsity of the TIG can only be exploited in mean field computations since
spin update operations given in Eq. (3.6) are dense operations which are not
affected by the sparsity of the TIG.
The steps of the MFA algorithm for solving the mapping problem are given in
Figure 3.2. Complexity of computing first double summation terms in Eq. (3.5)
and Eq. (3.10) are $O(N \times K)$ and $O(d_{avg} \times K)$ for dense and sparse TIGs
respectively. Here, $d_{avg}$ denotes the average degree of the vertices of the sparse
TIG. Second summation operations in Eq. (3.5) and Eq. (3.10) are both $O(N)$
for dense and sparse TIGs. Then, complexity of a single mean field computation
is $O(N \times K)$ and $O(d_{avg} \times K + N)$ for dense (Eq. (3.5)) and sparse (Eq. (3.10))
TIGs respectively. Hence, complexity of mean field computations for a spin row
is $O(N \times K^2)$ for dense TIGs, and $O(d_{avg} \times K^2 + N \times K)$ for sparse TIGs (step
3.1.2 in Figure 3.2). Spin update computations (steps 3.1.3, 3.1.4 and 3.1.6) and
energy difference computation (step 3.1.5) are both $O(K)$ operations. Hence,
the overall complexity of a single MFA iteration is $O(N \times K^2)$ for dense TIGs,
and $O(d_{avg} \times K^2 + N \times K)$ for sparse TIGs.

1. Get initial temperature $T_0$, and set $T = T_0$
2. Initialize the spin averages $s = [s_{11}, \ldots, s_{ip}, \ldots, s_{NK}]$
3. While temperature $T$ is in the cooling range DO
   3.1 While $H$ is decreasing DO
       3.1.1 Select a task $i$ at random.
       3.1.2 Compute mean fields of the spins at the $i$-th row
             $\phi_{ip} = -\sum_{j \neq i}^{N} \sum_{q \neq p}^{K} e_{ij} s_{jq} d_{pq} - r \sum_{j \neq i}^{N} s_{jp} w_i w_j$
       3.1.3 Compute the summation $\sum_{p=1}^{K} e^{\phi_{ip}/T}$
       3.1.4 Compute new spin values at the $i$-th row
             $s_{ip}^{new} = e^{\phi_{ip}/T} / \sum_{q=1}^{K} e^{\phi_{iq}/T}$ for $1 \le p \le K$
       3.1.5 Compute the change in energy due to these spin updates
             $\Delta H = \sum_{p=1}^{K} \phi_{ip} (s_{ip}^{new} - s_{ip})$
       3.1.6 Update the spin values at the $i$-th row
             $s_{ip} = s_{ip}^{new}$ for $1 \le p \le K$
   3.2 $T = \alpha \times T$
Figure 3.2. MFA algorithm for the mapping problem.
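
The row-wise algorithm of Figure 3.2 can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the implementation evaluated in this work: the function name, the default parameter values, the near-uniform initialization, the per-temperature update budget, and the stabilization test (the absolute value of the computed energy change staying below a small threshold for N consecutive row updates) are all choices made for the example. The mean fields are computed directly from Eq. (3.5), so each iteration costs O(N x K^2).

```python
import math
import random

def mfa_mapping(e, w, d, r=1.0, t0=3.0, alpha=0.9, eps=1e-6, seed=0):
    """Map N tasks onto K processors; returns a processor index per task."""
    rng = random.Random(seed)
    n, k = len(w), len(d)
    s = []                                        # Step 2: near-uniform rows
    for _ in range(n):
        row = [1.0 + 0.01 * rng.random() for _ in range(k)]
        total = sum(row)
        s.append([x / total for x in row])
    t = t0                                        # Step 1
    while t > t0 / 5.0:                           # Step 3: cool down to T0/5
        quiet, budget = 0, 200 * n                # budget guards termination
        while quiet < n and budget > 0:           # Step 3.1
            budget -= 1
            i = rng.randrange(n)                  # Step 3.1.1
            phi = []                              # Step 3.1.2, Eq. (3.5)
            for p in range(k):
                comm = sum(e[i][j] * s[j][q] * d[p][q]
                           for j in range(n) if j != i
                           for q in range(k) if q != p)
                load = sum(s[j][p] * w[i] * w[j] for j in range(n) if j != i)
                phi.append(-comm - r * load)
            top = max(phi)                        # Steps 3.1.3-3.1.4, Eq. (3.6)
            ex = [math.exp((x - top) / t) for x in phi]
            z = sum(ex)
            new_row = [x / z for x in ex]
            # Step 3.1.5: energy change from the mean fields, Eq. (3.9)
            dh = sum(phi[p] * (new_row[p] - s[i][p]) for p in range(k))
            s[i] = new_row                        # Step 3.1.6
            quiet = quiet + 1 if abs(dh) < eps else 0
        t *= alpha                                # Step 3.2
    # Decode: each task goes to the processor with the largest spin average.
    return [max(range(k), key=lambda p: s[i][p]) for i in range(n)]

# Hypothetical instance: two tightly coupled task pairs, two processors.
e = [[0, 5, 0, 0],
     [5, 0, 1, 0],
     [0, 1, 0, 5],
     [0, 0, 5, 0]]
w = [1, 1, 1, 1]
d = [[0, 1],
     [1, 0]]
mapping = mfa_mapping(e, w, d)
```

On this small instance the heavily interacting tasks end up on the same processor and the two pairs are separated, which balances the load while cutting only the light edge. Which of the two processors receives which pair depends on the random initialization.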
3.3.2 An Efficient Implementation Scheme
As mentioned earlier, the MFA algorithm proposed for the mapping problem
is an iterative process. The complexity of a single MFA iteration is mainly due
to the mean field computations. In this section, we propose an efficient
implementation scheme which reduces the complexity of the mean field computations
and hence the complexity of the MFA iteration by asymptotical factors.
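Assume that the $i$-th spin-row is selected at random for update in a particular
iteration. The expression given for $\phi_{ip}$ (Eq. (3.5)) can be rewritten by changing
the order of summations of the first double summation term as
$$ \phi_{ip} = -\sum_{q \neq p}^{K} d_{pq} \sum_{j \neq i}^{N} e_{ij}\, s_{jq} - r\, w_i \sum_{j \neq i}^{N} w_j\, s_{jp} = -\sum_{q \neq p}^{K} d_{pq}\, \lambda_{iq} - r\, \psi_{ip} \tag{3.11} $$
where
$$ \lambda_{iq} = \sum_{j \neq i}^{N} e_{ij}\, s_{jq} \tag{3.12} $$
$$ \psi_{ip} = w_i \sum_{j \neq i}^{N} w_j\, s_{jp} \tag{3.13} $$
Here, $\lambda_{iq}$ represents the rate of increase expected in the interprocessor
communication by mapping task $i$ to a processor other than $q$ (for the current mapping
on processor $q$), assuming uniform unit communication cost between all pairs
of processors in PCG. Similarly, $\psi_{ip}$ represents the rate of increase expected in
the computational load balance cost associated with processor $p$, by mapping
task $i$ to processor $p$ (for the current mapping on processor $p$).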
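For an efficient implementation, the overall mean field computation involved
in a single iteration can be computed using the following matrix equation
$$ \Phi_i = -D \times \Lambda_i - r\, \Psi_i \tag{3.14} $$
$$ \phantom{\Phi_i} = -\Theta_i - r\, \Psi_i \tag{3.15} $$
Here, $D$ is a $K \times K$ adjacency matrix representing the PCG (i.e. $D_{pq} = d_{pq}$ and
$D_{pp} = 0$), and $\Phi_i$, $\Lambda_i$, $\Psi_i$ and $\Theta_i$ are column vectors with $K$ elements, where
$$ \Phi_i = [\phi_{i1}, \ldots, \phi_{ip}, \ldots, \phi_{iK}]^t, \quad \Lambda_i = [\lambda_{i1}, \ldots, \lambda_{ip}, \ldots, \lambda_{iK}]^t, $$
$$ \Psi_i = [\psi_{i1}, \ldots, \psi_{ip}, \ldots, \psi_{iK}]^t, \quad \Theta_i = [\theta_{i1}, \ldots, \theta_{ip}, \ldots, \theta_{iK}]^t \tag{3.16} $$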
The complexity analysis of the proposed implementation scheme for dense
TIGs is as follows. Complexity of computing an individual $\lambda_{iq}$ and $\psi_{ip}$ are both
$O(N)$. Complexity of constructing $\Lambda_i$ and $\Psi_i$ vectors are both $O(N \times K)$, since
both vectors contain $K$ such entries. Complexity of computing the matrix-vector
product required in Eq. (3.14) is $O(K^2)$. Hence, the overall complexity of
computing the $\Phi_i$ vector (Eq. (3.14)) reduces to $O(N \times K + K^2) = O(N \times K)$,
since $N \gg K$ in general. The complexity of $K$ spin updates and the
computation of $\Delta H$ are both $O(K)$. Thus, the proposed scheme reduces the
computational complexity of a single MFA iteration to $O(N \times K)$ for dense
TIGs with $N \gg K$.
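
The reordering behind the matrix form of Eq. (3.14) and Eq. (3.15) can be sketched as follows: the interaction-weighted column sums and the load terms of row i are computed once, and a single product with the processor cost matrix then yields all K mean fields. The function names and the small instance are assumptions made for illustration; the direct evaluation of Eq. (3.5) is included only to check that both forms agree.

```python
def mean_field_row_direct(i, s, e, w, d, r):
    """Eq. (3.5) evaluated term by term: O(N * K^2) for one row."""
    n, k = len(s), len(s[0])
    phi = []
    for p in range(k):
        comm = sum(e[i][j] * s[j][q] * d[p][q]
                   for j in range(n) if j != i
                   for q in range(k) if q != p)
        load = sum(s[j][p] * w[i] * w[j] for j in range(n) if j != i)
        phi.append(-comm - r * load)
    return phi

def mean_field_row_fast(i, s, e, w, d, r):
    """Matrix form Phi_i = -D * Lambda_i - r * Psi_i: O(N * K + K^2)."""
    n, k = len(s), len(s[0])
    # Lambda_i: interaction-weighted column sums over the other tasks.
    lam = [sum(e[i][j] * s[j][q] for j in range(n) if j != i)
           for q in range(k)]
    # Psi_i: load terms contributed by the other tasks on each processor.
    psi = [w[i] * sum(w[j] * s[j][p] for j in range(n) if j != i)
           for p in range(k)]
    # Theta_i = D * Lambda_i, skipping the diagonal (q != p).
    theta = [sum(d[p][q] * lam[q] for q in range(k) if q != p)
             for p in range(k)]
    return [-theta[p] - r * psi[p] for p in range(k)]

# Hypothetical instance: 3 tasks, 2 processors.
e = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
w = [1, 2, 1]
d = [[0, 1], [1, 0]]
s = [[0.5, 0.5], [0.7, 0.3], [0.2, 0.8]]

direct = mean_field_row_direct(0, s, e, w, d, 0.5)
fast = mean_field_row_fast(0, s, e, w, d, 0.5)
```

Both functions return the same mean fields for the selected row; only the number of multiplications differs.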
The complexity analysis of the proposed implementation for sparse TIGs
is as follows. Note that the sparsity of the TIG can only be exploited in the
computation of the $\lambda_{iq}$ values, as
$$ \lambda_{iq} = \sum_{j \in Adj(i)} e_{ij}\, s_{jq} \tag{3.17} $$
for sparse TIGs. Hence, the complexity of computing an individual $\lambda_{iq}$ is
only $O(d_{avg})$. Thus, the complexity of constructing the $\Lambda_i$ vector reduces to
$O(d_{avg} \times K)$. The complexity of computing the $\Theta_i$ vector in Eq. (3.15) reduces
to $O(d_{avg} \times K + K^2)$. However, the complexity of constructing the $\Psi_i$ vector
required in Eq. (3.15) is $O(N \times K)$, dominating the overall complexity of the
mean field computations. The complexity of computing the $\Psi_i$ vector can be
reduced as follows. The computation of $\psi_{ip}$ in Eq. (3.13) can be re-formulated
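as
$$ \psi_{ip} = \sum_{j \neq i}^{N} s_{jp}\, w_i\, w_j = w_i \sum_{j \neq i}^{N} s_{jp}\, w_j = w_i \Big( \sum_{j=1}^{N} s_{jp}\, w_j - s_{ip}\, w_i \Big) = w_i\, (\gamma_p - s_{ip}\, w_i) \tag{3.18} $$
where
$$ \gamma_p = \sum_{j=1}^{N} s_{jp}\, w_j \tag{3.19} $$
Here, $\gamma_p$ represents the computational load of processor $p$, for the current
mapping on processor $p$. Note that, computationally, $\gamma_p$ represents the weighted sum
of spin values of the $p$-th column of the spin matrix. Hence, the initial $\gamma_p$ value of
each column $p$ ($1 \le p \le K$) can be computed by using Eq. (3.19) for the initial
spin values. Then, $\gamma_p$ values can be updated at the end of each iteration (i.e.
after spin updates) by using
$$ \gamma_p^{new} = \gamma_p^{old} + w_i\, (s_{ip}^{new} - s_{ip}^{old}) \tag{3.20} $$
for $1 \le p \le K$.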
The computation of the initial $\gamma_p$ values can be excluded from the complexity
analysis since they are computed only once at the very beginning of the
algorithm. In this scheme, the computation of an individual $\psi_{ip}$ using Eq. (3.18)
is an $O(1)$ operation. Hence, the construction of the $\Psi_i$ vector required in
Eq. (3.14) becomes an $O(K)$ operation. Thus, the complexity of computing
the mean field values reduces to $O(d_{avg} \times K + K^2)$. Note that the update
of an individual $\gamma_p$ value (using Eq. (3.20)) at the end of the iteration is an
$O(1)$ operation. Hence, the overall complexity of the $\gamma_p$ updates is $O(K)$ since
$K$ weighted column sums should be updated at each iteration. Note that the
complexities of the spin updates and the energy difference computation are also $O(K)$
for sparse TIGs. Hence, the implementation scheme proposed for sparse TIGs
reduces the complexity of a single MFA iteration to $O(d_{avg} \times K + K^2)$.
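The bookkeeping behind Eqs. (3.18) to (3.20) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the list-of-lists spin matrix and the tiny data set are hypothetical and are not taken from the thesis implementation.

```python
def column_loads(spins, weights):
    """gamma[p] = sum over tasks j of w_j * s_jp (the role of Eq. (3.19))."""
    num_procs = len(spins[0])
    return [sum(w * row[p] for w, row in zip(weights, spins))
            for p in range(num_procs)]


def psi_row(i, spins, weights, gamma):
    """psi_ip = w_i * (gamma_p - w_i * s_ip): O(1) per entry, O(K) per row."""
    w_i = weights[i]
    return [w_i * (g - w_i * s) for g, s in zip(gamma, spins[i])]


def install_row(i, new_row, spins, weights, gamma):
    """Overwrite row i and patch every gamma_p in place (the role of Eq. (3.20))."""
    for p, s_new in enumerate(new_row):
        gamma[p] += weights[i] * (s_new - spins[i][p])
        spins[i][p] = s_new


# Hypothetical instance: 3 tasks, 2 processors.
weights = [3, 1, 4]
spins = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
gamma = column_loads(spins, weights)          # computed once, O(N * K)
psi_before = psi_row(1, spins, weights, gamma)
install_row(1, [0.7, 0.3], spins, weights, gamma)  # O(K) per iteration
```

After `install_row`, the patched `gamma` agrees with a full recomputation by `column_loads`, which is what allows the $O(N \times K)$ column sums to be avoided in every iteration.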
3.4 Performance of Mean Field Annealing Algorithm
This section presents the performance evaluation of the Mean Field Annealing
(MFA) algorithm for the mapping problem, in comparison with two well-known
mapping heuristics: Simulated Annealing (SA) and Kernighan-Lin (KL). Each
algorithm is tested using randomly generated mapping problem instances. In
the following sections, the implementations are described in order to give a better
understanding of the discussed algorithms.
3.4.1 MFA Implementation
The MFA algorithm described in the previous section (Figure 3.2) is implemented
for testing the performance of the algorithm. The cooling process is started from an
initial temperature which is found experimentally. For the mapping problem
instances used in the experiments, the initial temperature $T_0$ is found to vary
in the range $1 \le T_0 \le 10$. The coefficient $r$, which determines the balance between the two
optimization criteria, is also found experimentally, varying in the range $0.1 \le r \le 1.5$.
At each temperature, iterations are continued until $\Delta H < \epsilon$ for $L$ consecutive
iterations. $L$ is set equal to $N$ initially. The parameter $\epsilon$ is chosen to lie between
two small powers of ten. The temperature is decreased using $\alpha = 0.9$ until $T$ is less than
$T_0/1.5$. Then, $L$ is set to $L/3$ and $\alpha$ is set to 0.5, and cooling is continued until
$T$ is less than $T_0/5.0$. The spin values resulting from this cooling operation are set
to 0 if they are less than 0.5 and set to 1 if they are greater than 0.5. Then
the result is decoded as described in Section 3.3 and the resulting mapping is
found.
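The two-stage cooling schedule and the final thresholding described above can be sketched as follows. This is a minimal sketch, not the thesis code: the generator name, the integer division used for $L/3$, and the choice to still run iterations at the first temperature below $T_0/1.5$ are assumptions made for illustration.

```python
def cooling_schedule(t0, n_tasks):
    """Yield (temperature, run_length) pairs for the two-stage schedule.

    Stage 1: alpha = 0.9 and L = N while T >= T0 / 1.5.
    Stage 2: alpha = 0.5 and L = L / 3 until T < T0 / 5.0.
    """
    temperature, run_length, alpha = float(t0), n_tasks, 0.9
    second_stage = False
    while temperature >= t0 / 5.0:
        if not second_stage and temperature < t0 / 1.5:
            # Assumption: L / 3 is taken as integer division, at least 1.
            run_length, alpha, second_stage = max(1, run_length // 3), 0.5, True
        yield temperature, run_length
        temperature *= alpha


def harden(spin_row):
    """Threshold converged spin averages at 0.5 to obtain a 0/1 row."""
    return [1 if s > 0.5 else 0 for s in spin_row]


# Hypothetical values: T0 = 10 and N = 30.
schedule = list(cooling_schedule(10.0, 30))
```

With these hypothetical values the schedule visits four temperatures in the slow stage and two in the fast stage before $T$ drops below $T_0/5$.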
3.4.2 Kernighan-Lin Implementation
Kernighan-Lin heuristic is not directly applicable to the mapping problem since 
it was originally proposed for graph bipartitioning. In order to apply KL 
heuristic to the mapping problem, a two-phase approach is used. In the first
phase, the task interaction graph $G_T(V, E)$ is partitioned into $K$ clusters, where $K$
is equal to the number of processors. In the second phase, these $K$ clusters are
mapped to the processor graph $G_P(P, D)$ using a one-to-one mapping heuristic.
The one-to-one mapping heuristic used in this work is a variant of the KL
heuristic.
For the clustering phase, the Kernighan-Lin heuristic is implemented efficiently
as described by Fiduccia and Mattheyses [7]. In order to apply KL to $K$-way
graph partitioning, two schemes are used. The first one, partitioning by recursive
bisection (KL-RB), recursively partitions the initial graph into two partitions
until $K$ partitions are obtained. The other scheme, partitioning by pairwise min-cut
(KL-PM), starts with an initial $K$-way partitioning and then minimizes the
cutsizes between each pair of partitions until no improvement can be made. In
KL heuristic balancing of the work load of processors is done implicitly by the 
algorithm. When moving one node from one partition to another, weights of 
the partitions are tested and moves causing intolerable imbalance are rejected.
In the beginning of second phase, K  clusters formed in the first phase are 
mapped to the K  processors of the multicomputer randomly. After this initial 
mapping, communication cost is minimized by performing a sequence of cluster
swaps. An individual cluster swap corresponds to interchanging the mapping 
of a pair of clusters.
3.4.3 Simulated Annealing Implementation
Simulated Annealing algorithm, implemented for solving the mapping problem, 
uses the one phase approach to map the TIG onto PCG. In simulated annealing, 
starting from a randomly chosen initial configuration, configuration space is 
searched for the best solution using a probabilistic hill climbing algorithm. A 
configuration of the mapping problem is a mapping between TIG and PCG, 
which assigns each task in the TIG to a processor in the PCG. In order to search the
configuration space, the neighborhood of a configuration must be defined. For the
implementation in this work, the neighborhood of a configuration consists of all
configurations which result from moving one vertex (task) of the TIG from
the maximum loaded node (processor) of the PCG to another node of the PCG. At
each iteration of the simulated annealing algorithm, one of the possible moves is
chosen randomly as a candidate move. Then the resulting decrease in the total
communication cost after performing the candidate move is calculated without
changing the configuration. If the candidate move decreases the cutsize, it
is realized. If the candidate move increases the cutsize, then it is realized with a
probability which decreases with the increasing positive difference caused in the
total cutsize by that move. The acceptance probability of the moves that increase
the cost is controlled with a temperature parameter $T$ which is decreased using
an annealing schedule. Hence, as the annealing proceeds acceptance probability 
of uphill moves decreases. Cooling schedule used in the implementation of SA 
algorithm is similar to the schedule given in [22].
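The move generation and acceptance test described above can be sketched as follows. The text only states that the acceptance probability decreases with the cost increase and with falling temperature; the Metropolis form $e^{-\Delta/T}$ used here is an assumption, as are all function names and the treatment of zero-cost moves as accepted.

```python
import math
import random


def candidate_move(mapping, weights, num_procs, rng=random):
    """Pick a random task on the most heavily loaded processor and a new home for it."""
    loads = [0] * num_procs
    for task, proc in enumerate(mapping):
        loads[proc] += weights[task]
    heaviest = max(range(num_procs), key=loads.__getitem__)
    task = rng.choice([t for t, p in enumerate(mapping) if p == heaviest])
    target = rng.choice([p for p in range(num_procs) if p != heaviest])
    return task, target


def acceptance_probability(delta_cost, temperature):
    """Assumed Metropolis rule: downhill moves always, uphill with exp(-delta / T)."""
    if delta_cost <= 0:
        return 1.0
    return math.exp(-delta_cost / temperature)


def accept(delta_cost, temperature, rng=random):
    """Draw the accept/reject decision for one candidate move."""
    return rng.random() < acceptance_probability(delta_cost, temperature)
```

In a full annealer, `delta_cost` would be the change in total communication cost computed for the candidate move before the configuration is altered.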
3.4.4 Experimental Results
In this section, performance of the MFA algorithm is discussed in comparison 
with SA and KL algorithms. These heuristics are tested by mapping
randomly generated TIGs onto mesh and hypercube connected multicomputers.
Six test TIGs are generated with $N = 200$ and 400 vertices. Vertices of
these TIGs are weighted by assigning a randomly chosen integer weight between
1 and 10 to each vertex ($1 \le w_i \le 10$, for $1 \le i \le N$). Interaction patterns
among the vertices of these TIGs are constructed as follows. A maximum vertex
degree, $d_{max}$, is selected for each test TIG ($d_{max} = 8, 16, 32$) such that the degree
$d_i$ of each vertex $i$ is a randomly chosen value between 1 and $d_{max}$ (i.e. $1 \le d_i \le
d_{max}$, for $1 \le i \le N$). Then, each vertex $i$ of the TIG is connected to $d_i$ randomly
chosen vertices. The resulting edges are weighted randomly with integer values
varying between 1 and 10. These TIGs are mapped to 3-, 4-, 5-dimensional
hypercubes and $4 \times 4$, $4 \times 8$ two dimensional mesh multicomputers. PCGs
corresponding to these interconnection topologies are constructed assuming
software routing as is described in Section 3.2.
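One possible reading of the TIG generation procedure is sketched below. It is an assumption-laden illustration rather than the generator used for the experiments: in particular, merging duplicate edges and letting a vertex end up with more than $d_i$ neighbours (because other vertices also pick it) are choices made here, and the function name is hypothetical.

```python
import random


def random_tig(num_tasks, d_max, rng):
    """Return (vertex_weights, edges) with edges keyed by (i, j), i < j."""
    weights = [rng.randint(1, 10) for _ in range(num_tasks)]
    edges = {}
    for i in range(num_tasks):
        degree = rng.randint(1, d_max)           # 1 <= d_i <= d_max
        others = [j for j in range(num_tasks) if j != i]
        for j in rng.sample(others, degree):     # d_i distinct partners
            key = (min(i, j), max(i, j))
            # Assumption: a repeated pair keeps the weight it was first given.
            edges.setdefault(key, rng.randint(1, 10))
    return weights, edges


# Hypothetical small instance (requires d_max <= num_tasks - 1).
tig_weights, tig_edges = random_tig(20, 4, random.Random(1))
```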
Tables 3.1, 3.2 and 3.3 illustrate the performance results of KL-RB, KL- 
PM, SA and MFA heuristics for the generated mapping problem instances. In 
these tables, $N$ and $|E|$ denote the number of vertices and edges in the test
TIGs respectively, and $K$ denotes the number of processors on the target PCG.
The interconnection topology of the target PCG is denoted by $T$, where H denotes
the hypercube interconnection topology and M denotes the mesh interconnection
topology. Each algorithm is executed 10 times for each problem instance,
starting from different, randomly chosen initial configurations. Averages of the
results are given in Tables 3.1, 3.2 and 3.3.
Table 3.1. Averages of the total communication costs of the solutions found
by KL-RB, KL-PM, SA and MFA heuristics, for randomly generated mapping
problem instances.

 PROBLEM SIZE     AVERAGE COMMUNICATION COST
  N  |E|  K T     KL-RB    KL-PM       SA      MFA
200  544  8 H    1807.4   1846.0   1595.1   1671.4
200  544 16 H    2819.9   2747.1   2180.0   2333.4
200  544 32 H    4098.8   4710.4   2879.0   3181.6
200 1120  8 H    5421.9   5494.7   4947.8   5092.4
200 1120 16 H    7742.4   7816.1   6699.1   6840.3
200 1120 32 H   10377.1  11280.2   8495.7   9200.3
200 2152  8 H   12721.6  12959.0  12018.5  11956.2
200 2152 16 H   17828.9  17859.9  16201.2  16261.2
200 2152 32 H   23127.6  24260.3  20407.0  20586.0
400 1227  8 H    4360.6   4444.5   3772.3   4235.6
400 1227 16 H    6096.0   6073.2   5086.4   5615.9
400 1227 32 H    8420.2   7999.9   6485.0   7184.0
400 2283  8 H   11247.1  11491.5  10152.1  10744.3
400 2283 16 H   15566.7  15896.9  13626.7  14197.5
400 2283 32 H   20543.8  20527.1  17169.8  18209.6
400 4298  8 H   25318.3  25832.1  23507.6  23561.1
400 4298 16 H   34590.6  35395.4  31427.2  32127.6
400 4298 32 H   45053.8  45098.1  39453.0  40133.8
200  544 16 M    3364.2   3318.7   2659.7   2996.0
200  544 32 M    5618.7   6822.5   4260.4   4580.0
200 1120 16 M    9234.2   9318.2   8432.3   8121.7
200 1120 32 M   14659.9  16476.4  13556.0  12456.9
400 1227 16 M    7341.4   7357.0   6293.0   6745.0
400 1227 32 M   12207.4  11758.6   9924.8  10780.0
400 2283 16 M   18670.9  19133.0  17480.1  16631.6
400 2283 32 M   29827.0  30156.3  28319.1  26078.2
Table 3.2. Averages of the computational loads of the minimum and maximum
loaded processors for the solutions found by KL-RB, KL-PM, SA, MFA
heuristics, for randomly generated mapping problem instances.

 PROBLEM SIZE                  AVERAGE MIN-MAX LOAD
                   KL-RB         KL-PM           SA           MFA
  N  |E|  K T     min   max     min   max     min   max     min   max
200  544  8 H   125.0 153.3   126.8 150.2   135.1 142.7   132.2 143.6
200  544 16 H    59.0  80.0    63.4  75.0    64.0  74.4    54.9  83.1
200  544 32 H    28.6  41.6    30.8  37.0    29.2  41.0    28.4  41.6
200 1120  8 H   121.4 155.6   125.7 150.6   134.1 142.9   127.0 149.4
200 1120 16 H    59.1  81.3    63.3  74.9    64.0  74.9    61.6  77.8
200 1120 32 H    28.6  42.4    29.4  37.0    28.2  42.8    30.7  39.4
200 2152  8 H   120.2 156.9   124.4 149.8   133.3 143.5   128.9 149.2
200 2152 16 H    57.4  81.8    62.0  74.0    63.1  67.9    60.7  79.4
200 2152 32 H    27.3  42.8    31.0  37.0    27.8  40.4    25.8  44.1
400 1227  8 H   250.9 319.4   259.2 313.0   281.7 290.6   281.6 289.9
400 1227 16 H   124.3 164.6   129.4 156.8   138.1 148.8   135.6 150.4
400 1227 32 H    60.2  87.0    64.6  78.0    66.0  77.0    58.7  86.7
400 2283  8 H   241.7 313.0   248.4 300.6   280.1 270.7   266.9 284.4
400 2283 16 H   115.7 159.5   124.3 149.9   132.6 143.2   126.5 149.3
400 2283 32 H    56.4  84.5    62.2  74.0    63.5  74.0    62.4  76.4
400 4298  8 H   253.6 331.0   261.6 318.8   285.4 298.3   273.4 309.7
400 4298 16 H   122.2 169.9   131.2 158.5   138.8 153.0   135.3 155.2
400 4298 32 H    59.5  88.9    65.0  79.0    67.3  77.7    58.2  87.6
200  544 16 M    58.6  79.7    63.2  74.8    63.2  74.4    62.8  76.4
200  544 32 M    28.7  41.4    31.0  37.0    29.1  39.5    26.0  42.6
200 1120 16 M    58.5  81.0    63.2  75.0    64.0  75.8    61.3  77.8
200 1120 32 M    28.7  42.1    30.5  37.0    28.6  42.9    26.1  42.3
400 1227 16 M   121.0 167.0   129.2 156.6   138.1 147.6   136.4 151.4
400 1227 32 M    59.5  86.2    64.1  78.0    64.6  81.8    63.3  80.4
400 2283 16 M   117.4 161.5   124.1 149.9   131.3 146.0   127.3 149.6
400 2283 32 M    56.3  83.9    62.1  74.0    63.0  76.9    59.6  78.0
Table 3.3. Average execution times (in seconds) of KL-RB, KL-PM, SA and
MFA heuristics, for randomly generated mapping problem instances.

 PROBLEM SIZE     AVERAGE EXECUTION TIMES
  N  |E|  K T    KL-RB   KL-PM      SA     MFA
200  544  8 H     1.07    5.74   80.72   19.57
200  544 16 H     1.53   13.70  127.17   46.17
200  544 32 H     3.29   29.60  245.10  101.84
200 1120  8 H     1.63    7.61   64.10   14.39
200 1120 16 H      2.2   14.56  144.04   58.11
200 1120 32 H     5.11   40.54  282.65  200.53
200 2152  8 H     2.52   10.93   64.22   26.07
200 2152 16 H     3.46   23.66  156.65   61.94
200 2152 32 H     7.60   45.38  373.85  294.94
400 1227  8 H     2.17   10.05  168.86   25.14
400 1227 16 H     2.98   29.74  310.68  164.17
400 1227 32 H     6.41   68.04  681.10  360.40
400 2283  8 H     3.25   16.02  167.07   26.67
400 2283 16 H     4.36   39.79  383.20   88.61
400 2283 32 H     8.61   88.85  632.80  221.60
400 4298  8 H     5.42   25.49  155.25   90.42
400 4298 16 H     7.05   64.88  402.95  171.26
400 4298 32 H    12.59  125.14  553.00  437.62
200  544 16 M      1.5     1.4   165.7    24.8
200  544 32 M      3.3    29.6   258.7    82.6
200 1120 16 M      2.3    14.8   124.2    36.2
200 1120 32 M      5.6    38.4   293.1   122.0
400 1227 16 M      3.1    26.7   280.5   108.0
400 1227 32 M      6.7    60.4   565.1   375.2
400 2283 16 M      4.4    41.7   363.8   130.9
400 2283 32 M      8.7    82.8   573.5   540.8
Tables 3.1 and 3.2 illustrate the quality of the solutions obtained by the KL-RB,
KL-PM, SA and MFA heuristics. Average total communication costs of
the solutions are displayed in Table 3.1, and average computational loads of
the maximum and minimum loaded processors are displayed in Table 3.2. As
is seen in Tables 3.1 and 3.2, the quality of the solutions obtained by the MFA and
SA heuristics is superior to that of the KL heuristic. Solutions found by SA are slightly
better than the solutions found by MFA, in general. However, in
some cases MFA performs better. The total communication costs found by
KL-RB are generally lower than the total communication costs found by KL-PM; however,
the load balance of the solutions found by KL-PM is better than that of KL-RB.
Table 3.3 displays the average execution times of KL-RB, KL-PM, SA and 
MFA heuristics, for the generated mapping problem instances. As is expected, 
the KL heuristic is faster compared with the MFA and SA heuristics. Observe that
MFA is always faster than SA. Execution time of MFA is comparable to KL- 
PM whereas, KL-RB is significantly faster compared with MFA and KL-PM. 
However, MFA is expected to perform better if an efficient cooling schedule 
can be devised by analyzing the algorithm in detail, which still remains as an 
open research issue. Furthermore, the execution times displayed in Table 3.3 
for MFA are not obtained by running the most efficient implementation
proposed in Section 3.3.2. The time complexity of the implemented scheme is
$O(d_{avg} \times K^2)$ whereas the complexity of the most efficient scheme proposed in
Section 3.3.2 is $O(d_{avg} \times K + K^2)$. Hence, the execution time of the algorithm
is expected to decrease significantly for large $d_{avg}$ and $K$.
3.5 Parallelization of Mean Field Annealing Algorithm
As is mentioned earlier, heuristic algorithm used for solving the mapping prob­
lem is a preprocessing overhead introduced for the efficient implementation of 
a given parallel program on the target multicomputer. If the mapping heuristic 
is implemented sequentially, this ])reprocessing can be considered as the serial 
portion of the parallel program which limits the maximum efficiency of the 
parallel program on the target machine. For a fixed parallel program instance,
the execution time of the parallel program is expected to decrease with increasing
number of processors in the target multicomputer. However, as is seen in
Table 3.3, for a fixed TIG, the execution times of all mapping heuristics increase
with increasing number of processors in the target multicomputer. Hence, the
serial fraction of the parallel program will increase with increasing number of
processors. Thus, this preprocessing will begin to constitute a drastic limit
on the maximum efficiency of the overall parallelization due to Amdahl’s Law.
Hence, parallelization of these mapping heuristics on the target multicomputer 
is a crucial issue for efficient parallel implementations.
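The effect of a sequential mapping step can be made concrete with Amdahl's Law, $S(K) = 1 / (f + (1 - f)/K)$, where $f$ is the serial fraction. The short sketch below only evaluates this standard formula; the function name and the sample fraction are illustrative assumptions, not measurements from the thesis.

```python
def amdahl_speedup(serial_fraction, num_procs):
    """Upper bound on speed-up when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_procs)


# Hypothetical case: mapping takes 10% of the sequential run time.
bound_16 = amdahl_speedup(0.1, 16)
bound_32 = amdahl_speedup(0.1, 32)
```

Doubling the processor count from 16 to 32 raises the bound far less than twofold in this hypothetical case, and the situation is worse if the serial fraction itself grows with $K$, as the execution times of the mapping heuristics suggest.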
Unfortunately, parallelization of the mapping heuristics introduces another
mapping problem. The computations of the mapping heuristics should be
mapped to the processors of the same target architecture. However, in this
case, the parallel algorithm for the mapping heuristic should be such that
its mapping can be achieved intuitively. Furthermore, the intuitive mapping
should lead to an efficient parallel implementation of the mapping heuristic. For
these reasons, the target mapping heuristic to be parallelized should involve 
regular and inherently parallel computations. MFA algorithm proposed in 
Section 3.3 for the general mapping problem has these properties for efficient 
parallelization. Following paragraphs discuss the efficient parallelization of the 
proposed mapping heuristic for multicomputers.
Assume that, MFA heuristic is to be used to map a given parallel program 
represented with a TIG having $N$ vertices on a target multicomputer with $K$
processors. The MFA heuristic will use an $N \times K$ spin matrix for the mapping
operation. The question is how to map the computations of the MFA heuristic
to the same target computer (with the same number $K$ of processors). As is
mentioned earlier, the MFA heuristic is an iterative algorithm. Hence, the mapping
scheme can be devised by analyzing the computations involved in a particular
iteration of the algorithm. The atomic task can be considered as the computations
required for updating an individual spin. Note that $K$ spin averages at a
particular row of the spin matrix are updated at each iteration. Hence, these
K  spin updates can be computed in parallel by mapping each spin in a row 
of the spin matrix to a distinct processor of the target architecture. Thus, 
the N  X K  spin matrix is partitioned column-wise such that each processor
is assigned an individual column of the spin matrix. That is, column $p$ of
the spin matrix is mapped to processor $p$ of the target architecture. Each
processor is held responsible for maintaining and updating the spin values in
its local column. Assume that task $i$ is selected at random in a particular
iteration. Then, each processor is responsible for updating the probability of 
task i being mapped to itself.
A single iteration of the MFA algorithm can be considered as a three phase 
process, namely, mean field computation phase, spin update phase, and energy 
difference computation phase. Each processor $p$ should compute its mean field
$\phi_{ip}$ (Eq. (3.5) or Eq. (3.10)) in the first phase, in order to update its local spin $s_{ip}$
(Eq. (3.6)) by using this mean field value in the second phase. As is mentioned
earlier, the mean field computation phase is the most time consuming phase of the
MFA algorithm. Fortunately, mean field computations are inherently parallel
since there are no interactions between the mean field computations involved in a
particular iteration. However, a close look at Eq. (3.5) and Eq. (3.10) reveals
that each processor needs the most recently updated values of all spins except the
ones in the $i$-th row in order to compute its local mean field value. Recall
that each processor maintains only a single column of updated spin values
due to the proposed mapping scheme. Hence, this computational interaction
necessitates global interprocessor communication just prior to the distributed
mean field computation at each iteration. The volume of global interprocessor
communication is proportional to $O(N \times K)$, since each processor $p$ needs all
updated spin values except the ones in the $i$-th row, in order to compute its
local $\phi_{ip}$. The volume of global interprocessor communication can be reduced
to $O(K)$ by considering the parallelization of the matrix equation given in
Eq. (3.14).
Eq. (3.14) involves the following operations: construction of the $\Lambda_i$ and
$\Psi_i$ vectors, the dense matrix vector product $\Theta_i = D \times \Lambda_i$ and the vector addition
$\Phi_i = -\Theta_i - r\Psi_i$. Note that each processor $p$ only needs to compute the $p$-th
entry $\theta_{ip}$ of the $\Theta_i$ vector, and the $p$-th entry $\psi_{ip}$ of the $\Psi_i$ vector in order to
compute its local mean field value $\phi_{ip}$ in parallel. The matrix vector product
can be performed in parallel by employing the scalar accumulation (SA-MVP)
scheme. In this scheme, each processor needs only the $p$-th row $d_p$ of the dense
$D$ matrix and the whole column vector $\Lambda_i$.
Each processor $p$ can concurrently compute the $p$-th entry $\lambda_{ip}$ of the $\Lambda_i$
vector by using Eq. (3.12). Note that $q$ in Eq. (3.12) should be replaced by
$p$ in these computations. Then, a global collect (GCOL) operation is required
for each processor to obtain a local copy of the $\Lambda_i$ vector. The GCOL operation
is essentially appending $K$ local scalars, in order, into a vector of size $K$
and then duplicating this vector in the local memory of each processor. The
GCOL operation requires global interprocessor communication. Note that
only $K$ local values (one $\lambda_{ip}$ per processor) should be collected globally, thus reducing the volume
of communication during the GCOL operation by an asymptotic factor of $N$.
After the GCOL operation, each processor has a local copy of the global
$\Lambda_i$ vector. Hence, each processor $p$ can concurrently compute its local $\theta_{ip}$ by
performing the inner product $\theta_{ip} = d_p \times \Lambda_i$. Then, each processor $p$ should
compute the $p$-th entry of the $\Psi_i$ vector. Note that each processor $p$ already
maintains its local $\gamma_p$ value. Hence, each processor can concurrently compute $\psi_{ip}$
using Eq. (3.18). Then, each processor $p$ can concurrently compute its local
mean field value $\phi_{ip}$ by performing the local computation $\phi_{ip} = -\theta_{ip} - r\psi_{ip}$.
Note that these computations are completely local and involve
no interprocessor communication.
The second phase of an individual iteration of the MFA algorithm is highly
sequential since global interaction exists between spin updates due to the
normalization process indicated by Eq. (3.6). Fortunately, the strong interaction
can be relieved by noting the independent exponentiation operations involved
in the numerator of Eq. (3.6). Hence, each processor $p$ can concurrently
compute its local $e^{\phi_{ip}/T}$ value. Then, a global sum (GSUM) operation is required
for each processor to obtain a local copy of the global sum of the local
exponentiation results. The GSUM operation requires global interprocessor
communication. After the GSUM operation each processor $p$ can concurrently update
its local spin value by computing Eq. (3.6). After computing $s_{ip}^{new}$, each
processor $p$ should update its local $\gamma_p$ value by using Eq. (3.20) for
use in the next iteration.
In the third phase, each processor should compute the same local copy of
the global energy difference $\Delta H_i$ for global termination detection. Each
processor $p$ can concurrently compute its local energy difference
$\Delta H_{ip} = \phi_{ip} \Delta s_{ip} = \phi_{ip} (s_{ip}^{new} - s_{ip}^{old})$ due to its local spin update. Then, a GSUM operation, which
requires global interprocessor communication, is required for each processor to
compute a local copy of the global sum $\Delta H_i = \sum_{p=1}^{K} \Delta H_{ip}$.
Hence, the proposed parallel MFA algorithm necessitates three global communication
operations: the GCOL operation involved during the first
phase and the two GSUM operations involved in the second and third phases. In
fine grain multicomputers, the volume of interprocessor communication is the
important factor in predicting the complexity of the interprocessor communication
overhead. However, in medium grain multicomputers the number of
communications is also important since a high set-up time overhead is associated
with each communication step. For example, set-up time is the dominating factor
for short messages in such architectures. Note that only a single floating point
variable, representing the running sum, is communicated during the GSUM
operations involved in the last two phases of the parallel MFA algorithm.
Hence, reducing the number of GSUM operations required in the MFA 
algorithm will be a valuable asset in achieving efficient implementations on 
medium grain multicomputers. As seen in Eq. (3.9), there is an execution 
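dependency between the computation of the energy difference $\Delta H_i$ and spin-row
updates. This execution dependency between the second and the third
phases can be removed by writing the energy difference as
$$\Delta H_i = \sum_{p=1}^{K} \phi_{ip} (s_{ip}^{new} - s_{ip}^{old}) = \sum_{p=1}^{K} \phi_{ip} s_{ip}^{new} - \sum_{p=1}^{K} \phi_{ip} s_{ip}^{old} = H_i^{new} - H_i^{old}$$
where $H_i = \sum_{p=1}^{K} \phi_{ip} s_{ip}$ is the partial energy contribution to the total energy
$H$ due to the spin values at the $i$-th row (i.e. $H = \sum_i H_i$). The expression
for the partial energy $H_i$ can be expanded as
$$H_i^{new} = \sum_{p=1}^{K} \phi_{ip} s_{ip}^{new} = \sum_{p=1}^{K} \phi_{ip} \frac{e^{\phi_{ip}/T}}{\sum_{q=1}^{K} e^{\phi_{iq}/T}} = \frac{\sum_{p=1}^{K} \phi_{ip} e^{\phi_{ip}/T}}{\sum_{p=1}^{K} e^{\phi_{ip}/T}} = \frac{B_i}{A_i}$$
where $A_i = \sum_{p=1}^{K} a_{ip} = \sum_{p=1}^{K} e^{\phi_{ip}/T}$ and $B_i = \sum_{p=1}^{K} b_{ip} = \sum_{p=1}^{K} \phi_{ip} e^{\phi_{ip}/T}$.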
Hence, after each processor $p$ computes its local $a_{ip} = e^{\phi_{ip}/T}$ and $b_{ip} = \phi_{ip} e^{\phi_{ip}/T}$
values, the two global summations $A_i = \sum_{p=1}^{K} a_{ip}$ and $B_i = \sum_{p=1}^{K} b_{ip}$
can be accumulated in a single GSUM operation. After this single GSUM
operation, each processor $p$ can concurrently update its local spin value and
compute its new partial energy value as $s_{ip}^{new} = a_{ip}/A_i$ and $H_i^{new} = B_i/A_i$. If
each processor keeps the partial energy $H_i^{old}$ associated with each row then
each processor may concurrently compute the same local copy of the global
total energy difference $\Delta H = \Delta H_i = H_i^{new} - H_i^{old}$. Note that this scheme
reduces the number of GSUM operations from two to one. However, the volume
of interprocessor communication remains the same since two floating point
variables, representing the running sums $A_i$ and $B_i$, are communicated during
the communication steps involved in the GSUM operation.
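The fused summation can be checked numerically with a small sequential simulation in which each list entry stands for the value held by one processor. This is a sketch under assumptions: the function name is hypothetical, $H_i^{old}$ is recomputed here instead of being stored per row, and no overflow protection is applied to the exponentials.

```python
import math


def fused_row_update(phi_row, old_spins, temperature):
    """Return (new_spins, delta_h) using one summation over (a_ip, b_ip) pairs."""
    # Local work on each "processor" p: a_ip and b_ip.
    pairs = [(math.exp(phi / temperature), phi * math.exp(phi / temperature))
             for phi in phi_row]
    # Single GSUM: both running sums travel together.
    total_a = sum(a for a, _ in pairs)
    total_b = sum(b for _, b in pairs)
    new_spins = [a / total_a for a, _ in pairs]
    h_new = total_b / total_a
    # Assumption: the stored H_i^old is recomputed here for self-containment.
    h_old = sum(phi * s for phi, s in zip(phi_row, old_spins))
    return new_spins, h_new - h_old


# Hypothetical mean field values for K = 3 processors.
phi = [0.2, -0.5, 1.0]
old = [1.0 / 3.0] * 3
new, delta_h = fused_row_update(phi, old, 0.7)
```

The value of `delta_h` obtained this way matches the sum of the per-processor terms $\phi_{ip}(s_{ip}^{new} - s_{ip}^{old})$ that the two-GSUM formulation would accumulate.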
The node program for a single iteration of the parallel MFA algorithm
proposed for solving the mapping problem is given in Figure 3.3. Note that
variables with “ip” and “p” subscripts denote the local variables. Variables with
“i” subscripts denote the global variables which are constructed and duplicated
at the local memory of each processor after performing the indicated global
operations. The proposed parallel algorithm can easily be implemented on any
multicomputer having the GCOL and GSUM facilities.
As is seen in Figure 3.3, the proposed parallel MFA algorithm achieves
perfect load balance. The parallel computational complexity of a single MFA
iteration can be obtained as follows. During the parallel computation of the $\lambda_{ip}$'s
(step 2) each processor performs $N - 1$ ($d_i - 1$) multiplication/addition operations
for dense (sparse) TIGs. Here, $d_i$ denotes the degree of vertex $i$ in the
TIG. During the parallel SA-MVP computation (step 4) each processor performs
$K$ multiplication/addition operations for both dense and sparse TIGs
since the $D$ matrix is a dense matrix. Each processor performs the same constant
amount of arithmetic operations in the remaining steps (steps 5-7 and
1. Select a task $i$ at random.
2. Compute $\lambda_{ip} = \sum_{j \in Adj(i)} e_{ij} s_{jp}$
3. Perform GCOL operation to obtain a local copy of
   $\Lambda_i = [\lambda_{i1}, \ldots, \lambda_{ip}, \ldots, \lambda_{iK}]$
4. Compute the inner product $\theta_{ip} = d_p \times \Lambda_i$
5. Compute $\psi_{ip} = w_i (\gamma_p - w_i s_{ip})$
6. Compute the local mean field value $\phi_{ip} = -(\theta_{ip} + r\psi_{ip})$
7. Compute $a_{ip} = e^{\phi_{ip}/T}$ and $b_{ip} = \phi_{ip} e^{\phi_{ip}/T}$
8. Perform GSUM to compute the local copies of
   $A_i = \sum_{p=1}^{K} a_{ip}$ and $B_i = \sum_{p=1}^{K} b_{ip}$
9. Compute $s_{ip}^{new} = a_{ip}/A_i$ and then $\Delta s_{ip} = s_{ip}^{new} - s_{ip}$
10. Compute $H_i^{new} = B_i/A_i$ and then $\Delta H_i = H_i^{new} - H_i$
11. Update $\gamma_p = \gamma_p + w_i \Delta s_{ip}$
12. Update $s_{ip} = s_{ip}^{new}$ and $H_i = H_i^{new}$
Figure 3.3. Node program for one iteration of the parallel MFA algorithm for
the mapping problem.
steps 9-12). Hence, the parallel computational complexity of the proposed algorithm
is $O(N + K)$ and $O(d_{avg} + K)$ for dense and sparse TIGs respectively.
Hence, linear speed-up can easily be achieved if the communication overhead remains
negligible. The communication complexity due to the GCOL (step 3)
and GSUM (step 8) operations is discussed in general in the following paragraph.
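The node program of Figure 3.3 can be mimicked sequentially by looping over the $K$ per-processor computations. The sketch below is illustrative only: the data layout (`neighbours[i]` as a list of `(j, e_ij)` pairs, `dist` as the dense distance matrix $D$), the function name and the sample instance are assumptions, and the stored row energies are initialised to zero purely for brevity.

```python
import math


def mfa_iteration(i, spins, gamma, row_energy, neighbours, weights, dist, r, temp):
    """One sequential pass over the K per-processor computations for task i."""
    num_procs = len(gamma)
    # Step 2: "processor" p forms lambda_ip from the neighbours of task i.
    lam = [sum(e_ij * spins[j][p] for j, e_ij in neighbours[i])
           for p in range(num_procs)]
    # Step 3 (GCOL): `lam` already is the vector every processor would hold.
    a, b = [], []
    for p in range(num_procs):
        theta = sum(dist[p][q] * lam[q] for q in range(num_procs))  # step 4
        psi = weights[i] * (gamma[p] - weights[i] * spins[i][p])     # step 5
        phi = -(theta + r * psi)                                     # step 6
        a.append(math.exp(phi / temp))                               # step 7
        b.append(phi * a[-1])
    total_a, total_b = sum(a), sum(b)                                # step 8 (GSUM)
    new_energy = total_b / total_a                                   # step 10
    delta_h = new_energy - row_energy[i]
    for p in range(num_procs):
        s_new = a[p] / total_a                                       # step 9
        gamma[p] += weights[i] * (s_new - spins[i][p])               # step 11
        spins[i][p] = s_new                                          # step 12
    row_energy[i] = new_energy
    return delta_h


# Hypothetical instance: a 3-task chain mapped onto 2 processors.
weights = [1, 2, 1]
neighbours = {0: [(1, 2)], 1: [(0, 2), (2, 1)], 2: [(1, 1)]}
dist = [[0, 1], [1, 0]]
spins = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
gamma = [sum(w * row[p] for w, row in zip(weights, spins)) for p in range(2)]
row_energy = [0.0, 0.0, 0.0]  # placeholder initial values, for brevity only
mfa_iteration(1, spins, gamma, row_energy, neighbours, weights, dist, 0.5, 1.0)
```

In this instance task 1 is pulled towards processor 0, where its heavier neighbour (task 0) mostly resides, while the updated row still sums to one.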
The interconnection schemes used in the processor organization of
multicomputers are usually symmetric in nature (i.e. the PCG is symmetric). GSUM
and GCOL type operations in such architectures are performed in two phases.
In the first phase, a sequence of concurrent single-hop communications is performed
to accumulate or collect the result in a root processor. In the second
phase, the final result is broadcast from this root processor, again using a sequence
of concurrent single-hop communications. The number of concurrent
single-hop communications in each phase will be proportional to the diameter of
the PCG. For example, the diameters of hypercube and mesh PCGs are $\log_2 K$ and
on the order of $\sqrt{K}$, respectively. The overall concurrent volume of communications will be
proportional to the diameter and to the number of processors ($K$) in both phases of the
GSUM and GCOL operations, respectively. If a full-duplex pair of communication
links is used between each pair of directly connected processors (e.g.
Intel’s iPSC/2) then such global operations are performed in a single phase by a
sequence of concurrent single-hop exchange communications. In such an architecture,
the number of concurrent single-hop communications and the overall
volume of concurrent communication in GSUM and GCOL operations can be
reduced by a factor of two.
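The single-phase, exchange-based global sum on a hypercube can be simulated as follows. This is a sketch of the standard dimension-exchange (recursive doubling) pattern, offered as one plausible realisation of the GSUM described above; the function name is hypothetical and the list index plays the role of the processor label.

```python
def hypercube_gsum(local_values):
    """Simulate a global sum by pairwise exchange along each hypercube dimension.

    Returns (values, steps): after `steps` exchange rounds every processor
    holds the same total.
    """
    num_procs = len(local_values)
    assert num_procs > 0 and num_procs & (num_procs - 1) == 0, "K must be a power of two"
    values = list(local_values)
    steps = 0
    bit = 1
    while bit < num_procs:
        # Processors p and p XOR bit swap partial sums over one link and add.
        values = [values[p] + values[p ^ bit] for p in range(num_procs)]
        bit <<= 1
        steps += 1
    return values, steps


totals, rounds = hypercube_gsum([1, 2, 3, 4, 5, 6, 7, 8])
```

For $K = 8$ the sum is available on every processor after $\log_2 K = 3$ exchange steps, without a separate broadcast phase.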
4. MFA FOR THE CIRCUIT 
PARTITIONING PROBLEM
This chapter presents formulation of Mean Field Annealing (MFA) for solving 
the circuit partitioning problem. Section 4.1 describes the circuit partition­
ing problem, and summarizes the previous works on the circuit partitioning 
problem. In Section 4.2 the circuit partitioning problem is modeled as the 
graph partitioning problem and the network partitioning problem. Section 4.3
presents the formulation of MFA for the graph partitioning problem and the
network partitioning problem. MFA algorithms proposed for solving the graph
partitioning problem and the network partitioning problem are parallelized as
is described in Section 4.4.
4.1 The Circuit Partitioning Problem
Partitioning of a VLSI circuit, which is defined with its components and signal
nets, is an extensively studied problem. Partitioning means to divide the
components of a circuit into two or more evenly weighted partitions, such that
the number of signal nets interconnecting them is minimized. This problem,
called the circuit partitioning problem, arises while dividing a circuit into parts
that will be implemented separately. In some layout problems, like placement
and floor-planning, divide-and-conquer algorithms, which necessitate dividing
up the circuits hierarchically into parts with different minimization criteria, 
are used. Circuit partitioning is also needed within these algorithms [20]. The 
circuit partitioning problem first appeared because of the need for partitioning 
components of electronic circuits to circuit boards, minimizing the connections
between boards. A heuristic for solving this problem is given in the seminal
paper by Kernighan and Lin [17]. In this work, the circuits are represented as
graphs and the problem is treated as the graph partitioning problem. In a later 
work by Schweikert and Kernighan [27, 37], deficiencies of using graph model 
are illustrated, and a new model called net-cut circuit model is proposed. The 
problem of partitioning circuits using this representation is called the network 
partitioning problem.
As both of the mentioned problems (the graph partitioning problem and the 
network partitioning problem) are proved to be NP-hard [8], finding efficient 
heuristics for them is an important issue. Various heuristics, e.g., Kernighan-Lin
like algorithms [7], Simulated Annealing (SA) etc., are proposed and
implemented for solving these problems [20]. In this chapter, the Mean Field
Annealing (MFA) algorithm is formulated for the circuit partitioning problem.
Algorithms used for solving the circuit partitioning problem are time consuming
processes, and parallelization of them is crucial. In this chapter,
parallelization of MFA algorithms for solving the circuit partitioning problem on
distributed-memory, message-passing multicomputers is also addressed.
4.2 Modeling the Circuit Partitioning Problem
An instance of the circuit partitioning problem consists of a set of weighted
components and a list of nets which defines the connection relationships among
these components. Nets can also be weighted; but, as this does not change the
nature of the problem, we assume the weights of the nets to be unity. An
example instance of the circuit partitioning problem is given below.






net 1 : a-b-c-d 
net 2 : d-e
The problem is to divide the given circuit into $M$ ($M \ge 2$) equally weighted
partitions, while minimizing the number of external connections among partitions.
In the Schweikert and Kernighan algorithm [27, 37], external lines are
reduced based on the following criteria:
1) When all components of the same net are in the same block,
moving any one of the components to another block will create an
additional external line.
2) If a net has all its components in a block except one component,
moving that component to the same block will remove the net from
the cut.
3) If components of a net are in more than one block, the number of
external connections does not change by moving components of the
net among these blocks, if the number of blocks over which the net is
distributed does not change.
In order to transform the given circuit partitioning problem instance to a
graph partitioning problem instance, each net is represented by a clique of its
terminals. The resulting graph instance is shown in Figure 4.1(a). Observe that
this representation changes the structure of the connections in the given circuit.
Representation of the given instance as a network is given in Figure 4.1(b). A
network consists of a set of components called cells and a set of signal nets (or
simply nets). A net is a subset of the set of cells. This representation exactly
captures the connection relationships among components.
In order to show the deficiency of the graph model, the partitions indicated
with dashed lines in Figure 4.1 will be examined. Observe that, in Figure 4.1(a), the
cut size is equal to 5. In Figure 4.1(b), it is 2, which is the actual cut size. The
cost contribution of a unit cost net across a cut of a bipartition is 1. The cost
contribution of a clique that is evenly split across a cut rises quadratically
with the size of the clique. This quadratic growth does not adequately reflect
the costs arising in practice [20]. Although there have been some attempts to
resolve this dilemma, there is no good way of mapping a circuit instance into a
graph.
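The difference between the two models can be checked with a few lines of Python. The sketch below expands the two nets of the example instance into cliques and compares the graph cut with the net cut. The bipartition {a, b, e} | {c, d} is an assumption made here for illustration (the dashed partition of Figure 4.1 is not reproduced in the text); it is chosen because it yields the quoted cut sizes. The helper names are hypothetical.

```python
from itertools import combinations

# Nets of the example instance in Section 4.2 (unit net weights).
nets = {"net1": ["a", "b", "c", "d"], "net2": ["d", "e"]}


def clique_edges(nets):
    """Graph model: replace every net by a clique over its cells."""
    edges = set()
    for cells in nets.values():
        for u, v in combinations(sorted(cells), 2):
            edges.add((u, v))
    return edges


def graph_cut(edges, block_of):
    """Number of clique edges whose endpoints lie in different blocks."""
    return sum(1 for u, v in edges if block_of[u] != block_of[v])


def net_cut(nets, block_of):
    """Network model: number of nets that span more than one block."""
    return sum(1 for cells in nets.values()
               if len({block_of[c] for c in cells}) > 1)


def even_split_clique_cut(k):
    """Cut size of a k-cell clique split evenly across a bipartition."""
    block_of = {c: c % 2 for c in range(k)}
    return graph_cut(clique_edges({"n": list(range(k))}), block_of)


# Assumed bipartition (not taken from the figure): {a, b, e} | {c, d}.
block_of = {"a": 0, "b": 0, "e": 0, "c": 1, "d": 1}
edges = clique_edges(nets)
print(graph_cut(edges, block_of), net_cut(nets, block_of))  # 5 2
print([even_split_clique_cut(k) for k in (4, 8, 16)])       # [4, 16, 64]
```

The last line illustrates the quadratic growth: a single net of k cells costs 1 in the network model, while its evenly split clique costs (k/2)^2.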




Figure 4.1. Modeling of a given circuit partitioning problem instance with (a) 
graph and (b) network models. Dashed lines indicate an example partition.
4.3 Solving the Circuit Partitioning Problem Using MFA
In this section, formulation of MFA for the circuit partitioning problem, using 
two different models is given. Graph and network models are described in the 
following two sections respectively.
4.3.1 Graph Model
If the graph model is used for the representation of the circuit partitioning
problem, the problem can be treated as the graph partitioning problem. A
formal definition of the graph partitioning problem is as follows: A graph
$G = (V, E)$ with $|V| = N$ vertices $(1, 2, \ldots, i, j, \ldots, N)$, vertex weights
$(w_1, w_2, \ldots, w_i, w_j, \ldots, w_N)$, and edges $E$ between vertices with weights $e_{ij}$ is
given. The problem is to divide the graph into $M$ partitions of nearly equal
weights such that the cut size is minimized.
Similar formulations of MFA for partitioning fully connected graphs are
given in [4, 21, 35]. However, graphs arising in circuit partitioning are usually
sparse. In order to avoid redundant computation, the algorithm is modified
to work for sparse graphs. As in the previous works [4, 21, 35], a spin (i.e.
neuron) matrix which consists of $N$ vertex-rows and $M$ partition-columns is
used as a representation scheme. The output $s_{ip}$ of a spin $(i, p)$ denotes the
probability of finding vertex $i$ in partition $p$ ($1 \leq p \leq M$).
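We propose the following energy function for sparse graphs, where $Adj(i)$
denotes the set of vertices connected to vertex $i$:

$$H(s) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j \in Adj(i)} \sum_{p=1}^{M} e_{ij}\, s_{ip} (1 - s_{jp}) + \frac{r}{2} \sum_{p=1}^{M} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} s_{ip}\, s_{jp}\, w_i w_j \qquad (4.1)$$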
Here, $(1 - s_{jp})$ denotes the probability of vertex $j$ being in a partition other
than partition $p$. Hence, $s_{ip} \times (1 - s_{jp})$ denotes the probability of vertex $i$
being in partition $p$ and vertex $j$ in a different partition. Then, the term $e_{ij} \times
s_{ip} \times (1 - s_{jp})$ denotes the cost contribution of edge $(i, j)$ to the cut size by
mapping vertices $i$ and $j$ to different partitions. As the first summation term in
Eq. (4.1) covers all vertices and all partitions, it represents the total cut size of
a partitioning represented by the values of the spins in the spin matrix. Hence,
this summation term is used for minimizing the weighted sum of edges on the
cut. The second triple summation term in Eq. (4.1) computes the summation of
the inner products of the weights of the vertices in each partition. This term
will have the global minimum value only when the summations of the weights
of the vertices in each partition are equal. The parameter $r$ in Eq. (4.1) is
introduced to maintain a balance between the two optimization objectives of
the original graph partitioning problem.
Using the mean field approximation given in Eq. (2.8), the mean field of a spin
$(i, p)$ for the energy function defined in (4.1) can be computed as

$$\phi_{ip} = -\sum_{j \in Adj(i)} e_{ij} (1 - s_{jp}) - r \sum_{\substack{j=1 \\ j \neq i}}^{N} s_{jp}\, w_i w_j \qquad (4.2)$$
In this equation, the first summation term shows the rate of increase in the cut
size by placing vertex $i$ in partition $p$. The second summation term shows the rate
of increase in the cost term, introduced for balancing the partitions, by placing
vertex $i$ in partition $p$.
The probability that vertex $i$ is in partition $p$ is then normalized as follows:

$$s_{ip} = \frac{e^{\phi_{ip}/T}}{\sum_{q=1}^{M} e^{\phi_{iq}/T}} \qquad (4.3)$$
Note that, this normalization guarantees that each vertex is included in only 
one partition.
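A minimal sketch of this normalization for one spin row is given below. It assumes the usual MFA form, in which each spin is the exponential of its mean field divided by the temperature $T$, normalized over the $M$ partitions; the function name and the sample mean field values are hypothetical. Subtracting the largest mean field before exponentiating is a numerical safeguard added here and does not change the result.

```python
import math


def normalize_spins(phi_row, T):
    """Return s_ip = exp(phi_ip / T) / sum_q exp(phi_iq / T) for one vertex."""
    shift = max(phi_row)  # shift for numerical stability; cancels in the ratio
    a = [math.exp((phi - shift) / T) for phi in phi_row]
    total = sum(a)
    return [x / total for x in a]


phi_row = [-1.0, -2.0, -4.0]  # hypothetical mean fields of one vertex, M = 3
for T in (10.0, 1.0, 0.1):
    row = normalize_spins(phi_row, T)
    print(T, [round(x, 3) for x in row])
```

Each row sums to one at every temperature; as $T$ decreases, the row approaches a 0/1 assignment of the vertex to the partition with the largest mean field.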
The MFA algorithm for the graph partitioning problem is similar to the MFA
algorithm for the mapping problem, which is described in Chapter 3, except for the
mean field computations. Mean fields of spins are computed using Eq. (4.2) in the
MFA algorithm for the graph partitioning problem. Note that the second term in
Eq. (4.2) is the same as the second term in the mean field equation of the MFA
algorithm for the mapping problem (Eq. (3.5)). Hence, this term can be
computed in constant time ($O(1)$), for each mean field computation, as described
in Section 3.3.2 by defining $\gamma_p$ as
$$\gamma_p = \sum_{j=1}^{N} s_{jp}\, w_j \qquad (4.4)$$
Then, Eq. (4.2) can be rewritten as

$$\phi_{ip} = -\sum_{j \in Adj(i)} e_{ij} (1 - s_{jp}) - r\, w_i (\gamma_p - w_i s_{ip}) \qquad (4.5)$$
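The rewriting of the balance term rests on a one-line identity, written out here as a sketch with $\gamma_p$ taken as the weighted column sum of Eq. (4.4): the excluded $j = i$ term is added to the sum and subtracted again.

```latex
\begin{align*}
\sum_{\substack{j=1 \\ j \neq i}}^{N} s_{jp}\, w_i w_j
  &= w_i \Bigl( \sum_{j=1}^{N} s_{jp}\, w_j - s_{ip}\, w_i \Bigr) \\
  &= w_i \bigl( \gamma_p - w_i s_{ip} \bigr).
\end{align*}
```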
Note that $\gamma_p$ represents the weighted sum of spin values of the $p$-th column of the
spin matrix. Hence, the initial $\gamma_p$ value of each column $p$ ($1 \leq p \leq M$) can be
computed by using Eq. (4.4) for the initial spin values. Then, $\gamma_p$ values can be
updated at the end of each iteration (i.e. after spin updates) by using

$$\gamma_p^{new} = \gamma_p^{old} + w_i \left( s_{ip}^{new} - s_{ip}^{old} \right) \qquad (4.6)$$

for $1 \leq p \leq M$.
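The bookkeeping described above can be sketched as follows. The code evaluates the mean field once with the balance sum computed directly and once with the cached column sums, and applies the incremental update after a spin row changes. The toy graph, the weights, the value of r and all function names are assumptions made for illustration, not the thesis implementation.

```python
import random


def mean_field_direct(i, p, s, w, adj, r):
    """Mean field with both sums evaluated directly: O(deg(i) + N)."""
    cut = sum(e_ij * (1.0 - s[j][p]) for j, e_ij in adj[i].items())
    bal = sum(s[j][p] * w[i] * w[j] for j in range(len(w)) if j != i)
    return -cut - r * bal


def mean_field_cached(i, p, s, w, adj, r, gamma):
    """Same value, with the balance term taken from gamma[p]: O(deg(i))."""
    cut = sum(e_ij * (1.0 - s[j][p]) for j, e_ij in adj[i].items())
    return -cut - r * w[i] * (gamma[p] - w[i] * s[i][p])


def update_gamma(gamma, w_i, s_old_row, s_new_row):
    """Incremental update: gamma_p += w_i * (s_new - s_old) for each column."""
    for p in range(len(gamma)):
        gamma[p] += w_i * (s_new_row[p] - s_old_row[p])


# Hypothetical toy instance: a weighted path on 4 vertices, M = 2 partitions.
random.seed(0)
N, M, r = 4, 2, 0.5
w = [1.0, 2.0, 1.0, 3.0]
adj = {0: {1: 1.0}, 1: {0: 1.0, 2: 2.0}, 2: {1: 2.0, 3: 1.0}, 3: {2: 1.0}}
s = []
for _ in range(N):
    x = random.random()
    s.append([x, 1.0 - x])

# Initial column sums, then one spin-row update followed by the O(M) refresh.
gamma = [sum(s[j][p] * w[j] for j in range(N)) for p in range(M)]
old_row = s[1][:]
s[1] = [0.9, 0.1]
update_gamma(gamma, w[1], old_row, s[1])
```

Keeping the column sums up to date costs $O(M)$ per iteration, which is why the balance term does not add to the per-spin cost.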
Computation of the first term in Eq. (4.2) is $O(d_{avg})$, where $d_{avg}$ denotes
the average degree of the vertices of the graph $G(V, E)$. Then, the complexity of
mean field computations for a spin row is $O(M \times (d_{avg} + 1)) = O(M \times d_{avg})$.
Complexities of the spin update computations and the energy difference computation
performed at each iteration of the MFA algorithm are both $O(M)$. Hence, the
overall complexity of a single MFA iteration for the graph partitioning problem
is $O(M \times d_{avg})$.
Performance of the MFA algorithm for solving the graph partitioning problem
in comparison with SA and Kernighan-Lin heuristics is extensively studied
in [21, 35]. Results obtained using MFA are very encouraging, comparable to
results obtained by SA and Kernighan-Lin heuristics.
4.3.2 Network Model
In this section, a suitable mapping of MFA to the network partitioning problem
is proposed. With this mapping, disadvantages of using the graph model to
represent a circuit partitioning problem instance are avoided. Following is a
formal definition of the network partitioning problem. A network with $N$ cells
$(1, 2, \ldots, i, j, \ldots, N)$, cell weights $(w_1, w_2, \ldots, w_i, w_j, \ldots, w_N)$, and a list of nets
$(n_1, n_2, \ldots)$ with weights $(w_{n_1}, w_{n_2}, \ldots)$ is given. The problem is to partition
the network into $M$ partitions of nearly equal weights such that the cut size is
minimized.
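The following energy function is proposed for the network partitioning problem:

$$H(s) = \frac{1}{2} \sum_{i=1}^{N} \sum_{p=1}^{M} \sum_{\substack{q=1 \\ q \neq p}}^{M} \sum_{n \in N_i} \max\{s_{jq}\,(j \in n)\}\; s_{ip}\, w_n + \frac{r}{2} \sum_{p=1}^{M} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} s_{ip}\, s_{jp}\, w_i w_j \qquad (4.7)$$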
where $N_i$ denotes the set of nets connected to cell $i$ and $\max\{S\}$ denotes the
maximum value in set $S$. In Eq. (4.7), $\{s_{jq}\,(j \in n)\}$ indicates the set of spin values
which denote the probabilities of finding the cells $j \in n$ (cells belonging to
the net $n$) in partition $q$. Hence, $\max\{s_{jq}\,(j \in n)\}$ denotes the maximum spin
value among the indicated set of spin values. Then, the term $\max\{s_{jq}\,(j \in n)\} \times
s_{ip} \times w_n$ shows the cost contribution of net $n$ to the cut size by putting cell $i$ in
partition $p$ and at least one of the cells in net $n$ in another partition. With these
observations it can be seen that the first summation term in Eq. (4.7) represents
the total cut size caused by the nets whose cells are in more than one partition.
The second summation term in Eq. (4.7) is the same as the second summation term in
Eq. (4.1), and maintains the weight balance among partitions.
As described in Chapter 2, the mean field of a spin is calculated by taking the
partial derivative of the energy function with respect to the expected value of
that spin. The energy function defined by Eq. (4.7) is not differentiable because of
the $\max\{\}$ function. If the mean field of a spin is interpreted intuitively as the
effect of the values of the other spins on the value of that spin, then the mean field
of a spin $(i, p)$ due to Eq. (4.7) can be written as
$$\phi_{ip} = -\sum_{\substack{q=1 \\ q \neq p}}^{M} \sum_{n \in N_i} \max\{s_{jq}\,(j \in n)\}\, w_n - r \sum_{\substack{j=1 \\ j \neq i}}^{N} s_{jp}\, w_i w_j \qquad (4.8)$$
Note that, in this equation first term shows the rate of increase in the cut size 
by placing vertex i in partition p. Second summation term is similar to the 
term in Eq. (4.2) and has the same meaning as described above.
The normalization operation (i.e. normalization of the spin values) remains
the same as in the formulation of the graph partitioning problem.
The three MFA algorithms given for the mapping problem, the graph
partitioning problem and the network partitioning problem are the same except for the
mean field computations, which constitute the problem specific part of the
MFA algorithms. Mean field computations in the MFA algorithm for the
network partitioning problem are performed using Eq. (4.8). The second term in
Eq. (4.8) is computed efficiently in constant time for each mean field
computation as described in the previous section for the graph partitioning
problem. Observe that the complexity of computing the first term in Eq. (4.8) is
$O(M \times c \times (s - 1)) = O(M \times c \times s)$ for each mean field computation, where
$M$ is the number of partitions, $c$ is the average number of nets to which a cell
is connected, and $s$ is the average size of a net (the size of a net is the number
of cells in the net). Note that $c \times (s - 1)$ corresponds to the average degree of a
vertex in the graph model (i.e., $c \times (s - 1) = d_{avg}$). At each iteration of the
MFA algorithm $M$ spins are updated; hence, $M$ mean field computations are
performed. Then, the complexity of mean field computations in a single iteration
of the MFA algorithm is $O(M \times (M \times c \times s + 1)) = O(M^2 \times c \times s)$. However,
this complexity can be reduced using the following observation. Eq. (4.8) can
be rewritten as
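$$\begin{aligned}
\phi_{ip} &= -\Bigl( \sum_{q=1}^{M} \sum_{n \in N_i} w_n \max\{s_{jq}\,(j \in n)\} - \sum_{n \in N_i} w_n \max\{s_{jp}\,(j \in n)\} \Bigr) - r \sum_{\substack{j=1 \\ j \neq i}}^{N} s_{jp}\, w_i w_j \\
&= -(\psi_i - \psi_{ip}) - r \sum_{\substack{j=1 \\ j \neq i}}^{N} s_{jp}\, w_i w_j
\end{aligned} \qquad (4.9)$$

where

$$\psi_i = \sum_{q=1}^{M} \sum_{n \in N_i} w_n \max\{s_{jq}\,(j \in n)\} \qquad (4.10)$$

$$\psi_{ip} = \sum_{n \in N_i} w_n \max\{s_{jp}\,(j \in n)\} \qquad (4.11)$$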
Values $\psi_i$ and $\psi_{ip}$ given in Eq. (4.10) and Eq. (4.11) can be computed together
in $O(M \times c \times s)$ at the beginning of each iteration of the MFA algorithm. Hence, the
complexity of mean field computations for a spin row is $O(M \times c \times s + M) =
O(M \times c \times s)$. Complexities of the spin update computations and the energy difference
computation performed at each iteration of the MFA algorithm are both $O(M)$.
Then the complexity of one iteration of the MFA algorithm for the network
partitioning problem is $O(M \times c \times s)$.
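The saving can be illustrated with a short sketch. It computes the per-partition net terms once per spin row, sums them, and derives all $M$ mean fields from the two quantities; a literal double sum over the other partitions serves as a cross-check. The spin values, weights, value of r and helper names are assumptions made for illustration; the two nets are those of the example in Section 4.2 with cells numbered 0 to 4.

```python
def build_nets_of(nets, num_cells):
    """For every cell, the indices of the nets that contain it."""
    return [[n for n, cells in enumerate(nets) if i in cells]
            for i in range(num_cells)]


def psi_terms(i, s, nets, net_w, nets_of, M):
    """Return (psi_i, [psi_ip for each p]); the net maxima are computed once."""
    psi_ip = [sum(net_w[n] * max(s[j][p] for j in nets[n]) for n in nets_of[i])
              for p in range(M)]
    return sum(psi_ip), psi_ip


def mean_field_direct(i, p, s, w, nets, net_w, nets_of, r, M):
    """Literal evaluation: sum over all other partitions q for this p."""
    cut = sum(net_w[n] * max(s[j][q] for j in nets[n])
              for q in range(M) if q != p for n in nets_of[i])
    bal = sum(s[j][p] * w[i] * w[j] for j in range(len(w)) if j != i)
    return -cut - r * bal


def mean_fields_row(i, s, w, nets, net_w, nets_of, r, M, gamma):
    """All M mean fields of cell i from psi_i, psi_ip and cached gamma."""
    psi_i, psi_ip = psi_terms(i, s, nets, net_w, nets_of, M)
    return [-(psi_i - psi_ip[p]) - r * w[i] * (gamma[p] - w[i] * s[i][p])
            for p in range(M)]


# Hypothetical data: cells a..e numbered 0..4, M = 3 partitions.
nets = [[0, 1, 2, 3], [3, 4]]
net_w = [1.0, 1.0]
w = [1.0, 2.0, 1.0, 1.0, 3.0]
M, r = 3, 0.5
s = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4],
     [0.5, 0.25, 0.25], [0.2, 0.2, 0.6]]
nets_of = build_nets_of(nets, len(w))
gamma = [sum(s[j][p] * w[j] for j in range(len(w))) for p in range(M)]
row = mean_fields_row(3, s, w, nets, net_w, nets_of, r, M, gamma)
```

The row-wise version touches each net of the cell once per partition, instead of once per pair of partitions.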
In order to demonstrate the effectiveness of the network model, the behavior 
of the energy function defined in MFA will be examined. Two possible solutions


























Figure 4.2. Two possible solutions for the given circuit partitioning problem
instance.
to the instance given in Section 4.2 are illustrated in Figure 4.2 as $A = \{a, b, e\}$,
$B = \{c, d\}$ and $A = \{a, b, d\}$, $B = \{c, e\}$, where $A$ and $B$ denote the two
partitions. Neuron matrix representations of these solutions are also given in
Figure 4.2 using a $5 \times 2$ spin matrix.
The energy values of the two states of the spin matrix defined by Solution 1
and 2 are computed for the graph model (using Eq. (4.1)) as //j = /1 x 5 -|- 5 
and H-i = /1 x 4  + 5 respectively. The energy values computed for the network 
model (using Eq. (4.7)) are //| = //^ = /1 x .3 + 5. In graph model, second 
solution is favored over the first solution, but it can be seen that the
actual cut sizes are equal in both solutions. So, in the graph model, some solutions
are favored over others although they have the same quality, meaning that
some features of the circuit partitioning problem are not represented correctly.
However, in the network model the energies of the two solutions are the same ($H_1 = H_2$),
which gives the desired result. Hence, it can be concluded that the network model
is a better scheme for mapping the circuit partitioning problem to MFA.
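The cut-related parts of the two energy functions can be evaluated on 0/1 spin matrices with the sketch below. Several points are assumptions made here: the two solutions are read as {a, b, e} | {c, d} and {a, b, d} | {c, e}, the clique edges have unit weight, and both cut terms carry a factor 1/2. Under this reading the graph model yields cut terms 5 and 4, while the network model yields 3 for both solutions. The balance terms are left out because the cell weights of the instance are not restated here.

```python
from itertools import combinations

cells = ["a", "b", "c", "d", "e"]
nets = [["a", "b", "c", "d"], ["d", "e"]]  # nets of Section 4.2, unit weights


def spins(blocks):
    """0/1 spin matrix as a dict: cell -> [s_A, s_B]."""
    return {c: [1.0 if c in blk else 0.0 for blk in blocks] for c in cells}


def graph_cut_energy(s):
    """Cut term of the graph model on the clique expansion of the nets."""
    total = 0.0
    for net in nets:
        for i, j in combinations(net, 2):
            for p in range(2):
                # Ordered pairs (i, j) and (j, i), weighted by 1/2.
                total += 0.5 * (s[i][p] * (1.0 - s[j][p])
                                + s[j][p] * (1.0 - s[i][p]))
    return total


def network_cut_energy(s):
    """Cut term of the network model, using the max over each net."""
    total = 0.0
    for i in cells:
        for net in nets:
            if i not in net:
                continue
            for p in range(2):
                for q in range(2):
                    if q != p:
                        total += 0.5 * s[i][p] * max(s[j][q] for j in net)
    return total


# Assumed reading of the two solutions of Figure 4.2.
sol1 = spins(({"a", "b", "e"}, {"c", "d"}))
sol2 = spins(({"a", "b", "d"}, {"c", "e"}))
print(graph_cut_energy(sol1), graph_cut_energy(sol2))      # 5.0 4.0
print(network_cut_energy(sol1), network_cut_energy(sol2))  # 3.0 3.0
```

Both solutions cut both nets, so their actual cut sizes are equal; only the network model reflects this.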
The performance of the proposed MFA algorithm for solving the network
partitioning problem is demonstrated in Table 4.1 for three different problem
sizes. MFA is compared with SA and Kernighan-Lin (KL) heuristics. An
efficient variation of the Kernighan-Lin heuristic [7], which is proposed for network
partitioning, is implemented. These heuristics are tested on randomly
generated networks with various numbers of cells ($N$) and nets ($L$), and maximum
net sizes ($s$). In these networks, weights of the cells and nets are taken to
be unity. Networks are partitioned into two bins, and the balance criteria of the
heuristics are set such that differences between the weights of the resulting
bins are less than 5% of the total weight of the cells. As seen in the table, the
performance of MFA is close to SA, and better than KL in some instances.
The execution time of SA is the largest, 120 times that of KL on the average. MFA
is 60-70 times slower than KL and 2 times faster than SA. The time complexity
of the MFA algorithm used in these experiments was $O(M^2 \times c \times s + N \times M)$.
In [35], using the notion of critical temperature, better timings of MFA are
obtained. Probably, by determining the critical temperature, MFA will run much
faster for these instances. The KL heuristic is faster compared with the general
Table 4.1. Mean cut sizes of the solutions found by MFA, KL, and SA heuristics
for randomly generated network partitioning problem instances.

    PROBLEM SIZE          MEAN CUT SIZE
    N      L      s       MFA      SA       KL
    128    205     4       75.3     74.8     77.6
    128    102     8       52.0     49.2     52.4
    128     69    16       44.4     41.5     44.3
    256    543     4      217.9    211.0    217.9
    256    240     8      126.8    124.7    126.2
    256    200    16      139.5    131.4    134.2
    512    784     4      272.0    258.0    273.0
    512    809     8      477.6    471.0    481.4
    512    336    16      215.4    213.6    219.8
heuristics MFA and SA since it is an efficient, problem specific heuristic,
having almost linear time complexity. However, the KL heuristic can only be
used for partitioning networks having nets with bounded weights. The linear time
complexity of the KL heuristic cannot be preserved for other types of networks.
Furthermore, as is described in the following section, the MFA algorithm is more
suitable for parallelization compared with SA and KL heuristics. Hence, these
results demonstrate that the proposed mapping of MFA to the network
partitioning problem is a promising alternative heuristic for solving the network
partitioning problem.
4.4 Parallelization of Mean Field Annealing Algorithm
Efficient parallelization of heuristics used for solving the circuit partitioning
problem is crucial since the circuits arising in practice are quite large in
general. Parallelization schemes for MFA algorithms used for solving the graph
partitioning problem and the network partitioning problem are described in
the following sections.
4.4.1 Graph Model
For parallelization of the algorithm, columns of the spin matrix are partitioned
among processors such that each processor has $M/K$ columns of the spin
matrix. Here, $K$ denotes the number of processors in the target multicomputer.
Hence, each processor is assigned the data and the computations associated
with all $N$ vertices for only $M/K$ partitions. That is, each processor is
assigned $N \times M/K$ spins. This decomposition yields perfect load balance if $M$
is a multiple of $K$ or $M \gg K$. Each processor stores its local column slice of
the global spin matrix in row-wise order for the sake of efficient access to the
spin values. The host processor initializes the spin matrix and sends the node
processors their portions. At each iteration, spin values corresponding to the
selected vertex are updated by computing the mean field value of each spin,
and the difference between the new energy and the old energy is calculated. If the energy
difference is less than a predefined constant for a number of subsequent iterations,
the temperature is decreased, and iteration is started again. Two phases of an MFA
iteration (i.e., mean field computations and energy difference calculation) are
interleaved as described for the mapping problem in Chapter 3. The parallel
algorithm for the node program for a single iteration of the MFA algorithm is given
in Figure 4.3.
In the parallel MFA algorithm for solving the graph partitioning problem,
each processor selects a vertex $i$ at random, where the random sequence in each
processor is the same. Hence, no global communication is necessary for
broadcasting the selected vertex. Then, each processor computes the mean fields
of the randomly selected vertex only for its local partitions. After computing
mean fields of the local spins, two partial summation terms are computed at
steps 3 and 4. Then, a global sum (GSUM) operation is performed at step 5 to
accumulate the overall summations in each processor. Each processor updates
its local spin values at step 6 and computes $\Delta H_i$ at step 7. At step 8, $\gamma_p$ values
are updated. Details of the parallel MFA program for solving the graph
partitioning problem are given in [4]. Note that only one global communication is
needed at each iteration of the algorithm. As is mentioned in Section 3.5, global
communication is performed as a sequence of single-hop exchange
communications. Volume of communication at each exchange step is fixed to 2 floating
1. Select a vertex $i$ at random.
2. For each local partition $p := 1$ to $M/K$ compute mean field values
   $\phi_{ip} = -\sum_{j \in Adj(i)} e_{ij} (1 - s_{jp}) - r\, w_i (\gamma_p - w_i s_{ip})$
3. For each local partition $p := 1$ to $M/K$ compute
   $a_{ip} = e^{\phi_{ip}/T}$ and $b_{ip} = \phi_{ip}\, e^{\phi_{ip}/T}$
4. Compute partial summations
   $A_i = \sum_{p=1}^{M/K} a_{ip}$ and $B_i = \sum_{p=1}^{M/K} b_{ip}$
5. Perform GSUM to compute the local copies of
   $A_i = \sum_{p=1}^{M} a_{ip}$ and $B_i = \sum_{p=1}^{M} b_{ip}$
6. For each local partition $p := 1$ to $M/K$ compute $s_{ip}^{new} = a_{ip}/A_i$ and
   then $\Delta s_{ip} = s_{ip}^{new} - s_{ip}$
7. Compute $H_i^{new} = B_i/A_i$ and then $\Delta H_i = H_i^{new} - H_i$
8. For each local partition $p := 1$ to $M/K$ update $\gamma_p = \gamma_p + w_i \Delta s_{ip}$
9. For each local partition $p := 1$ to $M/K$ update $s_{ip} = s_{ip}^{new}$ and
   $H_i = H_i^{new}$
Figure 4.3. Node program for one iteration of the parallel MFA algorithm for
the graph partitioning problem.
point words, and does not change with increasing problem size. The number
of exchange communication steps in the global summation operation increases
with the diameter of the multicomputer. The diameter of a multicomputer
implementing the hypercube topology is $\log_2 K$; hence, the given parallel algorithm
is expected to scale on the hypercube architecture. Figure 4.4 illustrates the
speed-up and efficiency curves for the parallel MFA algorithm for solving the
graph partitioning problem on a 3-dimensional iPSC/2 hypercube
multicomputer for three different problem sizes. As is seen in Figure 4.4, speed-up and
efficiency increase with increasing problem size and almost linear speed-up is
obtained for large problem sizes.
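The global sum over a hypercube can be simulated in a few lines. In the sketch below, each of $K = 2^d$ simulated processors holds its two partial sums; in step $k$ every rank exchanges its current pair with the neighbor whose rank differs in bit $k$ and adds what it receives, so all ranks hold the totals after $d$ steps of two words each. The mean field values, the temperature and the function name are assumptions made for illustration; real node programs would use the message-passing calls of the target machine.

```python
import math


def hypercube_gsum(local_pairs):
    """Simulate a global sum of (A, B) pairs on K = 2**d processors."""
    K = len(local_pairs)
    d = K.bit_length() - 1
    if 1 << d != K:
        raise ValueError("K must be a power of two")
    vals = [tuple(pair) for pair in local_pairs]
    for k in range(d):
        # Single-hop exchange along dimension k: partner = rank XOR 2**k.
        vals = [(vals[rank][0] + vals[rank ^ (1 << k)][0],
                 vals[rank][1] + vals[rank ^ (1 << k)][1])
                for rank in range(K)]
    return vals, d


# Hypothetical mean fields of the selected vertex for M = 8 partitions.
T = 0.5
phi = [-1.0, -0.5, -2.0, -1.5, -0.2, -3.0, -0.9, -1.1]
K = 4
cols = len(phi) // K  # M / K local columns per processor

local = []
for rank in range(K):
    mine = phi[rank * cols:(rank + 1) * cols]
    a = [math.exp(x / T) for x in mine]       # local exponentials
    b = [x * ax for x, ax in zip(mine, a)]    # local weighted exponentials
    local.append((sum(a), sum(b)))            # partial sums

totals, steps = hypercube_gsum(local)
A_i, B_i = totals[0]
s_new = [math.exp(x / T) / A_i for x in phi]  # normalized spin row
print(steps, round(sum(s_new), 12))
```

The number of exchange steps depends only on $K$, and the message size stays at two values regardless of $N$ or $M$, which is the property the scalability argument relies on.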
4.4.2 Network Model
Columns of the global spin matrix for the network partitioning problem are
partitioned similarly among the processors of the multicomputer, such that
each processor is assigned $M/K$ columns of the global spin matrix. As in
the graph partitioning problem, the host processor initializes the spin matrix and
sends the node processors their portions. Each processor is responsible for
the computation of the spin values in its partition. The algorithm for the node
program for a single iteration is given in Figure 4.5.
Observe that there is one more global communication (at step 4) in this
algorithm because of the first term in (4.8). The rest of the algorithm is similar to
the parallel MFA algorithm for the graph partitioning problem. Although this
parallel algorithm requires one more global communication, it is also expected
to scale on the hypercube due to its fixed communication requirement (both
in number and volume). The speed-up and efficiency curves for the parallel
MFA algorithm for the network partitioning problem on a 3-dimensional iPSC/2
hypercube multicomputer are given in Figure 4.6. As is seen in Figure 4.6,
speed-up and efficiency increase as the problem size increases. Almost linear
speed-up is obtained for large problem sizes.




Figure 4.4. Speed-up (a) and efficiency (b) curves for the graph partitioning
problem.
1. Select a cell $i$ at random.
2. For each local partition $p := 1$ to $M/K$ compute
   $\psi_{ip} = \sum_{n \in N_i} w_n \max\{s_{jp}\,(j \in n)\}$
3. Compute partial summation
   $\psi_i = \sum_{p=1}^{M/K} \psi_{ip}$
4. Perform GSUM to compute the local copies of
   $\psi_i = \sum_{p=1}^{M} \psi_{ip}$
5. For each local partition $p := 1$ to $M/K$ compute mean field values
   $\phi_{ip} = -(\psi_i - \psi_{ip}) - r\, w_i (\gamma_p - w_i s_{ip})$
6. For each local partition $p := 1$ to $M/K$ compute
   $a_{ip} = e^{\phi_{ip}/T}$ and $b_{ip} = \phi_{ip}\, e^{\phi_{ip}/T}$
7. Compute partial summations
   $A_i = \sum_{p=1}^{M/K} a_{ip}$ and $B_i = \sum_{p=1}^{M/K} b_{ip}$
8. Perform GSUM to compute the local copies of
   $A_i = \sum_{p=1}^{M} a_{ip}$ and $B_i = \sum_{p=1}^{M} b_{ip}$
9. For each local partition $p := 1$ to $M/K$ compute $s_{ip}^{new} = a_{ip}/A_i$ and
   then $\Delta s_{ip} = s_{ip}^{new} - s_{ip}$
10. Compute $H_i^{new} = B_i/A_i$ and then $\Delta H_i = H_i^{new} - H_i$
11. For each local partition $p := 1$ to $M/K$ update $\gamma_p = \gamma_p + w_i \Delta s_{ip}$
12. For each local partition $p := 1$ to $M/K$ update $s_{ip} = s_{ip}^{new}$ and
    $H_i = H_i^{new}$
Figure 4.5. Node program for one iteration of the parallel MFA algorithm for
the network partitioning problem.




Figure 4.6. Speed-up (a) and efficiency (b) curves for the network partitioning 
problem.
5. CONCLUSIONS
The Mean Field Annealing (MFA) algorithm, recently proposed for solving
combinatorial optimization problems, combines the characteristics of neural networks
and simulated annealing. Previous works on MFA resulted in successful
application of the algorithm to some classic optimization problems such as the
traveling salesperson problem and the graph partitioning problem. In this
work, MFA is formulated for the mapping problem and the circuit partitioning
problem. Performances of the proposed heuristics are investigated by
comparing them with other well-known heuristics, and efficient parallel versions of the
proposed algorithms are developed.
In Chapter 3, the MFA algorithm is formulated for the mapping problem. An
efficient implementation scheme, which decreases the complexity of the
proposed algorithm by asymptotical factors, is also given. The performance of
the proposed MFA algorithm is evaluated in comparison with two well-known
heuristics: simulated annealing and Kernighan-Lin. The algorithms are
tested on a number of randomly generated mapping problem instances.
Solution qualities of the MFA and simulated annealing heuristics are found to be
superior to the efficient Kernighan-Lin heuristic. The solution quality of
simulated annealing is slightly better than that of MFA, whereas MFA is
faster. As expected, the Kernighan-Lin heuristic is faster than the general
heuristics MFA and simulated annealing, since it
is an efficient, problem specific heuristic having linear time complexity.
However, the linear time complexity of the Kernighan-Lin heuristic cannot be preserved
if the weights of the edges of the graph to be partitioned are not bounded.
Furthermore, MFA algorithm is more suitable for parallelization in compari­
son with simulated annealing and Kernighan-Lin heuristics. Hence, obtained 
results demonstrate that the proposed formulation of the MFA for the mapping 
problem is a promising alternative heuristic for solving the mapping problem.
Inherent parallelism of the MFA is exploited by designing an efficient
parallel algorithm for the proposed MFA heuristic for the mapping problem. The
proposed parallel MFA algorithm achieves perfect load balance, and has a fixed
communication requirement which does not increase with the size of the
problem instance.
The MFA algorithm is formulated for solving the circuit partitioning problem using
two alternative models in Chapter 4. It is shown that the network model is a better scheme for mapping
MFA to the circuit partitioning problem in comparison with the graph model.
Performance of the MFA is compared with the performances of Kernighan-Lin
and simulated annealing heuristics, using randomly generated circuit
partitioning problem instances. Performance of MFA is close to simulated annealing,
and better than the Kernighan-Lin heuristic in some instances. Execution time of
MFA is less than simulated annealing, but more than the Kernighan-Lin heuristic.
Obtained results indicate that MFA can be used as an alternative heuristic for
solving the circuit partitioning problem. MFA algorithms proposed for
solving the circuit partitioning problem are parallelized and implemented on an
iPSC/2 hypercube multicomputer. Experimental results show that the
proposed heuristics can be efficiently parallelized on hypercube multicomputers,
which is crucial for algorithms that solve such computationally hard problems.
Results obtained in this work indicate that MFA, which was originally
proposed for solving the traveling salesperson problem, also works for the circuit
partitioning problem and the mapping problem, and can be used as a general
tool for solving combinatorial optimization problems. Scalability of the
algorithm is quite good; reasonable results are obtained for large problem sizes.
Performance of the proposed MFA algorithms may be improved by fine tuning
of the temperature schedule of the algorithm, which still remains as a research
issue.
Inherent parallelism of MFA is exploited in this work by designing efficient
parallel MFA algorithms. Parallelization of heuristics proposed for solving
NP-hard combinatorial optimization problems is important since the
combinatorial optimization problems are computationally hard problems. Development
of parallel computers increases the need for heuristics that can be efficiently
parallelized. Results obtained in this work show that MFA is a good candidate
for developing efficient parallel heuristics. Proposed parallel MFA algorithms
are expected to scale on parallel architectures, due to their fixed
communication requirements.
Bibliography
[1] Arora, R. K., and Rana, S. P. “Heuristic algorithms for process
assignment in distributed computing systems,” Information Processing Letters,
vol. 11, no. 4-5, pp. 199-203, 1980.
[2] Bokhari, S. H. “On the mapping problem,” IEEE Trans. Comput., vol. 30, 
no. 3, pp. 207-214, 1981.
[3] Brandt, R. D., Wang, Y., Laub, A. J., Mitra, S. K. “Alternative
Networks for Solving the TSP and the List-Matching Problem,” IEEE Int.
Conference on Neural Nets, vol. II, pp. 333-340, July 1988.
[4] Bultan, T., and Aykanat, C. “Parallel mean field algorithms for the so­
lution of combinatorial optimization problems,” Proc. ICANN-91, vol. 1, 
pp. 591-596 , 1991.
[5] Bultan, T., and Aykanat, C. “Circuit Partitioning Using Parallel Mean 
Field Annealing Algorithms,” Proc. 3rd IEEE Symposium on Parallel Pro­
cessing, to be published.
[6] Erçal, F., Ramanujam, J., and Sadayappan, P. “Task allocation onto a
hypercube by recursive mincut bipartitioning,” J. Parallel Distrib. Comput.,
vol. 10, pp. 35-44, 1990.
[7] Fiduccia, C. M., and Mattheyses, R. M. “A linear heuristic for improving
network partitions,” in Proc. Design Automat. Conf., pp. 175-181, 1982.
[8] Garey, M. R., and Johnson, D. S. Computers and Intractability. San
Francisco, CA: Freeman, pp. 209-210, 1979.
[9] Hopfield, J. J. “Neural Networks and Physical Systems with Emergent Col-




[10] Hopfield, J. J. “Neurons with Graded Response Have Collective Computational
Properties Like Those of Two-State Neurons,” Proc. Natl. Acad.
Sci. U.S.A., vol. 81, pp. 3088-3092, 1984.
[11] Hopfield, J. J., and Tank, D. W. “‘Neural’ Computation of Decisions in
Optimization Problems,” Biolog. Cybern., vol. 52, pp. 141-152, 1985.
[12] Hopfield, J. J., and Tank, D. W. “Computing with neural circuits: a
model,” Science, vol. 233, pp. 625-633, August 1986.
[13] Hopfield, J. J., and Tank, D. W. “Collective computation in neuronlike
circuits,” Scientific American, vol. 257, no. 6, pp. 104-114, 1987.
[14] Hegde, S. U., Sweet, J. L., and Levy, W. B. “Determination of Parameters
in Hopfield/Tank Computational Network,” IEEE Int. Conf. Neural
Networks, vol. 2, pp. 291-298, 1988.
[15] Indurkhya, B., Stone H. S., and Xi-Cheng, L. “Optimal partitioning of 
randomly generated distributed programs,” IEEE Trans. Software Engrg., 
vol. 12, no. 3, pp. 483-495, 1986.
[16] Kasahara, H., and Narita, S. “Practical multiprocessor scheduling
algorithms for efficient parallel processing,” IEEE Trans. Comput., vol. 33,
no. 11, pp. 1023-1029, 1984.
[17] Kernighan, B. W., and Lin, S. “An efficient heuristic procedure for
partitioning graphs,” Bell Syst. Tech. J., vol. 49, pp. 291-307, 1970.
[18] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. “Optimization by
simulated annealing,” Science, vol. 220, pp. 671-680, 1983.
[19] Krishnamurthy, B. “An improved min-cut algorithm for partitioning VLSI
networks,” IEEE Trans. Comput., vol. C-33, pp. 438-446, 1984.
[20] Lengauer, T. Combinatorial Algorithms for Integrated Circuit Layout.
Wiley, pp. 251-258, 1990.
[21] Peterson, C., and Anderson, J. R. “Neural networks and NP-complete
optimization problems; a performance study on the graph bisection
problem,” Complex Syst., vol. 2, pp. 59-89, 1988.
[22] Peterson, C., and Soderberg, B. “A new method for mapping optimization
problems onto neural networks,” Int. J. Neural Syst., vol. 1, no. .8, 1989.
[23] Ramanujam, J., Erçal, F., and Sadayappan, P. “Task allocation by
simulated annealing,” Proc. International Conference on Supercomputing,
Boston, MA, May 1988, vol. III, Hardware & Software, pp. 475-497.
[24] Ramanujam, J., and Sadayappan, P. “Optimization by Neural Networks,”
IEEE Int. Conference on Neural Nets, vol. II, pp. 325-332, July 1988.
[25] Sadayappan, P., and Erçal, F. “Nearest-neighbour mapping of finite
element graphs onto processor meshes,” IEEE Trans. Comput., vol. 36, no. 12,
pp. 1408-1424, 1987.
[26] Sadayappan, P., Erçal, F., and Ramanujam, J. “Cluster partitioning
approaches to mapping parallel programs onto a hypercube,” Parallel
Computing, vol. 13, pp. 1-16, 1990.
[27] Schweikert, D. G., and Kernighan, B. W. “A proper model for the par­
titioning of electrical circuits,” in Proc. 9th Design Automat. Workshop, 
pp. 57-62, 1979.
[28] Seitz, C. L. “The Cosmic Cube,” Com. of the ACM, vol. 28, pp. 22-23,
1985.
[29] Shield, J. “Partitioning concurrent VLSI simulation programs onto a
multiprocessor by simulated annealing,” IEEE Proc. Part G, vol. 134, no. 1,
pp. 24-28, 1987.
[30] Szu, H. “Fast TSP Algorithm Based On Binary Neuron Output and
Analog Neuron Input Using The Zero-Diagonal Interconnect Matrix And
Necessary And Sufficient Constraints Of The Permutation Matrix,” IEEE Int.
Conference on Neural Nets, vol. II, pp. 259-266, July 1988.
[31] Tank, D. W., and Hopfield, J. J. “Simple ‘Neural’ optimization networks:
An A/D converter, signal decision circuit, and a linear programming
circuit,” IEEE Trans. Circ. Syst., vol. CAS-33, no. 5, May 1986.
[32] Toomarian, N. “A Concurrent Neural Network Algorithm for the Traveling
Salesman Problem,” Third Conference on Hypercube Concurrent
Computers and Applications, Pasadena.
[33] Van den Bout, D. E., and Miller, T. K. “A Traveling Salesman Objective
Function That Works,” IEEE Int. Conf. Neural Nets, vol. 2, pp. 299-303,
1988.
[34] Van den Bout, D. E., and Miller, T. K. “Improving the performance of
the Hopfield-Tank neural network through normalization and annealing,”
Biolog. Cybern., vol. 62, pp. 129-139, 1989.
[35] Van den Bout, D. E., and Miller, T. K. “Graph partitioning using annealed 
neural networks,” IEEE Trans. Neural Networks, vol. 1, no. 2, pp. 192-203, 
1990.
[36] Wilson, G. V., and Pawley, G. S. “On the Stability of the Traveling
Salesman Problem Algorithm of Hopfield and Tank,” Biolog. Cybern., vol. 58,
pp. 63-70, 1988.
[37] Yih, J. S., and Mazumder, P. “A neural network design for circuit
partitioning,” IEEE Trans. Computer-Aided Design, vol. 9, pp. 1265-1271,
1990.
