Mapping and FPGA global routing using Mean Field Annealing by Haritaoğlu, İsmail
MAPPING
AND
FPGA GLOBAL ROUTING  
USING
MEAN FIELD ANNEALING
A THESIS
SUBMITTED TO THE DEPARTMENT OF COMPUTER 
ENGINEERING AND INFORMATION SCIENCE 
AND THE INSTITUTE OF ENGINEERING AND SCIENCE 
OF BILKENT UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS 
FOR THE DEGREE OF 
MASTER OF SCIENCE
By
Ismail Haritaoglu 
September, 1994
I certify that I have read this thesis and that in my opinion it is fully adequate,
in scope and in quality, as a thesis for the degree of Master of Science.
Asst. Prof. Cevdet Aykanat (Advisor)
I certify that I have read this thesis and that in my opinion it is fully adequate,
in scope and in quality, as a thesis for the degree of Master of Science.
Assoc. Prof. Ömer Benli
I certify that I have read this thesis and that in my opinion it is fully adequate,
in scope and in quality, as a thesis for the degree of Master of Science.
Asst. Prof. Mustafa Pınar
Approved for the Institute of Engineering and Science:
Prof. Mehmet Baray
Director of the Institute
ABSTRACT
MAPPING
AND
FPGA GLOBAL ROUTING 
USING
MEAN FIELD ANNEALING
İsmail Haritaoğlu
M.S. in Computer Engineering and Information Science
Advisor: Asst. Prof. Cevdet Aykanat 
September, 1994
The Mean Field Annealing (MFA) algorithm, which was proposed for solving
combinatorial optimization problems, combines the properties of neural networks
and Simulated Annealing. In this thesis, MFA is formulated for the mapping
problem in parallel processing and for the global routing problem in the physical
design automation of Field Programmable Gate Arrays (FPGAs). A new MFA
formulation is proposed for the mapping problem for mesh-connected and
hypercube architectures. The proposed MFA heuristic exploits the conventional
routing scheme used in mesh and hypercube interconnection topologies to
introduce an efficient encoding scheme. An efficient implementation scheme
that decreases the complexity of the proposed algorithm by asymptotic factors
is also developed. Experimental results show that the proposed MFA heuristic
approaches the speed of the fast Kernighan-Lin heuristic while approaching the
solution quality of the powerful simulated annealing heuristic. We also propose
an order-independent global routing algorithm for SRAM-type FPGAs based on
Mean Field Annealing. The performance of the proposed global routing algorithm
is evaluated in comparison with the LocusRoute global router on ACM/SIGDA
Design Automation benchmarks. Experimental results indicate that the proposed
MFA heuristic performs better than LocusRoute.
Keywords: Mapping, Global Routing, Field Programmable Gate Arrays, Mean
Field Annealing
ÖZET
SOLUTION OF THE MAPPING AND FPGA GLOBAL ROUTING
PROBLEMS USING THE MEAN FIELD ANNEALING METHOD
İsmail Haritaoğlu
M.S. in Computer Engineering and Information Science
Advisor: Asst. Prof. Dr. Cevdet Aykanat
September, 1994
The Mean Field Annealing algorithm, which was proposed for solving
combinatorial optimization problems, carries the properties of neural networks
and of the Simulated Annealing method. In this work, the Mean Field Annealing
algorithm is adapted to the global routing problem of Field Programmable Gate
Arrays and to the mapping problem in parallel programming. In the first part of
the thesis, the Mean Field Annealing algorithm is used to solve the global routing
problem of Field Programmable Gate Arrays. The performance of the proposed
algorithm is evaluated by comparison with the LocusRoute global routing
algorithm. The experiments were carried out on the standard circuits
(benchmarks) used for comparing algorithms. The results obtained show that the
Mean Field Annealing algorithm can be used as a good alternative algorithm for
solving the global routing problem. In the second part of the thesis, an algorithm
that is faster than the previously proposed algorithms is developed for the
mapping problem on mesh and hypercube parallel computers, and the
performance of this algorithm is evaluated by comparison with Kernighan-Lin,
Simulated Annealing and the previously proposed mean field annealing methods.
Keywords: Mean Field Annealing algorithm, Mapping problem, Global
routing algorithms, Field programmable gate arrays
ACKNOWLEDGEMENTS
I would like to express my deep gratitude to my supervisor Dr. Cevdet Aykanat 
for his guidance, suggestions, and invaluable encouragement throughout the 
development of this thesis. I would like to thank Dr. Ömer Benli for reading 
and commenting on the thesis. I would also like to thank Dr. Mustafa Pınar for 
reading and commenting on the thesis. I owe special thanks to Dr. Mehmet 
Baray for providing a pleasant environment for study. I am grateful to my 
family and my friends for their infinite moral support and help.
I dedicate this work
to my mother and my father, to whom I owe everything,
and
to Esin.
Contents
1 INTRODUCTION
2 MEAN FIELD ANNEALING
2.1 Mean Field Annealing
2.1.1 Ising Model
2.1.2 Potts Model
2.1.3 MFA Algorithm
3 FPGAs & GLOBAL ROUTING
3.1 Introduction to Field Programmable Gate Arrays
3.1.1 Logic Blocks
3.1.2 Programming Technologies
3.1.3 Routing Architectures
3.2 Physical Design Automation of FPGAs
3.2.1 Partitioning
3.2.2 Placement
3.2.3 Routing
3.3 Global Routing Problem in Design Automation of FPGAs
3.4 Model of FPGA for Global Routing
4 MFA SOLUTION FOR GLOBAL ROUTING IN FPGA
4.1 MFA Formulation of Global Routing
4.2 Implementation
4.3 Experimental Results
5 THE MAPPING PROBLEM
5.1 The Mapping Problem
5.2 The Model of Mapping Problem
6 MFA SOLUTION FOR MAPPING
6.1 General MFA Formulation for Mapping Problem
6.2 Interconnection-Topology Specific MFA Formulation for Mapping
6.2.1 MFA Formulation for Mesh-Connected Architectures
6.2.2 MFA Formulation for Hypercube Architecture
6.3 Performance Evaluation
6.4 Experimental Results
7 CONCLUSION
List of Figures
2.1 Mean Field Annealing Algorithm
3.1 The Architecture of General FPGA
3.2 Example of flexibilities of FPGA (a) flexibility of switch block (b) flexibility of connection block
3.3 The Architecture of Xilinx 3000 FPGA
3.4 The Architecture of Actel FPGA
3.5 General approach to FPGA routing (a) Global routing (b) Detailed routing
3.6 Sample two bends routes
3.7 The FPGA model used for Global Routing
3.8 (a) The routing area of the two-pin net and its subnets, (b) The possible routes for each subnet
3.9 The Cost Graph for FPGA model
4.1 Channel density distribution obtained by MFA for the circuit C1355
4.2 Channel density distribution obtained by LocusRoute for the circuit C1355
4.3 SEGA detailed router results of the circuit C1355 for the global routing solutions obtained by (a) MFA (b) LocusRoute
5.1 An example of mapping problem
6.1 The proposed efficient MFA algorithm for the mapping problem for mesh-connected architectures
6.2 Three different ways of dividing a 3-dimensional hypercube into two 2-dimensional subcubes
6.3 The mean field value calculation of given spin i of subcube
List of Tables
4.1 MCNC benchmark circuits used in experiments
4.2 The Global Router results
4.3 The SEGA detailed routing results in area optimization mode
4.4 The SEGA detailed routing results in speed optimization mode
4.5 Minimum Channel Width for 100% routing
6.1 Total communication cost averages, normalized with respect to mesh-specific MFA, of the solutions found by SA, KL, general MFA and mesh-specific MFA for randomly generated mapping problem instances for various mesh sizes
6.2 Percent computational load imbalance averages of the solutions found by SA, KL, general MFA and mesh-specific MFA for randomly generated mapping problem instances for various mesh sizes
6.3 Execution time averages of the solutions found by SA, KL, general MFA and mesh-specific MFA for randomly generated mapping problem instances for various mesh sizes
6.4 Average performance measures of the solutions found by SA, KL, general MFA and mesh-specific MFA for randomly generated mapping problem instances
6.5 The benchmark sparse matrix data used in experiments
6.6 Total communication cost averages, normalized with respect to mesh-specific MFA, of the solutions found by SA, KL, general MFA and mesh-specific MFA for some benchmark mapping problem instances for various mesh sizes
6.7 Load imbalance averages of the solutions found by SA, KL, general MFA and mesh-specific MFA for some benchmark mapping problem instances for various mesh sizes
6.8 Total execution time, normalized with respect to mesh-specific MFA, of the solutions found by SA, KL, general MFA and mesh-specific MFA for some benchmark mapping problem instances for various mesh sizes
6.9 Average performance measures of the solutions found by SA, KL, general MFA and mesh-specific MFA for mapping problem instances
6.10 Total communication cost averages, normalized with respect to hypercube-specific MFA, of the solutions found by SA, KL, general MFA and hypercube-specific MFA for randomly generated mapping problem instances for various hypercube sizes
6.11 Percent computational load imbalance averages of the solutions found by SA, KL, general MFA and hypercube-specific MFA for randomly generated mapping problem instances for various hypercube sizes
6.12 Execution time averages of the solutions found by SA, KL, general MFA and hypercube-specific MFA for randomly generated mapping problem instances for various hypercube sizes
Chapter 1
INTRODUCTION
A common property of the domain mapping problem in parallel processing and
the global routing problem in VLSI is that both are combinatorial optimization
problems. Like many problems in VLSI, parallel processing and other areas,
these problems involve a finite set of configurations from which solutions
satisfying a number of rigid requirements are selected. The objective of a
combinatorial optimization algorithm is to find a solution of optimum cost,
provided that a cost can be assigned to each solution. Many combinatorial
optimization problems are NP-hard. There are no known deterministic
polynomial-time algorithms that find the optimal solution to any of these hard
problems. Algorithms using complete enumeration techniques are usually
exponential in the size of the problem; therefore they require a great amount of
time to find the optimal solution. As a result, heuristics that run in low-order
polynomial time have been employed to obtain good solutions to these hard
problems. A disadvantage of such heuristics is that they may get stuck in
local minima.
A powerful method for solving combinatorial optimization problems used in
previous research is Simulated Annealing. This method applies a successful
statistical method, which is used to estimate the results of the annealing
process in statistical mechanics, to combinatorial optimization problems.
Simulated Annealing is a general method that is guaranteed to find the optimal
solution if time is not limited. However, the time needed by Simulated Annealing
is too large, and the exact solution of NP-hard problems remains intractable.
In practice, Simulated Annealing can be used as a heuristic to obtain
near-optimal solutions in limited time, and as the time limit is increased, the
quality of the obtained solutions also increases. An important property of
Simulated Annealing is its ability to escape from local minima if sufficient time
is given. Simulated Annealing has been applied to various NP-hard optimization
problems, and for most of them it gives good results.
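The acceptance rule that lets Simulated Annealing leave local minima can be
illustrated with a minimal sketch. The cost function, the neighbor move, the
geometric cooling schedule and all parameter values below are assumptions
chosen for illustration; they are not taken from this thesis.

```python
import math
import random


def simulated_annealing(cost, neighbor, state, t_start=10.0, t_end=1e-3,
                        alpha=0.95, moves_per_temp=100, seed=0):
    """Generic SA loop; parameter names and defaults are illustrative only."""
    rng = random.Random(seed)
    current, current_cost = state, cost(state)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            candidate = neighbor(current, rng)
            delta = cost(candidate) - current_cost
            # Downhill moves are always accepted. Uphill moves are accepted
            # with probability exp(-delta / t), which allows the search to
            # leave a local minimum while the temperature is still high.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha  # geometric cooling (an assumed schedule)
    return best, best_cost


def toy_cost(x):
    # Hypothetical integer cost landscape with several local minima.
    return (x - 7) ** 2 + 10 * (x % 3)


def toy_neighbor(x, rng):
    # Hypothetical move: step one unit to the left or to the right.
    return x + rng.choice((-1, 1))
```

Calling `simulated_annealing(toy_cost, toy_neighbor, 0)` returns the best state
visited and its cost. A longer schedule (larger `moves_per_temp` or `alpha`
closer to 1) trades running time for solution quality, which is the trade-off
described above.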
The subject of this thesis is a more recent algorithm, called Mean Field
Annealing (MFA), which was originally proposed for solving the traveling
salesperson problem. MFA is a general strategy and can be applied to various
problems with suitable formulations. Work on MFA has shown that it can be
successfully applied to combinatorial optimization problems. MFA merges the
collective computation and annealing properties of Hopfield Neural Networks
(HNN) and Simulated Annealing (SA), respectively, to obtain a general
algorithm for solving combinatorial optimization problems. To solve a
combinatorial optimization problem with MFA, a representation scheme is first
chosen in which the final states of the spins can be decoded as a solution
to the target problem. Then, an energy function is constructed whose global
minimum value corresponds to the best solution of the problem to be solved.
MFA is expected to compute the best solution to the target problem, starting
from a randomly chosen initial state, by minimizing this energy function. In
this thesis, MFA is formulated for the mapping problem in parallel processing
and the global routing problem in the design automation of Field Programmable
Gate Arrays.
The first combinatorial optimization problem solved by MFA in this thesis is
the global routing problem in the design automation of field programmable
gate arrays. This study investigates the routing problem in Static RAM Field
Programmable Gate Arrays (FPGAs) implementing the non-segmented
(Xilinx-based) network [27]. As routing in FPGAs is a very complex
combinatorial optimization problem, the routing process can be carried out in
two phases: global routing followed by detailed routing [11]. Global routing
determines the course of wires through sequences of channel segments. Detailed
routing determines the wire segment allocation for the channel segment routes
found in the first phase, which enables feasible switch box interconnection
configurations [2-5, 14]. Global routing in FPGAs can be done by using global
routing algorithms proposed for standard cells [25]. The LocusRoute global
router is one router of this type used for global routing in FPGAs [24]; it
divides multi-pin nets into two-pin nets and considers only minimum-distance
routes for these two-pin nets. The objective in LocusRoute is to distribute the
connections among channels so that channel densities are balanced. In this
thesis, we propose a new approach to the solution of the global routing problem
in FPGAs using the Mean Field Annealing technique.
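To make the notions of minimum-distance routes and channel density concrete,
the sketch below routes two-pin nets one after another on a grid and, for each
net, picks the one-bend route whose most crowded channel segment ends up least
crowded. This is neither LocusRoute nor the MFA formulation of this thesis; the
grid model, the segment naming and the greedy rule are assumptions made only for
illustration. Note that the result of such a sequential scheme depends on the
order in which the nets are processed.

```python
from collections import Counter


def one_bend_routes(src, dst):
    """Return the minimum-distance routes with at most one bend.

    A route is a list of unit channel segments. ("H", x, y) joins grid points
    (x, y) and (x + 1, y); ("V", x, y) joins (x, y) and (x, y + 1).
    """
    (x1, y1), (x2, y2) = src, dst

    def horizontal(y, xa, xb):
        return [("H", x, y) for x in range(min(xa, xb), max(xa, xb))]

    def vertical(x, ya, yb):
        return [("V", x, y) for y in range(min(ya, yb), max(ya, yb))]

    h_first = horizontal(y1, x1, x2) + vertical(x2, y1, y2)
    v_first = vertical(x1, y1, y2) + horizontal(y2, x1, x2)
    # A straight connection has only one such route.
    return [h_first] if h_first == v_first else [h_first, v_first]


def greedy_global_route(nets):
    """Route two-pin nets in the given order, balancing segment densities."""
    density = Counter()
    chosen = []
    for src, dst in nets:
        routes = one_bend_routes(src, dst)
        # Prefer the route whose busiest segment would be least busy.
        best = min(routes,
                   key=lambda r: max((density[s] + 1 for s in r), default=0))
        for segment in best:
            density[segment] += 1
        chosen.append(best)
    return chosen, density
```

For two identical nets from (0, 0) to (2, 2), the first takes the
horizontal-first route and the second is pushed onto the vertical-first route,
so no segment carries more than one connection.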
The second problem solved by MFA is the mapping problem [4, 8, 29].
The mapping problem arises as parallel programs are developed for
distributed-memory architectures. Various classes of problems can be decomposed
into a set of interacting sequential subproblems (tasks) which can be executed
in parallel. In these classes of problems, the interaction pattern among the
tasks is static. In a distributed-memory architecture, a pair of processors
communicate with each other over a shortest path of links connecting them.
Hence, communication between each pair of processors can be associated with a
relative unit communication cost. The unit communication cost between a pair of
processors can be assumed to be linearly proportional to the shortest path
distance between those two processors. The objective in mapping subproblems to
the processors of a multicomputer is the minimization of the expected execution
time of the parallel program on the target architecture. Thus, the mapping
problem can be modeled as an optimization problem by associating the following
quality measures with a good mapping: (i) interprocessor communication overhead
should be minimized, (ii) computational load should be uniformly distributed
among processors in order to minimize processor idle time. The mapping problem
has previously been solved using Simulated Annealing and Kernighan-Lin type
heuristics. MFA has also been formulated for the mapping problem in [6, 5].
However, that formulation was a general formulation for any type of
multicomputer whose interconnection topology is known. In this thesis, we
propose an efficient MFA formulation for topology-specific mapping for the 2D
mesh and the hypercube. For each interconnection topology, an efficient MFA
formulation is given instead of using one general formulation as in [6].
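The two quality measures can be evaluated for a given mapping as in the
following sketch. The data layout (an edge list of `(task, task, weight)`
triples, a `mapping` list from task index to processor index, row-major
processor numbering on the mesh) and the exact definition of percent load
imbalance are assumptions made here for illustration; the model actually used
is defined in Chapter 5.

```python
def mesh_distance(p, q, cols):
    """Shortest-path (Manhattan) distance between processors on a 2D mesh.

    Processors are assumed to be numbered row by row, `cols` per row.
    """
    (r1, c1), (r2, c2) = divmod(p, cols), divmod(q, cols)
    return abs(r1 - r2) + abs(c1 - c2)


def hypercube_distance(p, q):
    """Shortest-path distance on a hypercube: the Hamming distance of labels."""
    return bin(p ^ q).count("1")


def communication_cost(edges, mapping, cols):
    """Sum over task pairs of interaction weight times processor distance."""
    return sum(w * mesh_distance(mapping[i], mapping[j], cols)
               for i, j, w in edges)


def load_imbalance_percent(task_weights, mapping, num_procs):
    """Percent by which the most loaded processor exceeds the average load."""
    loads = [0.0] * num_procs
    for task, weight in enumerate(task_weights):
        loads[mapping[task]] += weight
    average = sum(loads) / num_procs
    return 100.0 * (max(loads) - average) / average
```

With four unit-weight tasks on a 2 x 2 mesh and the edge list
`[(0, 1, 3), (1, 2, 1), (2, 3, 2), (0, 3, 1)]`, the mapping `[0, 1, 3, 2]`
places every interacting pair on neighboring processors, whereas `[0, 3, 0, 3]`
both lengthens the communication paths and leaves two processors idle.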
In Chapter 2, the theory of the Mean Field Annealing heuristic and its encoding
models are explained. Field Programmable Gate Arrays, their design automation
and the global routing problem are introduced in Chapter 3. The FPGA model for
the global routing problem is also proposed in that chapter. Chapter 4 gives
the MFA formulation of the global routing problem in FPGA design automation.
The mapping problem is introduced in Chapter 5. Chapter 6 presents the general
MFA formulation and the topology-specific MFA formulations for the mapping
problem. Finally, the conclusions of the thesis are stated in Chapter 7.
Chapter 2
MEAN FIELD ANNEALING
In this chapter the Mean Field Annealing (MFA) heuristic is introduced and 
its models are given.
2.1 Mean Field Annealing
Mean Field Annealing (MFA) merges the collective computation and annealing
properties of Hopfield Neural Networks (HNN) and Simulated Annealing (SA),
respectively, to obtain a general algorithm for solving combinatorial
optimization problems. HNN has been used for solving various optimization
problems, and reasonable results have been obtained for small problems. However,
simulations of this network reveal that it is hard to obtain feasible solutions
for large problem sizes. Hence, the algorithm does not scale well, and scaling
is a very important performance criterion for heuristic optimization
algorithms. MFA is proposed as a successful alternative to HNN. In the MFA
algorithm, the problem representation is identical to that of HNN, but the
iterative scheme used to relax the system is different. MFA can be used for
solving a combinatorial optimization problem by choosing a representation
scheme in which the final states of the spins can be decoded as a solution to
the target problem. Then, an energy function is constructed whose global
minimum value corresponds to the best solution of the problem to be solved. MFA
is expected to compute the best solution to the target problem, starting from a
randomly chosen initial state, by minimizing this energy function. The steps of
formulating the MFA technique for a combinatorial optimization problem can be
summarized as follows:
• Choose a representation scheme which encodes the configuration space
of the target problem using spins. In order to get good performance, the
number of possible configurations in the problem domain and in the spin
domain must be equal, i.e., there must be a one-to-one mapping between
the configurations of the spins and those of the problem.
• Formulate the cost function of the problem in terms of spins, i.e., derive
the energy function of the system. The global minimum of the energy
function should correspond to the global minimum of the cost function.
• Derive the mean field theory equations using this energy function, i.e.,
derive equations for updating the expected values of the spins.
• Minimize the complexity of the update operations in order to get an efficient
algorithm.
• Select the energy function and the cooling schedule parameters.
The MFA algorithm is derived by analogy to the Ising and Potts models, which
are used to estimate the state of a system of particles, called spins, in thermal
equilibrium.
2.1.1 Ising Model
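In the Ising model, spins can be in one of two states, represented by 0 and 1.
The energy of a system with S spins has the following form:

$$H(s) = \frac{1}{2}\sum_{k=1}^{S}\sum_{l \neq k}\beta_{kl}\, s_k s_l + \sum_{k=1}^{S} h_k s_k \tag{2.1}$$

Here, $\beta_{kl}$ indicates the level of interaction between spins k and l, and
$s_k \in \{0,1\}$ is the value of spin k. It is assumed that
$\beta_{kl} = \beta_{lk}$ and $\beta_{kk} = 0$ for $1 \le k, l \le S$.
At thermal equilibrium, the spin average $\langle s_k \rangle$ of spin k can be
calculated using the Boltzmann distribution as follows:

$$\langle s_k \rangle = \frac{1}{1 + e^{-\phi_k/T}} \tag{2.2}$$

Here, $\phi_k = \langle H(s)\rangle|_{s_k=0} - \langle H(s)\rangle|_{s_k=1}$
represents the mean field acting on spin k, where the energy average
$\langle H(s)\rangle$ of the system is

$$\langle H(s)\rangle = \frac{1}{2}\sum_{k=1}^{S}\sum_{l \neq k}\beta_{kl}\langle s_k s_l\rangle + \sum_{k=1}^{S} h_k \langle s_k\rangle \tag{2.3}$$

The complexity of computing $\phi_k$ using Eq. (2.3) is exponential. However, for
a large number of spins, the mean field approximation can be used to compute the
energy average as

$$\langle H(s)\rangle = \frac{1}{2}\sum_{k=1}^{S}\sum_{l \neq k}\beta_{kl}\langle s_k\rangle\langle s_l\rangle + \sum_{k=1}^{S} h_k \langle s_k\rangle \tag{2.4}$$

Since $\langle H(s)\rangle$ is linear in $\langle s_k\rangle$, the mean field
$\phi_k$ can be computed using the following equation:

$$\phi_k = \langle H(s)\rangle|_{s_k=0} - \langle H(s)\rangle|_{s_k=1} = -\frac{\partial\langle H(s)\rangle}{\partial\langle s_k\rangle} = -\Big(\sum_{l \neq k}\beta_{kl}\langle s_l\rangle + h_k\Big) \tag{2.5}$$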
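The relations between Eqs. (2.2), (2.4) and (2.5) can be checked numerically
with a short sketch. The list-of-lists layout for the interaction matrix and
the function names are assumptions introduced here; `mean_energy` evaluates the
mean field approximation of Eq. (2.4), and `ising_mean_field` evaluates the
closed form of Eq. (2.5), which needs only one row of the matrix.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function 1 / (1 + exp(-x)).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mean_energy(beta, h, avg):
    """Mean field approximation of the energy average, Eq. (2.4)."""
    n = len(avg)
    pair = sum(beta[k][l] * avg[k] * avg[l]
               for k in range(n) for l in range(n) if l != k)
    return 0.5 * pair + sum(h[k] * avg[k] for k in range(n))


def ising_mean_field(beta, h, avg, k):
    """Mean field acting on spin k, Eq. (2.5); beta is assumed symmetric."""
    n = len(avg)
    return -(sum(beta[k][l] * avg[l] for l in range(n) if l != k) + h[k])


def ising_update(beta, h, avg, k, temperature):
    """New average of spin k at the given temperature, Eq. (2.2)."""
    return sigmoid(ising_mean_field(beta, h, avg, k) / temperature)
```

Because the approximated energy is linear in each spin average, the difference
`mean_energy(...)` with spin k forced to 0 minus the value with spin k forced to
1 equals `ising_mean_field(...)`, so a single update costs a number of
operations proportional to the number of spins instead of requiring a sum over
all configurations.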
2.1.2 Potts Model
In the Potts model, spins can be in one of K states. In a K-state Potts model
of S spins, the states of the spins are represented using S K-dimensional vectors
$S_i = [s_{i1}, \ldots, s_{ik}, \ldots, s_{iK}]^t$, $1 \le i \le S$, where "t"
denotes the vector transpose operation.
The spin vector $S_i$ is allowed to be equal to one of the principal unit vectors
$e_1, \ldots, e_k, \ldots, e_K$, and cannot take any other value. The principal
unit vector $e_k$ is defined to be a vector which has all its components equal
to 0 except its k'th component, which is equal to 1. Spin $S_i$ is said to be in
state k if it is equal to $e_k$. Hence, a K-state Potts spin $S_i$ is composed
of K two-state variables $s_{i1}, \ldots, s_{ik}, \ldots, s_{iK}$, where
$s_{ik} \in \{0,1\}$, with the following constraint:

$$\sum_{k=1}^{K} s_{ik} = 1, \qquad 1 \le i \le S \tag{2.6}$$
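In the Potts model, the energy of a system with S K-state Potts spins has the
following form:

$$H(S) = \frac{1}{2}\sum_{i=1}^{S}\sum_{j \neq i}\beta_{ij}\, S_i S_j + \sum_{i=1}^{S} h_i S_i \tag{2.7}$$

Here, $\beta_{ij}$ indicates the level of interaction between spins i and j, and
the interaction between Potts spins $S_i$ and $S_j$ is formulated as
$\sum_{k=1}^{K}\sum_{l=1}^{K} s_{ik} s_{jl}$. Therefore we can formulate the
energy of the system as

$$H(S) = \frac{1}{2}\sum_{i=1}^{S}\sum_{j \neq i}\sum_{k=1}^{K}\sum_{l=1}^{K}\beta_{ij}\, s_{ik} s_{jl} + \sum_{i=1}^{S}\sum_{l=1}^{K} h_i s_{il} \tag{2.8}$$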
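Here, $s_{ik} \in \{0,1\}$ is the value of the k'th state of Potts spin i. At
thermal equilibrium, the spin average $\langle s_{ik}\rangle$ of spin i can be
calculated using the Boltzmann distribution as follows:

$$\langle s_{ik}\rangle = \frac{e^{\phi_{ik}/T}}{\sum_{l=1}^{K} e^{\phi_{il}/T}} \tag{2.9}$$

Here, $\langle s_{ik}\rangle \in [0,1]$. Note that $s_{ik}$ can be 0 or 1, but
$\langle s_{ik}\rangle$ can be any real value between 0 and 1. $\phi_{ik}$
represents the mean field acting on state k of spin i.
The mean field value for Potts spin i can be formulated as

$$\phi_{ik} = \langle H(S)\rangle|_{S_i=0} - \langle H(S)\rangle|_{S_i=e_k} \tag{2.10}$$

$$\phantom{\phi_{ik}} = -\frac{\partial\langle H(S)\rangle}{\partial\langle s_{ik}\rangle} = -\Big(\sum_{j \neq i}\sum_{l=1}^{K}\beta_{ij}\langle s_{jl}\rangle + h_i\Big) \tag{2.11}$$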
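The Potts update of Eq. (2.9) is a normalized exponential of the mean field
values of one spin, so the K averages of a spin always sum to 1, in agreement
with the constraint of Eq. (2.6). The sketch below takes the mean field vector
as given; the function names and the use of 0-based state indices are
assumptions made for illustration.

```python
import math


def potts_update(phi, temperature):
    """Averages <s_ik> for k = 1..K from the mean field vector of spin i."""
    # Subtracting the maximum leaves the ratios of Eq. (2.9) unchanged and
    # keeps the exponentials from overflowing at low temperatures.
    shift = max(phi)
    weights = [math.exp((p - shift) / temperature) for p in phi]
    total = sum(weights)
    return [w / total for w in weights]


def decode_state(avg):
    """0-based index of the largest average: the state the spin settles in."""
    return max(range(len(avg)), key=avg.__getitem__)
```

At a high temperature the averages are close to uniform, while at a low
temperature `potts_update` approaches the principal unit vector of the state
with the largest mean field, which is what makes the final spin values
decodable.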
At each temperature, starting with the initial spin averages, the mean field
acting on a randomly selected spin is found using Eq. (2.5) for Ising spins or
Eq. (2.10) for Potts spins. Then, the spin average is updated using Eq. (2.2) or
Eq. (2.9), respectively. This process is repeated for a random sequence of spins
until the system is stabilized for the current temperature. The MFA algorithm
tries to find the equilibrium point of a system of S spins using an annealing
process similar to SA. The state equations used in MFA are isomorphic to the
state equations of the neurons in the HNN. A synchronous version of MFA can be
derived by solving N difference equations for N spin values simultaneously.
This technique is identical to the simulations of HNN done using numerical
methods. Thus, the evolution of a solution in an HNN is equivalent to the
relaxation toward an equilibrium state effected by the MFA algorithm at a fixed
temperature [9]. Hence, MFA can be viewed as an annealed neural network derived
from HNN. HNN and SA have a major difference: SA is an algorithm implemented in
software, whereas HNN is derived with a possible hardware implementation in
mind. MFA is somewhere in between: it is an algorithm implemented in software
that has potential for hardware realization [8, 9]. In this work, MFA is
treated as a software algorithm, like SA. The results obtained are comparable
to those of other software algorithms, confirming this point of view.
1. Get the initial temperature $T_0$, and set $T = T_0$
2. Initialize spin averages
      Ising spins : $[\langle u_1\rangle, \langle u_2\rangle, \ldots]$
      Potts spins : $[\langle S_1\rangle, \langle S_2\rangle, \ldots]$
3. WHILE temperature T is in the cooling range DO
4.    WHILE system is not stabilized for the current temperature DO
         Select a spin i at random
         4.1 Compute the mean field acting on spin i
            Ising spin : compute $\phi_i = E(U)|_{u_i=0} - E(U)|_{u_i=1}$
            Potts spin : compute $\phi_i = [\phi_{i1}, \phi_{i2}, \ldots, \phi_{iK}]^t$ such that
               $\phi_{ik} = H(S)|_{S_i=0} - H(S)|_{S_i=e_k}$ for $k = 1, 2, \ldots, K$
         4.2 Update the average value of spin i
            Ising spin : $\langle u_i\rangle = 1/(1 + e^{-\phi_i/T})$
            Potts spin : $\langle s_{ik}\rangle = e^{\phi_{ik}/T} / \sum_{l=1}^{K} e^{\phi_{il}/T}$ for $k = 1, 2, \ldots, K$
5.    Update T according to the cooling schedule
Figure 2.1. Mean Field Annealing Algorithm
2.1.3 MFA Algorithm
The Mean Field Annealing algorithm is summarized in Figure 2.1. At the
beginning of the algorithm, the initial temperature is obtained and the current
temperature is set to that initial value (step 1). After that, the Ising or
Potts spin averages are initialized (step 2). Then the annealing process of MFA
begins. During cooling, the system tries to reach a stable state at each
temperature, until most of the spins converge. At each temperature, while the
system is not in a stable state, a spin is selected at random (step 4), the
mean field acting on that spin is calculated (step 4.1), and the spin average
is updated (step 4.2). When the system reaches a stable state, the temperature
is decreased according to the cooling schedule (step 5). At the end of the
algorithm, when most of the spins have converged, the spins are decoded into a
solution of the target problem.
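The steps of Figure 2.1 can be sketched for Ising spins as follows. Several
choices here are assumptions rather than the thesis's settings: all averages
start at 0.5, the temperature is reduced geometrically, the spins are visited
in a freshly shuffled order in each sweep instead of being drawn one at a time,
and the system is considered stabilized when no average changes by more than
`eps` during a sweep.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function 1 / (1 + exp(-x)).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mfa_ising(beta, h, t_start=5.0, t_end=0.05, alpha=0.9,
              eps=1e-6, max_sweeps=100, seed=0):
    """Mean Field Annealing for H(s) = 1/2 sum beta_kl s_k s_l + sum h_k s_k."""
    rng = random.Random(seed)
    n = len(h)
    t = t_start                          # step 1: T = T0
    avg = [0.5] * n                      # step 2: initialize spin averages
    while t > t_end:                     # step 3: cooling range
        for _ in range(max_sweeps):      # step 4: relax at this temperature
            order = list(range(n))
            rng.shuffle(order)
            biggest_change = 0.0
            for k in order:
                # step 4.1: mean field acting on spin k, Eq. (2.5)
                phi = -(sum(beta[k][l] * avg[l]
                            for l in range(n) if l != k) + h[k])
                # step 4.2: update the average of spin k, Eq. (2.2)
                new = sigmoid(phi / t)
                biggest_change = max(biggest_change, abs(new - avg[k]))
                avg[k] = new
            if biggest_change < eps:
                break
        t *= alpha                       # step 5: cooling schedule
    # Decode: a converged average close to 1 means the spin is in state 1.
    return [1 if a > 0.5 else 0 for a in avg], avg
```

As a small usage example, take two spins with `beta = [[0, 4], [4, 0]]` and
`h = [-1, -2]`: setting both spins to 1 is penalized, and the energies of the
four configurations are 0, -1, -2 and 1. `mfa_ising(beta, h)` decodes to
`[0, 1]`, the configuration with energy -2.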
Chapter 3
FPGAs & GLOBAL ROUTING
This chapter briefly introduces Field Programmable Gate Arrays and their
physical design automation steps. The routing architectures of FPGAs are
described, and the global routing problem and its previous solutions are given
at the end of the chapter. The global routing problem in FPGAs is also modeled
in this chapter.
3.1 Introduction to Field Programmable Gate Arrays
Field Programmable Gate Arrays (FPGAs) are a new class of electrically
programmable integrated circuits that provide high integration and rapid
turnaround time. In VLSI design automation, fabrication time is an important
problem. In order to reduce the time needed to fabricate interconnects,
programmable devices have been introduced. FPGAs are very popular programmable
devices in the ASIC design market.
FPGAs can reduce manufacturing turnaround time and cost. In its simplest
form, an FPGA consists of an array of programmable logic blocks and a routing
network to interconnect the logic blocks. The programmable logic blocks can
be programmed by the user to implement a small logic function. An important
property of FPGAs is re-programmability through electrically programmable
switches. Commercial FPGAs differ in the type of programming technology
used, in the architecture of their logic blocks and in their routing
architectures. An FPGA logic block can be as simple as a transistor or as
complex as a microprocessor. It is typically capable of implementing many
different combinational and sequential logic functions. FPGA logic blocks can
be classified as transistor pairs, basic small gates (such as two-input NANDs),
multiplexers and look-up tables.
3.1.1 Logic Blocks
FPGA logic blocks differ greatly in their size and implementation capability.
The two-transistor logic block can only implement an inverter but is very small,
while the look-up table logic blocks used in Xilinx FPGAs can implement
any five-input logic function but are significantly larger. Logic blocks
can be classified in terms of granularity. Granularity can be defined in various
ways, for example, as the number of Boolean functions that the logic block can
implement, the number of equivalent two-input NAND gates, the total number
of transistors, or the number of inputs and outputs. Generally, however,
commercial logic blocks are classified into two categories: fine-grain and
coarse-grain. The main advantage of using fine-grain logic blocks is that the
usable blocks are fully utilized. The main disadvantage of fine-grain blocks is
that they require a relatively large number of wire segments and programmable
switches.
3.1.2 Programming Technologies
An FPGA is programmed using electrically programmable switches. According
to the properties of these programmable switches, such as on-resistance and
capacitance, programming technologies can be classified into three main types:
SRAM, antifuse and EPROM programming technologies.
The SRAM programming technology uses static RAM cells to control the
gates and multiplexers. The switch is a pass transistor controlled by
the state of an SRAM bit. Since SRAM is volatile, the FPGA must
be loaded and configured each time the chip is powered up, which requires an
external permanent memory, such as a PROM or EPROM, to provide the programming bits.
A major disadvantage of SRAM programming technology is its large area (it
takes at least five transistors to implement an SRAM cell). Its advantage
is fast re-programmability.
Figure 3.1. The Architecture of General FPGA (labeled parts: wiring segments,
routing channel, logic block, connection block, switch block)
An antifuse is a two-terminal device whose unprogrammed state presents
a very high resistance between its terminals. When a high voltage is applied
across its terminals, the antifuse blows and creates a low-resistance link. This
link is permanent. Programming an antifuse requires extra circuitry to deliver
the high programming voltage and a high current. A major advantage of the
antifuse is its small size, although this advantage is reduced by the large size
of the necessary programming transistors.
The floating gate programming technology uses the technology found in
ultraviolet-erasable EPROM and electrically erasable EEPROM devices. The major
advantage of EPROM technology is its fast re-programmability. In addition, it does
not require extra permanent memory to program the chip on power-up. However, this
technology increases the number of processing steps and results in high-resistance
transistors.
3.1.3 Routing Architectures
The routing architecture of an FPGA is the manner in which the programmable
switches and wiring segments are positioned to allow the programmable
interconnection of the logic blocks. Figure 3.1 illustrates a typical routing
architecture model. Before describing some commercial FPGA routing architectures,
it is helpful to give a few definitions needed to understand the routing problem
in FPGAs.
A wire segment is a wire unbroken by programmable switches. One or more switches
may be attached to a wire segment, and each end of a wire segment has a switch
attached.
A track is a sequence of one or more wire segments in a line.
A routing channel is a group of parallel tracks, as in Figure 3.1.
Figure 3.2. Example of flexibilities of FPGA (a) flexibility of switch block
(b) flexibility of connection block (figure labels: wiring segments, logic block,
Fs=5, Fc=3)
As shown in Figure 3.1, the model contains two basic structures: connection
blocks and switch blocks. A connection block provides connectivity from
the inputs and outputs of a logic block to the wire segments in the channels. A
switch block provides connectivity between the horizontal and the vertical
wire segments.
As in Figure 3.2, the general routing structure of an FPGA has two important
kinds of interconnection block: connection blocks, which make connections
between logic block pins and routing segments, and switch blocks,
where connections are switched at the intersections of horizontal and vertical
channels. The number of switches in the connection and switch blocks is important
for good routability. A large number of switches increases routability,
but it also causes poor performance, large delay and large area.
The number and distribution of the switches used for interconnection is called
the flexibility of an FPGA. The flexibility of a switch block ($F_s$) and the
flexibility of a connection block ($F_c$) can be defined as the number of choices
offered to each wire entering a switch block or a connection block, respectively.
The flexibility of a switch block, $F_s$, is defined to be the total number of
possible connections offered to each wire segment. The flexibility of a connection
block, $F_c$, is defined as the number of wires to which each pin of a logic block
can connect. The following describes the routing architectures of two important
commercial FPGA families, Xilinx and Actel.
Figure 3.3. The Architecture of Xilinx 3000 FPGA (labeled parts: logic blocks (LB),
switch blocks, general purpose interconnect, direct interconnect, horizontal and
vertical long lines, routing switch)
The Xilinx Routing Architecture
Figure 3.3 illustrates the routing architecture used in the Xilinx 3000 series
FPGAs. Connections are made from the logic block into the channel through
a connection block. Since each connection site is large because of the SRAM
programming technology, the Xilinx 3000 connection blocks connect each pin
to only two or three of the five tracks passing by a block. On all four sides
of the logic block there are connection blocks that connect a total of 11 different
logic block pins to the wire segments. Once a logic pin is connected
via the connection block, the switch block makes connections between segments in
intersecting horizontal and vertical channels. Each wire segment can connect to
five or six of a possible 15 wire segments on the opposite sides. There are four
types of wire segments provided in the Xilinx 3000 architecture:
-General-purpose interconnect, consisting of wire segments that pass
through switches in the switch block.
-Direct interconnect, consisting of wire segments that connect each
logic block output directly to its four nearest neighbors.
-Long lines, which span the length or width of the chip, providing
high-fanout uniform delay connections.
-Clock line, which is a single net that spans the entire chip and is
driven by a high-drive buffer.
Figure 3.4. The Architecture of Actel FPGA (labeled parts: logic blocks (LB),
antifuses, input segments, output segments, wiring segments, vertical tracks)
The Actel Routing Architecture
The Actel routing architecture is asymmetric because there are more
general-purpose tracks in the horizontal direction than in the vertical direction.
The connection block of the Actel routing architecture is shown in Figure 3.4.
The connectivity of Actel FPGAs differs for input and output pins. Each
input pin can connect to all of the tracks in the channel that is on
the same side as the pin. The output pins extend across two channels above the
logic block and two channels below it, and an output pin can connect to every track
in all four channels that it crosses. There is no separate switch block in the
Actel architecture. Instead, the switching is distributed throughout the horizontal
channels. Every vertical track can make a connection with every incident
horizontal track. Each horizontal channel consists of 22 routing tracks, and
each track is broken up into segments of different lengths. There are three types
of vertical segments: input segments, output segments and freeways, which travel
either the entire height of the chip or some significant portion of it. Freeways
allow signals to travel longer vertical distances than permitted by output segments.
3.2 Physical Design Automation of FPGAs
The physical design automation of FPGAs involves three main steps:
partitioning, placement and routing.
3.2.1 Partitioning
Partitioning is the separation of the logic into logic blocks. Partitioning has
both a logical and a physical component. The connections within a logic block
are constrained by the limited routing architecture and the limited number of
block outputs. However, the quality of the resulting partitioning depends
on how well the placement can be done. The logical component has been
investigated in the context of technology mapping in logic optimization.
3.2.2 Placement
Placement starts with the logic blocks and input-output blocks of the partitioned
netlist and decides which corresponding blocks on the chip should contain
them. The FPGA placement problem is very similar to the traditional standard
cell and gate array placement problems. Many existing placement algorithms
are applicable, such as simulated annealing, force-directed
relaxation and min-cut.
3.2.3 Routing
After the whole circuit has been placed, the pins of each multipoint net have to
be connected. There are several routing algorithms for different kinds of FPGA
architectures, and the routing problem in FPGAs is very complex, as it is in
standard cell and gate array designs. For simplicity, the routing problem can
be divided into two steps, as in the traditional routing problem: global routing
and detailed routing.
Global routing in FPGAs can be done by using a global router for standard
cell design. In general, such a global router divides the multipoint nets into
two-terminal nets and routes them along minimum-distance paths. While doing so, it
also tries to balance the densities of the channels. The global router defines a
coarse route for each connection by assigning it a sequence of channel segments.
After the paths between two-pin connections have been defined in terms of channels,
the detailed router chooses specific wiring segments to implement the channel
segments assigned during global routing.
Figure 3.5. General approach to FPGA routing a) Global routing b) Detailed
routing
3.3 Global Routing Problem in Design Automation of 
FPGAs
A global router chooses channels for each net and leaves the task of allocating
specific wiring segments and switches to the detailed router. For each net, global
routing in FPGAs determines which pins are actually to be connected. The
objective of the global router is to minimize the sum of the channel
densities of all channels. In many studies, the routing problem in FPGAs
is solved by directly allocating the segments, ignoring the global routing
phase. There is only one global router designed for FPGAs: PGAroute. This global
router is similar to the global routers for standard cells and uses the LocusRoute
global routing algorithm.
In the LocusRoute algorithm, the following three steps are executed for
each multi-pin net.
1) Net Division: Each multi-pin net is divided into a set of two-pin
connections using a minimum spanning tree algorithm.
Figure 3.6. Sample two bends routes
2) Route Generation and Evaluation: In this step, the possible paths between
the two pins of each two-pin net are considered and evaluated in terms
of a cost value, and the path with the lowest cost is chosen.
The method of choosing routes is based on paths that have two or fewer bends.
LocusRoute evaluates a subset of all two-bend routes between the two physical
pins and chooses the one with the lowest cost. The cost function is defined in
terms of the channel densities. Each wire segment and switch block is
represented as an element of an array called the cost array. Each element
$H_{ij}$ of the cost array contains the number of routes that pass through the wire
segment at $(i, j)$. The cost of a path $P$ is calculated as
$$\mathrm{Cost}(P) = \sum_{(i,j) \in P} H_{ij} \qquad (3.1)$$
3) Reconstruction: This step joins all two-pin connections back together and
assigns unique numbers to the distinct segments of the same net in each
channel.
LocusRoute uses an iterative technique: after all nets have been routed for the
first time, each net is sequentially ripped up and rerouted. These iterations
reduce the order dependency and improve the routing quality.
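The route generation and evaluation step can be illustrated with a small sketch. It is not the LocusRoute implementation: it assumes a simplified cost array `H` indexed by (row, column) cells, enumerates every shortest route with at most two bends between two cells, and picks the one with the lowest cost in the sense of Eq. (3.1). The names `straight`, `join`, `two_bend_routes`, `route_cost` and `cheapest_route` are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column) index into the cost array (assumed layout)


def straight(a: Cell, b: Cell) -> List[Cell]:
    """Cells on the axis-aligned segment from a to b, both ends included."""
    (r0, c0), (r1, c1) = a, b
    if r0 == r1:
        step = 1 if c1 >= c0 else -1
        return [(r0, c) for c in range(c0, c1 + step, step)]
    step = 1 if r1 >= r0 else -1
    return [(r, c0) for r in range(r0, r1 + step, step)]


def join(*legs: List[Cell]) -> List[Cell]:
    """Concatenate legs, dropping the corner cell shared by consecutive legs."""
    path: List[Cell] = []
    for leg in legs:
        path.extend(leg if not path else leg[1:])
    return path


def two_bend_routes(src: Cell, dst: Cell) -> List[List[Cell]]:
    """All shortest routes from src to dst that have at most two bends."""
    (r0, c0), (r1, c1) = src, dst
    routes = set()
    for c in range(min(c0, c1), max(c0, c1) + 1):  # horizontal-vertical-horizontal
        routes.add(tuple(join(straight(src, (r0, c)),
                              straight((r0, c), (r1, c)),
                              straight((r1, c), dst))))
    for r in range(min(r0, r1), max(r0, r1) + 1):  # vertical-horizontal-vertical
        routes.add(tuple(join(straight(src, (r, c0)),
                              straight((r, c0), (r, c1)),
                              straight((r, c1), dst))))
    return [list(p) for p in sorted(routes)]


def route_cost(H, path: List[Cell]) -> int:
    """Eq. (3.1): sum of the cost-array entries along the path."""
    return sum(H[r][c] for r, c in path)


def cheapest_route(H, src: Cell, dst: Cell) -> List[Cell]:
    """Lowest-cost route among the candidates with at most two bends."""
    return min(two_bend_routes(src, dst), key=lambda p: route_cost(H, p))
```

In a rip-up and reroute iteration, the entries of `H` along a net's old route would be decremented before `cheapest_route` is called again and incremented along the newly chosen route.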
3.4 Model of FPGA for Global Routing
A commercial FPGA consists of a two-dimensional regular array
of programmable logic blocks (LBs), a programmable routing network and
switch boxes (SBs) [3, 1, 2]. Logic blocks provide the functionality
of a circuit. The routing network makes connections between LBs and
input/output pads. The routing network of an FPGA consists of wiring segments and
connection blocks. In commercial SRAM-based FPGAs, the wiring segments provide
three types of routing resources [1]: channel segments, long lines and
direct interconnections. A horizontal (vertical) channel segment consists of a
number of parallel wire segments connecting two successive SBs in a horizontal
(vertical) channel. The SBs allow programmed interconnection between these
channel segments. Direct interconnections provide the connections between
neighboring LBs. Long lines cross the routing area of the FPGA vertically and
horizontally. Connection blocks provide connectivity from the input/output
pins of LBs to the wiring segments of the respective channel segments. Each
pin can be connected to a limited number of wiring segments in a channel;
this number is called the flexibility of the connection block [16]. In this work,
it is assumed that each LB pin can be connected to all wiring segments in the
respective channels. Therefore, we can omit the connection blocks from our FPGA model.
Figure 3.7. The FPGA model used for Global Routing (SB: switch box, LB: logic
block; labeled parts: vertical channel segment, horizontal channel segment)
Since the direct interconnections are used by neighboring LBs to provide minimum
propagation delay, and the long lines are used by signals that must
travel long distances (such as the global clock), these interconnection resources
are not considered in global routing. Hence, our FPGA model for global routing
considers only the LBs, SBs and channel segments. An FPGA can be
modeled as a two-dimensional array of LBs, which are connected to the vertical
and horizontal channel segments, and SBs, which make connections between
the horizontal and vertical channel segments (Fig. 3.7).
Figure 3.8. (a) The routing area of the two-pin net and its subnets, (b) The
possible routes for each subnet (labeled parts: source LB, source SB, target SB,
target LB, LS-subnet, SS-subnet, SL-subnet)
In this work, we divide all multi-pin nets into two-pin nets using a minimum
spanning tree algorithm [19], as in LocusRoute. Hence, a net refers to a two-pin
net here and hereafter. Consider the possible routings for a two-pin net
with a Manhattan distance of $d_h + d_v$, where $d_h$ and $d_v$ denote the horizontal
and vertical distances, respectively, between the two pins of the net on the
LB grid. The routing area of this net is restricted to a $(d_h+1) \times (d_v+1)$ LB
grid, as shown in Fig. 3.8.a. The shortest-distance routing of this net
can then be decomposed into three independent routings as follows. Each pin of
this net has only one neighbor SB in the optimal routing area. Hence, each
pin can be connected to its unique neighbor SB through either a horizontal
or a vertical channel segment (Fig. 3.8). Meanwhile, the optimal routing area
for the connection of these two unique SBs is restricted to a $d_h \times d_v$ SB grid
embedded in the LB grid (Fig. 3.8). By exploiting this fact, we further
subdivide each net into three two-pin subnets, referred to here as the LS, SS and
SL subnets (Fig. 3.8.b). Here, the LS and SL subnets represent the LB-to-SB
and SB-to-LB connections, respectively, and the SS subnet represents the
SB-to-SB connection for a particular net. Therefore, in routing the original net
we consider only two possible routings for each of the LS and SL subnets, and
$d_h + d_v - 2$ possible one- or two-bend routings for the SS subnet.
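The count of SS routes can be checked with a short counting argument. The sketch below assumes that the two SBs sit at opposite corners of the $d_h \times d_v$ SB grid, that only shortest routes with at most two bends are allowed, and that $d_h, d_v \ge 2$; the symbol $K$ for the number of routes is introduced here only for illustration.

```latex
% The two SBs are d_h - 1 columns and d_v - 1 rows apart on the SB grid.
\begin{align*}
\#\{\text{horizontal-vertical-horizontal routes}\} &= d_h
   && \text{(one per column that can carry the vertical leg)}\\
\#\{\text{vertical-horizontal-vertical routes}\} &= d_v
   && \text{(one per row that can carry the horizontal leg)}\\
\#\{\text{routes counted in both families}\} &= 2
   && \text{(the two one-bend, L-shaped routes)}\\
K &= d_h + d_v - 2
\end{align*}
```

For example, a $4 \times 3$ SB grid gives $K = 4 + 3 - 2 = 5$ candidate routes: three proper two-bend routes and two one-bend routes.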
We define an FPGA graph $F(L, S, C)$ for modeling the global routing problem
in FPGAs. This graph is a $P \times Q$ two-dimensional mesh, where $L$, $S$ and
$C$ denote the sets of LBs, SBs and channel segments, respectively. Here, $P$
and $Q$ are the numbers of horizontal and vertical channels in the FPGA. Each
grid point (vertex) $S_{pq}$ of the mesh represents the SB at horizontal channel $p$
and vertical channel $q$. Each cell $L_{pq}$ of the mesh represents the LB that is
adjacent to the four SBs $S_{pq}$, $S_{p,q+1}$, $S_{p+1,q}$ and $S_{p+1,q+1}$. Edges are
labeled such that the horizontal (vertical) edge $c^h_{pq}$ ($c^v_{pq}$) corresponds to
the channel segment between the two consecutive SBs $S_{pq}$ and $S_{p,q+1}$
($S_{p+1,q}$) on the horizontal (vertical) channel $p$ ($q$). Figure 3.9 displays
an $8 \times 6$ sample FPGA graph. The pins of the LS/SL and SS type subnets are then
assigned to the respective cell-vertex and vertex-vertex pairs of the graph, as
mentioned earlier.
Figure 3.9. The Cost Graph for FPGA model (R1: a possible route for the SS-subnet;
R2: two possible routes for the LS-subnet; R3: two possible routes for the SL-subnet)
The global routing problem reduces to searching for the most uniform possible
distribution of the routes for these subnets. A uniform distribution of the
routes is expected to increase the likelihood of finding a feasible routing in
the subsequent detailed routing phase. Hence, we need to define an objective
function that rewards balanced routings. We associate weights with the edges
of the FPGA graph in order to simplify the computation of the balance quality
of a given routing. The weight $w^h_{pq}$ ($w^v_{pq}$) of a horizontal (vertical) edge
$c^h_{pq}$ ($c^v_{pq}$) denotes the density of the respective channel segment. Here, the
density of a channel segment denotes the total number of nets passing through that
segment for a given routing. Using this model, we can express the balance
quality $B$ of a given routing $R$ as
$$B(R) = \sum_{p=1}^{P} \sum_{q=1}^{Q-1} \left(w^h_{pq}(R)\right)^2 + \sum_{q=1}^{Q} \sum_{p=1}^{P-1} \left(w^v_{pq}(R)\right)^2 \qquad (3.2)$$
As seen in Eq. (3.2), each channel segment contributes the square of its density
to the objective function, thus penalizing imbalanced routing distributions.
Hence, the global routing problem reduces to the minimization of the objective
function given in Eq. (3.2).
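A small sketch may help to see why the sum of squared densities rewards balance. It assumes that a routing is given as a list of routes and that each route is a list of channel-segment identifiers; the tuple identifiers such as `("h", 1, 1)` and the function names are hypothetical.

```python
from collections import Counter


def channel_densities(routes):
    """Number of routes passing through each channel segment."""
    densities = Counter()
    for route in routes:
        for segment in route:
            densities[segment] += 1
    return densities


def balance_quality(routes):
    """Eq. (3.2): sum of squared channel-segment densities (lower is better)."""
    return sum(d * d for d in channel_densities(routes).values())


# Four nets that can each use one of two parallel segments.
a, b = ("h", 1, 1), ("h", 2, 1)
crowded = [[a], [a], [a], [a]]   # all four nets share segment a
balanced = [[a], [a], [b], [b]]  # two nets on each segment
```

Both routings use the same total wiring, but the crowded one costs 4^2 = 16 while the balanced one costs 2^2 + 2^2 = 8, so minimizing Eq. (3.2) prefers the balanced distribution.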
Chapter 4
MFA SOLUTION FOR GLOBAL 
ROUTING IN FPGA
This chapter investigates the routing problem in static RAM based Field
Programmable Gate Arrays (FPGAs) implementing the non-segmented (Xilinx
based) network [27]. The architecture model of the FPGA used in the formulation
and the Mean Field Annealing (MFA) formulation of the global routing problem are
given in this chapter. Details of the experiments, the circuits used in them and
the results are presented at the end of this chapter.
4.1 MFA Formulation of Global Routing
The MFA algorithm is derived by analogy to the Ising and Potts models, which
are used to estimate the state of a system of particles, called spins, in thermal
equilibrium. In the Ising model, spins can be in one of two states, represented
by 0 and 1, whereas in the Potts model they can be in one of $K$ states. All
LS/SL subnets are represented by Ising spins since they have only two possible
routes. In the Ising spin encoding of each LS/SL subnet $m$, $u_m = 1$ ($0$) denotes
that the LB-to-SB or SB-to-LB routing is achieved through a single horizontal
(vertical) channel segment. Each SS subnet $n$ having $K_n > 2$ possible routes
is represented by a $K_n$-state Potts spin. The state of a $K_n$-state Potts spin is
represented using a $K_n$-dimensional vector
$$\mathbf{v}_n = [v_{n1}, \ldots, v_{nr}, \ldots, v_{nK_n}]^t \qquad (4.1)$$
where "$t$" denotes the vector transpose operation. Each Potts spin $\mathbf{v}_n$ is
allowed to be equal to one of the principal unit vectors
$\mathbf{e}_1, \ldots, \mathbf{e}_r, \ldots, \mathbf{e}_{K_n}$ and cannot
take any other value. The principal unit vector $\mathbf{e}_r$ is defined to be the
vector that has all its components equal to 0 except its $r$-th component, which is
equal to 1. Potts spin $\mathbf{v}_n$ is said to be in state $r$ if $\mathbf{v}_n = \mathbf{e}_r$.
Hence, a $K_n$-state Potts spin $\mathbf{v}_n$ is composed of $K_n$ two-state variables
$v_{n1}, \ldots, v_{nr}, \ldots, v_{nK_n}$, where $v_{nr} \in \{0, 1\}$, with the
following constraint
$$\sum_{r=1}^{K_n} v_{nr} = 1 \qquad (4.2)$$
If Potts spin $n$ is in state $r$ (i.e., $v_{nr} = 1$ for $1 \le r \le K_n$), we say that the
corresponding net $n$ is routed by using route $r$.
In the MFA algorithm, the aim is to find the spin values minimizing the
energy function of the system. In order to achieve this goal, the average
(expected) values $\langle u_m \rangle$ and
$\langle \mathbf{v}_n \rangle = [\langle v_{n1} \rangle, \ldots, \langle v_{nr} \rangle, \ldots, \langle v_{nK_n} \rangle]^t$
of all Ising and Potts spins, respectively, are computed and iteratively updated
until the system stabilizes at some fixed point. Note that for each Ising spin $m$,
$u_m \in \{0, 1\}$, i.e., $u_m$ can take only the two values 0 and 1, whereas
$\langle u_m \rangle \in [0, 1]$, i.e., $\langle u_m \rangle$ can take
any real value between 0 and 1. Similarly, for each Potts spin $n$, $v_{nr} \in \{0, 1\}$
whereas $\langle v_{nr} \rangle \in [0, 1]$. When the system has stabilized, the
$\langle u_m \rangle$ and $\langle v_{nr} \rangle$ values are expected to converge to
either 0 or 1, with the constraint $\sum_{r=1}^{K_n} \langle v_{nr} \rangle = 1$
for the Potts spins.
In order to construct an energy function it is helpful to associate the following
meaning with the values $\langle u_m \rangle$ for LS/SL subnets.
$\langle u_m \rangle = \mathcal{P}$(subnet $m$ is routed by using the horizontal channel segment)
$1 - \langle u_m \rangle = \mathcal{P}$(subnet $m$ is routed by using the vertical channel segment)
That is, $\langle u_m \rangle$ and $1 - \langle u_m \rangle$ denote the probabilities of
finding Ising spin $m$ in states 1 and 0, respectively. In other words,
$\langle u_m \rangle$ and $1 - \langle u_m \rangle$ denote
the probabilities of routing subnet $m$ through a single horizontal and vertical
channel segment, respectively. Similarly, for SS subnets represented with Potts
spins
$$\langle v_{nr} \rangle = \mathcal{P}(\text{subnet } n \text{ is routed through route } r) \quad \text{for } 1 \le r \le K_n \qquad (4.3)$$
That is, $\langle v_{nr} \rangle$ denotes the probability of finding Potts spin $n$ in
state $r$ for $1 \le r \le K_n$. In other words, $\langle v_{nr} \rangle$ denotes the
probability of routing net $n$ through route $r$. Here and hereafter, $u_m$ and $v_{nr}$
will be used to denote the respective expected values ($\langle u_m \rangle$ and
$\langle v_{nr} \rangle$, respectively) for the sake of simplicity. Now,
we formulate the total density cost of the global routing problem as an energy term
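$$E_B(\mathbf{U}, \mathbf{V}) = \sum_{p=1}^{P} \sum_{q=1}^{Q-1} \left[w^h_{pq}(\mathbf{U}) + w^h_{pq}(\mathbf{V})\right]^2 + \sum_{q=1}^{Q} \sum_{p=1}^{P-1} \left[w^v_{pq}(\mathbf{U}) + w^v_{pq}(\mathbf{V})\right]^2 \qquad (4.4)$$
where
$$w^h_{pq}(\mathbf{U}) = \sum_{m \ni c^h_{pq}} u_m \qquad w^h_{pq}(\mathbf{V}) = \sum_{n \ni c^h_{pq}} \; \sum_{r \in R_n,\, r \ni c^h_{pq}} v_{nr}$$
$$w^v_{pq}(\mathbf{U}) = \sum_{m \ni c^v_{pq}} (1 - u_m) \qquad w^v_{pq}(\mathbf{V}) = \sum_{n \ni c^v_{pq}} \; \sum_{r \in R_n,\, r \ni c^v_{pq}} v_{nr}$$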
where $\mathbf{U} = \{u_1, u_2, \ldots\}$ and $\mathbf{V} = \{\mathbf{v}_1, \mathbf{v}_2, \ldots\}$
represent the sets of Ising and Potts spins corresponding to the LS/SL and SS
subnets, respectively. For LS/SL subnets, "$m \ni c_{pq}$" denotes "for each LS/SL
subnet $m$ whose pair of pins share the horizontal or vertical channel segment
$c_{pq}$". For SS subnets, "$n \ni c_{pq}$" denotes "for each SS subnet $n$ whose routing
area contains the horizontal or vertical channel segment $c_{pq}$". Furthermore,
"$r \in R_n, r \ni c_{pq}$" denotes "for each possible route $r$ of SS subnet $n$ which
passes through the horizontal or vertical channel segment $c_{pq}$". Here,
$w_{pq}(\mathbf{U})$ and $w_{pq}(\mathbf{V})$ represent the probabilistic densities of the
horizontal or vertical channel segment $c_{pq}$ for the current routing states of the
LS/SL and SS subnets, respectively. Hence,
$w_{pq}(\mathbf{U}, \mathbf{V}) = w_{pq}(\mathbf{U}) + w_{pq}(\mathbf{V})$ represents the total
probabilistic density of the horizontal or vertical channel segment $c_{pq}$ for the
overall current routing state.
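The probabilistic densities and the energy term of Eq. (4.4) can be sketched as follows. The data layout is an assumption made for illustration: each Ising spin is a tuple `(u_m, horizontal_segment, vertical_segment)`, each Potts spin is a pair `(probabilities, routes)` where `routes[r]` lists the segments of route r, and segments are identified by arbitrary hashable labels.

```python
from collections import defaultdict


def expected_densities(ising, potts):
    """Probabilistic density w(U, V) of every channel segment."""
    w = defaultdict(float)
    for u, horizontal, vertical in ising:
        w[horizontal] += u        # subnet uses the horizontal segment with prob. u
        w[vertical] += 1.0 - u    # and the vertical segment with prob. 1 - u
    for probabilities, routes in potts:
        for v_nr, route in zip(probabilities, routes):
            for segment in route:
                w[segment] += v_nr  # route r contributes its probability v_nr
    return w


def energy(ising, potts):
    """Eq. (4.4): sum of squared probabilistic densities over all segments."""
    return sum(x * x for x in expected_densities(ising, potts).values())


# One LS subnet and one SS subnet with two candidate routes.
ising = [(0.5, "h1", "v1")]
potts = [([0.25, 0.75], [["h1", "v2"], ["v1", "v2"]])]
```

Here the densities are 0.75 on `h1`, 1.25 on `v1` and 1.0 on `v2`, so the energy is 0.5625 + 1.5625 + 1.0 = 3.125. When every spin takes a 0/1 value, the same function reduces to the balance quality of Eq. (3.2).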
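Mean field theory equations, needed to minimize the energy function $E_B$,
can be derived as
$$\phi_m(\mathbf{U}, \mathbf{V}) = E_B(\mathbf{U}, \mathbf{V})|_{u_m = 0} - E_B(\mathbf{U}, \mathbf{V})|_{u_m = 1} = -2\left[w^h_{pq}(\mathbf{U}, \mathbf{V}) - w^v_{pq}(\mathbf{U}, \mathbf{V}) - 2(u_m - 0.5)\right] \quad \text{where } c^h_{pq}, c^v_{pq} \in m \qquad (4.5)$$
for an Ising spin $m$ and
$$\phi_{nr}(\mathbf{U}, \mathbf{V}) = E_B(\mathbf{U}, \mathbf{V})|_{\mathbf{v}_n = \mathbf{0}} - E_B(\mathbf{U}, \mathbf{V})|_{\mathbf{v}_n = \mathbf{e}_r} = -2\left( \sum_{c^h_{pq} \in r} \left[w^h_{pq}(\mathbf{U}, \mathbf{V}) - v_{nr}\right] + \sum_{c^v_{pq} \in r} \left[w^v_{pq}(\mathbf{U}, \mathbf{V}) - v_{nr}\right] \right) \quad \text{for } 1 \le r \le K_n \qquad (4.6)$$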
for a Potts spin $n$, respectively. The mean field values $\phi_m$ and $\phi_{nr}$ can be
interpreted as the decreases in the energy function $E_B(\mathbf{U}, \mathbf{V})$ when
Ising and Potts spins $m$ and $n$ are assigned to states 1 and $r$, respectively.
Hence, $-\phi_m$ and $-\phi_{nr}$ may be interpreted as the decreases in the overall
solution quality caused by routing LS/SL and SS subnets $m$ and $n$ through the
horizontal channel and route $r$, respectively. Then, the $u_m$ and $v_{nr}$ values
are updated such that the probabilities of routing subnets $m$ and $n$ through the
horizontal channel and route $r$ increase with increasing mean field values
$\phi_m$ and $\phi_{nr}$, as follows:
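$$u_m = \frac{e^{\phi_m / T}}{1 + e^{\phi_m / T}} \qquad (4.7)$$
$$v_{nr} = \frac{e^{\phi_{nr} / T}}{\sum_{k=1}^{K_n} e^{\phi_{nk} / T}} \quad \text{for } r = 1, 2, \ldots, K_n \qquad (4.8)$$
respectively.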
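The update rules can be sketched in a few lines. The block below assumes the logistic form for Ising spins and the normalized exponential form for Potts spins described above, and it evaluates the Ising mean field from the total probabilistic densities of the horizontal and vertical segments of a subnet. The function names are hypothetical, and the overflow-safe rearrangements are an implementation choice of this sketch, not something stated in the text.

```python
import math


def ising_mean_field(w_h: float, w_v: float, u_m: float) -> float:
    """Mean field of an Ising spin from the total densities of its two segments.

    w_h and w_v include the spin's own contributions u_m and 1 - u_m; the
    term 2 * (u_m - 0.5) removes that self-contribution.
    """
    return -2.0 * (w_h - w_v - 2.0 * (u_m - 0.5))


def update_ising(phi: float, T: float) -> float:
    """Logistic update e^(phi/T) / (1 + e^(phi/T)), arranged to avoid overflow."""
    x = phi / T
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def update_potts(phis, T: float):
    """Normalized exponential update; subtracting the maximum avoids overflow."""
    top = max(phis)
    exps = [math.exp((p - top) / T) for p in phis]
    total = sum(exps)
    return [e / total for e in exps]
```

As a check of the Ising mean field, take a subnet with u_m = 0.25 whose horizontal and vertical segments carry densities 1.5 and 0.5 from the other subnets. The total densities are then 1.75 and 1.25, and the mean field is -2, which equals the energy difference (1.5^2 + 1.5^2) - (2.5^2 + 0.5^2) between assigning the spin to state 0 and to state 1. At very high temperature both updates return values close to the uniform ones, 0.5 and 1/K_n.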
After the mean field equations (Eqs. (4.5)-(4.6)) are derived, the MFA algorithm
can be summarized as follows. First, an initial high-temperature spin
average is assigned to each spin, and an initial temperature $T$ is chosen. Each
$u_m$ value is initialized to $0.5 \pm \delta_m$ and each $v_{nr}$ value is initialized to
$1/K_n \pm \delta_{nr}$, where $\delta_m$ and $\delta_{nr}$ denote randomly selected small
disturbance values. Note that $\lim_{T \to \infty} u_m = 0.5$ and
$\lim_{T \to \infty} v_{nr} = 1/K_n$. In each MFA iteration, the
mean field affecting a randomly selected spin is computed using either Eq. (4.5)
or Eq. (4.6). Then, the average of this spin is updated using either Eq. (4.7)
or Eq. (4.8). This process is repeated for a random sequence of spins until the
system stabilizes at the current temperature. The system is observed after
each spin update in order to detect convergence to an equilibrium state for
the given temperature. If the energy function $E_B$ does not decrease in most of the
successive spin updates, the system is considered to have stabilized at that
temperature. Then, $T$ is decreased according to a cooling schedule, and the
iterative process is restarted. At the end of the cooling schedule, each Ising spin
$m$ is set to state 1 if $u_m > 0.5$ and to state 0 otherwise. Similarly, the maximum
element of each Potts spin vector is set to 1 and all other elements are set to 0.
Then, the resulting global routing is decoded as mentioned earlier.
4.2 Implementation
The performance of the proposed MFA algorithm for the global routing problem 
is evaluated in comparison with the well-known LocusRoute algorithm [24].
The MFA global router is implemented efficiently as described in Section 4.1.
The average of each Ising spin $m$ is initialized by randomly selecting $u_m^{init}$
in the range $0.45 \le u_m^{init} \le 0.55$. Similarly, the average of each Potts spin $n$
is initialized by randomly selecting $K_n$ values $v_{nr}^{init}$ in the range
$0.9/K_n \le v_{nr}^{init} \le 1.1/K_n$
and normalizing them as $v_{nr}^{init} = v_{nr}^{init} / \sum_{k=1}^{K_n} v_{nk}^{init}$ for
$r = 1, 2, \ldots, K_n$. Note that the random
selections are made using a uniform distribution over the given ranges.
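The initialization can be sketched as below. The Ising range follows the text; the upper bound 1.1/K_n of the Potts range is an assumption of this sketch, chosen by symmetry with the Ising range, and the function names are hypothetical.

```python
import random


def init_ising(rng: random.Random) -> float:
    """Initial Ising spin average, uniform in [0.45, 0.55]."""
    return rng.uniform(0.45, 0.55)


def init_potts(k: int, rng: random.Random):
    """Initial Potts spin averages: k uniform draws, normalized to sum to 1.

    The draw range [0.9 / k, 1.1 / k] uses an assumed upper bound.
    """
    raw = [rng.uniform(0.9 / k, 1.1 / k) for _ in range(k)]
    total = sum(raw)
    return [x / total for x in raw]


rng = random.Random(0)       # fixed seed only to make the example repeatable
u_init = init_ising(rng)
v_init = init_potts(4, rng)
```

The normalization keeps each Potts vector on the constraint of Eq. (4.2) while leaving every component close to 1/K_n, which is the value the update rule approaches at high temperature.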
The initial temperature parameter used in the mean field computation is estimated
from the initial spin averages. The selection of the initial temperature
$T_0$ is crucial for obtaining a good routing. In previous applications of
MFA, it has been experimentally observed that spin averages tend to converge at a
critical temperature. Although some methods have been proposed for the estimation
of the critical temperature, we prefer an experimental way of computing
$T_0$ that is easy to implement and successful, as the results of the experiments
indicate. We compute the initial average mean field as
$$\phi^{init}_{avg} = \left( \sum_{m=1}^{N_m} \phi_m^{init} + \sum_{n=1}^{N_n} \sum_{k=1}^{K_n} \phi_{nk}^{init} \right) \Big/ \left( N_m + \sum_{n=1}^{N_n} K_n \right)$$
Note that the initial mean field values $\phi_m^{init}$ and $\phi_{nk}^{init}$ are computed
according to Eqs. (4.5) and (4.6) using the initial spin values $u_m^{init}$ and
$v_{nk}^{init}$. Here, $N_m$ and $N_n$
denote the total numbers of Ising and Potts spins, respectively, where
$N = N_m + N_n$ denotes the total number of spins (subnets). Then, the initial
temperature is computed as $T_0 = C \phi^{init}_{avg}$, where the constant $C$ is chosen
as 540 for all experiments.
The cooling schedule is an important factor in the performance of the MFA
global router. For a particular temperature, MFA proceeds with updates of randomly
selected unconverged spins until $\Delta E < \epsilon$ for $M$ consecutive iterations,
where $M = N$ initially and $\epsilon = 0.05$. Average spin values
are tested for convergence after each update. For an Ising spin $m$, if either
$u_m < 0.05$ or $u_m > 0.95$ is detected, then spin $m$ is assumed to have converged to
state 0 or state 1, respectively. For a Potts spin $n$, if $v_{nr} > 0.95$ is detected
for a particular $r = 1, 2, \ldots, K_n$, then spin $n$ is assumed to have converged to
state $r$. The cooling process is realized in two phases, slow cooling followed by
fast cooling, similar to the cooling schedules used for simulated annealing. In the
slow cooling phase, the temperature is decreased by $T = \alpha \times T$, where
$\alpha = 0.9$, until $T < T_0 / 1.5$. Then, in the fast cooling phase, $M$ is set to
$M/2$ and $\alpha$ is set to 0.8. The cooling schedule continues until 90% of the spins
converge. At the end of this cooling process, each unconverged Ising spin $m$ is
assumed to converge to state 0 or state 1 if $u_m < 0.5$ or $u_m > 0.5$, respectively.
Similarly, each unconverged Potts spin $n$ is assumed to converge to state $r$ where
$v_{nr} = \max\{v_{nk} : k = 1, 2, \ldots, K_n\}$. Then, the result is decoded as described in
Section 4.1, and the resulting global routing is found.
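The two-phase schedule can be illustrated on a toy problem that contains only Ising spins. This is a sketch under stated assumptions, not the router described here: each subnet is a hypothetical pair of segment labels `(horizontal, vertical)`, the initial temperature `T0` is supplied by the caller instead of being estimated, the stability test uses the absolute energy change, and a cap on the number of updates per temperature and a floor on the temperature are added as safety measures.

```python
import math
import random


def mfa_ising(subnets, T0, seed=0, eps=0.05):
    """Toy MFA loop; returns one 0/1 decision per subnet (1 = horizontal)."""
    rng = random.Random(seed)
    n = len(subnets)
    u = [rng.uniform(0.45, 0.55) for _ in range(n)]
    segments = sorted({s for pair in subnets for s in pair})

    def density(seg):
        # Expected number of subnets using this channel segment.
        return sum((u[m] if h == seg else 0.0) + ((1.0 - u[m]) if v == seg else 0.0)
                   for m, (h, v) in enumerate(subnets))

    def energy():
        return sum(density(s) ** 2 for s in segments)

    def converged(x):
        return x < 0.05 or x > 0.95

    T, alpha, M, fast = T0, 0.9, n, False
    # Cool until 90% of the spins converge (temperature floor is an assumption).
    while T > 1e-6 and sum(converged(x) for x in u) < 0.9 * n:
        quiet = 0
        for _ in range(100 * n):  # safety cap per temperature (an assumption)
            free = [m for m in range(n) if not converged(u[m])]
            if not free or quiet >= M:
                break
            m = rng.choice(free)
            h, v = subnets[m]
            before = energy()
            phi = -2.0 * (density(h) - density(v) - 2.0 * (u[m] - 0.5))
            x = phi / T
            u[m] = 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))
            quiet = quiet + 1 if abs(energy() - before) < eps else 0
        T *= alpha
        if not fast and T < T0 / 1.5:  # switch from slow to fast cooling
            fast, alpha, M = True, 0.8, max(1, M // 2)
    return [1 if x > 0.5 else 0 for x in u]


# Four subnets that all choose between the same two segments "a" and "b".
decisions = mfa_ising([("a", "b")] * 4, T0=4.0)
```

In this toy instance a third spin cannot settle on a segment that two converged spins already occupy, because its mean field then points to the other segment, so the decoded routing places two subnets on each segment.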
Table 4.1. MCNC benchmark circuits used in experiments
Circuit name   Number of nets   Number of 2-pin nets   FPGA size
9symml               71                 259              10x9
too_large           177                 519              14x13
apex7               124                 300              11x9
example2            197                 444              13x11
vda                 216                 722              16x15
alu2                137                 511              14x12
alu4                236                 851              18x16
term1                87                 202              9x8
C1355               142                 360              12x11
C499                142                 360              12x11
C880                173                 427              13x11
k2                  388                1256              21x19
z03D4               575                2135              26x25
buscntl             145                 392              12x11
dramfsm             389                1422              22x21
dma                 197                 771              17x15
z03                 575                2135              26x25
The LocusRoute algorithm is implemented as in [24]. Since LocusRoute
depends on the rip-up and reroute method, it is allowed to reroute the
circuits 5 times. No bend reduction has been done, as in [3]. Both algorithms
are implemented in the C programming language.
4.3 Experimental Results
This section presents an experimental performance evaluation of the proposed
MFA algorithm in comparison with the LocusRoute and Simulated Annealing (SA)
algorithms. All algorithms are tested on the global routing of thirteen ACM
SIGDA Design Automation (MCNC) benchmarks and four well-known FPGA
benchmark circuits on a SUN SPARC 10. Table 4.1 lists the properties
of these benchmark circuits.
These three algorithms yield the same total wiring length for global routing
since the two-or-fewer-bend routing scheme is adopted in all of them. The necessary
design automation steps, such as technology mapping and placement, were done
at the University of Toronto using the Chortle technology mapper [11] and the XAltor
placement tool.
Table 4.2. The Global Router results

             MFA                     PGA                     SA
Circuit      Cost   Dens   time      Cost    Dens   time     Cost    Dens   time
9symml       1.0    12.0   0.36      1.032   14     0.00     1.000   12.0     20.64
toolarge     1.0    16.0   0.88      1.071   17     0.06     1.003   16.0    113.90
apex7        1.0    14.0   0.42      1.073   16     0.00     0.935   14.0     31.46
example2     1.0    15.0   0.64      1.097   16     0.02     0.856   15.0     76.54
vda          1.0    17.0   0.42      1.055   18     0.10     1.002   17.0    207.80
alu2         1.0    17.0   0.30      1.080   17     0.02     0.928   17.0     91.44
alu4         1.0    17.0   0.68      1.073   19     0.10     0.966   17.0    288.78
term1        1.0    14.0   0.34      1.093   14     0.00     0.921   14.0     13.28
C1355        1.0    13.0   0.56      1.119   15     0.00     1.000   13.6     50.36
C499         1.0    15.0   0.48      1.075   16     0.00     1.003   15.0     44.58
C880         1.0    15.4   0.68      1.065   17     0.04     0.933   16.8     74.40
k2           1.0    20.2   0.94      1.038   22     0.20     0.952   20.0    712.10
z03D4        1.0    17.0   2.34      1.117   18     0.30     1.000   17.0   1821.12
buscntl      1.0    13.0   0.42      1.050   13     0.00     0.998   13.0     54.92
dramfsm      1.0    15.0   1.94      1.073   18     0.20     0.999   15.0    763.02
dma          1.0    15.0   1.96      1.084   16     0.10     0.972   15.0    216.80
z03          1.0    20.0   2.10      1.119   21     0.30     1.000   20.0   1837.86
Table 4.2 shows the performance results of these three algorithms for the benchmark circuits. The MFA algorithm is executed 10 times for each circuit, starting from different, randomly chosen initial configurations. The results given for the MFA algorithm in Table 4.2 are the averages of these executions. Global routing cost values of the solutions found by the algorithms are computed using Eq. (3.2) and then normalized with respect to those of MFA. In Table 4.2, maximum channel density denotes the number of routes assigned to the maximally loaded channels. That is, it denotes the minimum number of tracks required in a channel for 100% routability.
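The two quantities reported per router, the cost normalized with respect to MFA and the maximum channel density, can be computed as in the minimal sketch below. The channel names, route counts and raw cost values are hypothetical; the raw costs merely stand in for values of Eq. (3.2), which is not reproduced here.

```python
def max_channel_density(channel_routes):
    """Largest number of routes assigned to any single channel."""
    return max(channel_routes.values())


def normalize_costs(costs, reference="MFA"):
    """Divide every router's cost by the cost of the reference router."""
    ref = costs[reference]
    return {name: cost / ref for name, cost in costs.items()}


# Hypothetical data: routes per channel and raw routing costs.
channel_routes = {"h0": 7, "h1": 12, "v0": 9}
costs = {"MFA": 250.0, "PGA": 270.0, "SA": 245.0}

density = max_channel_density(channel_routes)
normalized = normalize_costs(costs)
```

With this normalization the reference router always has cost 1.0, which matches the constant MFA cost column of Table 4.2.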
As seen in Table 4.2, the global routing costs of the solutions found by MFA are 3.1%-10.5% better than those of LocusRoute. As also seen in this table, the maximum channel density requirements of the solutions found by MFA are lower than those of LocusRoute in almost all circuits; the exceptions are alu2 and term1, for which both algorithms obtain the same maximum channel density.
How the global router distributes the channel densities, how it decreases the maximum channel density, and how the detailed router completes the routing are important metrics for measuring the quality of global routers. The propagation delays of the nets, the number of switches used, and the number of tracks in a channel are considered when comparing global routers after completion of routing. The distribution of channel densities affects the number of tracks and switches, and also the propagation delay of the nets (because of the number of switches). In the next paragraphs, the results of the global routers are given in terms of these metrics. The balance costs of the SA and MFA global routers are not very different, but the execution time of SA is, on average over all circuits, 250 times longer than that of MFA.

Table 4.3. The SEGA detailed routing results in area optimization mode

            Routing Info.                         Delay Info.
            Total Segment          Shared         Avg. Delay               Max. Delay
Circuit     MFA    PGA    Imp      MFA    PGA     MFA     PGA     Imp      MFA      PGA
9symml       674    711   5.20      42     85      5.06    5.56    9.01     63.38    57.97
toolg       1803   1951   7.59      47    114     13.83   15.10    8.45    125.48   122.80
apex7        960   1026   6.43      36     63      9.88   10.64    7.15     70.97    77.65
exp2        1775   1893   6.23      42     56     10.08   11.98   15.86    101.31   121.88
vda         2760   2950   6.44      70    176     18.67   20.58    9.30    140.77   170.36
alu2        1580   1674   5.62      36    129      9.82    9.61   -2.12    129.24   110.30
alu4        3183   3424   7.04      67    203     16.58   17.08    2.93    153.88   163.30
term1        602    638   5.64      21     47      9.57    9.60    0.32     74.81    70.50
C1355       1299   1347   3.56      27     82     12.17   13.15    7.50    121.01   118.12
C499        1242   1296   4.17      37     82     11.64   12.02    3.15     79.75    94.46
C880        1575   1670   5.69      38     91     14.83   15.36    3.48    111.58   115.72
K2          5980   6323   5.42      88    306     25.77   27.54    6.43    244.35   229.54
Z03D4       7125   7700   7.47     227    555     12.75   13.60    6.26    190.62   191.65
bus-cntl    1128   1213   7.01      43     94      7.94    8.57    7.28    104.36   126.24
dram-fsm    4267   4648   8.20     174    403      6.19    6.68    7.35    140.61   157.05
dma         2300   2545   9.63      94    214     15.17   16.58    8.53    200.82   194.71
z03         7161   7870   9.01     267    533     13.05   14.40    9.39    193.18   192.93
The detailed router used in these experiments is called SEGA [20], for SEGment Allocator, and was developed specifically for SRAM-based FPGAs. The input of SEGA is a netlist of two-point connections, which is the output of the global router. To route the connections, SEGA allocates wire segments according to a cost function, basing its decisions on either of two goals: optimize for area or optimize for speed. For area optimization, only the routability of the circuit is considered, which means the cost function focuses only on the task of successfully routing 100% of the connections in a circuit. In delay optimization, SEGA selects the routes that have the best speed performance. The following assumptions are made in the experiments. All routing channels have an equal number of tracks. The flexibility of the channel blocks is equal to the number of tracks (each logic pin can connect to all tracks of a channel). The LocusRoute global routing algorithm is used in the PgaRoute global router (PGA). In the rest of this chapter, the PGA global router stands for the LocusRoute algorithm [23].
The SEGA detailed router routes the nets by considering either the area optimization or the speed optimization criterion. Therefore, all circuits are tested according to these two optimization criteria separately. The outputs of the MFA and PGA global routers were used as the input of the detailed router. The SEGA detailed router was then executed in two different modes (area and speed optimization) for each benchmark circuit. The results of the SEGA detailed router give routing information, which contains the total number of segments, the number of shared segments and the minimum channel width for 100% routing, and propagation delay information, which contains the average and maximum delay of the nets. Therefore, the quality of the MFA and PGA global routers is compared by considering this routing and delay information.

Table 4.4. The SEGA detailed routing results in speed optimization mode

            Routing Info.                         Delay Info.
            Total Segment          Shared         Avg. Delay               Max. Delay
Circuit     MFA    PGA    Imp      MFA    PGA     MFA     PGA     Imp      MFA      PGA
9symml       653    649  -0.62      63    147      5.07    5.28    3.94     56.46    48.67
toolg       1776   1822   2.52      74    243     13.34   13.06   -2.17    128.56   106.00
apex7        942    952   1.05      54    137      9.73    9.86    1.28     70.97    63.32
exp2        1746   1762   0.91      71    187     10.01   10.81    7.40     95.27    98.10
vda         2704   2774   2.52     126    352     19.07   19.10    0.17    148.30   164.71
alu2        1533   1542   0.58      83    261      9.46    9.56    1.07    127.29   128.45
alu4        3132   3193   1.91     118    434     16.17   16.29    0.76    145.32   147.41
term1        591    592   0.17      32     93      9.74    8.13  -19.82     76.82    46.33
C1355       1277   1269  -0.63      49    160     12.34   11.69   -5.59    126.73    98.27
C499        1225   1222  -0.25      54    156     11.66   10.72   -8.81     81.49    83.71
C880        1552   1567   0.96      61    194     14.39   14.01   -2.73    106.94   106.06
K2          5900   5995   1.58     168    634     27.05   26.50   -2.10    262.23   210.25
Z03D4       6965   7664   9.12     437   1191     12.42   12.34   -0.65    167.32   169.05
bus-cntl    1112   1114   0.18      59    193      8.03    7.95   -1.04     95.93    86.24
dram-fsm    4155   4305   3.48     286    746      6.05    6.61    8.54    140.61   146.57
dma         2243   2350   4.55     151    409     14.89   15.40    3.30    203.74   181.06
z03         6953   7205   3.50     475   1198     12.65   13.27    4.69    172.34   173.38
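The text does not state how the "Imp" columns of Tables 4.3 and 4.4 are computed. The reported values appear consistent with the relative reduction achieved by MFA with respect to PGA, expressed in percent; the sketch below rests on that inference, and the function name is hypothetical.

```python
def improvement(mfa_value, pga_value):
    """Percentage by which the MFA value is lower than the PGA value.

    Assumed formula (inferred from the tabulated numbers, not stated
    in the text): 100 * (PGA - MFA) / PGA.
    """
    return 100.0 * (pga_value - mfa_value) / pga_value


# Total segments of 9symml: area mode (Table 4.3) and speed mode (Table 4.4).
area_imp = round(improvement(674, 711), 2)
speed_imp = round(improvement(653, 649), 2)
```

A negative value, as in the speed-mode case, means that PGA produced the smaller number.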
Tables 4.3, 4.4 and 4.5 show the results of the SEGA detailed router whose inputs were constructed by the MFA and PGA routers. Table 4.3 presents the results for area optimization mode and Table 4.4 presents the results for speed optimization mode. As seen in Table 4.3, MFA needs fewer segments than PGA for complete routing: there is a 3%-9% improvement in the total number of segments used. MFA also yields a lower average propagation delay than PGA for all benchmark circuits except alu2 in Table 4.3. For most circuits, the average routing delay is decreased by 3%-15% with MFA relative to PGA. If we consider the number of tracks in a channel, MFA needs a smaller channel width in 6 benchmarks, but PGA routes 8 benchmarks with fewer tracks than MFA. For the other benchmark circuits, both PGA and MFA need the same channel width, as seen in Table 4.5. Finally, we can say that the MFA global router produces better results than the PGA global router with respect to area optimization, because MFA distributes the channel density more evenly than PGA. The SEGA detailed router results in speed optimization mode in Table 4.4 show that there are also improvements in the total number of segments, channel width and average delay. However, the percentage of improvement is smaller than in area optimization mode. Note that PGA yields a lower maximum delay than MFA for most of the circuits.

Table 4.5. Minimum Channel Width for 100% routing

            Channel Width (W)
            Area Opt. Mode     Speed Opt. Mode
Circuit     MFA     PGA        MFA     PGA
9symml      10 10      [two entries missing in the source]
toolg       13      11         13      12
apex7       11      13         12      15
exp2        13      17         14      19
vda         13      16         16      16
alu2        13      10         13      12
alu4        14      13         15      15
term1       10 11 10   [one entry missing in the source]
C1355       10      12         12      12
C499        13      11         14      11
C880        12      13         13      14
K2          15      16         19      19
Z03D4       14      14         15      15
bus-cntl    10      10         11      11
dram-fsm    13      11         13      13
dma         11      11         12      13
z03         16      14         16      16
The channel width is also an important criterion for routing because it affects the size of FPGAs. Table 4.5 shows the minimum number of tracks in a channel (channel width) for both area and speed optimization modes. As seen in this table, MFA gives better results for some circuits and PGA gives better results for others; therefore, the performance of MFA and PGA on channel width is very similar.
Figures 4.1 and 4.2 show pictures (left) and histograms (right) of the channel density distributions of the solutions found by MFA and LocusRoute, respectively, for the circuit C1355. The pictures are painted such that the darkness of each channel increases with increasing channel density. The global routing solutions found by these two algorithms are tested using the SEGA detailed router for FPGAs. Figure 4.3 illustrates the results of the SEGA detailed router for the circuit C1355.
Figure 4.1. Channel density distribution obtained by MFA for the circuit C1355
Figure 4.2. Channel density distribution obtained by LocusRoute for the circuit C1355

Figure 4.3. SEGA detailed router results of the circuit C1355 for the global routing solutions obtained by (a) MFA (b) LocusRoute
Chapter 5
THE MAPPING PROBLEM
This chapter introduces the mapping problem in parallel processing and its 
application.
5.1 The Mapping Problem
The use of parallel computers in various applications makes the problem of mapping parallel programs to parallel computers more crucial. The mapping problem arises while developing parallel programs for distributed-memory, message-passing parallel computers (multicomputers). In multicomputers, processors have neither shared memory nor a shared address space. Each processor can only access its local memory. Synchronization and coordination among processors are achieved through explicit message passing. Processors of a multicomputer are usually connected through one of the well-known direct interconnection network topologies such as ring, mesh, hypercube, etc. These architectures scale well owing to the lack of shared resources and the increasing bandwidth with increasing number of processors.
However, designing efficient parallel algorithms for such architectures is not straightforward. An efficient parallel algorithm should exploit the full potential power of the architecture. Processor idle time and interprocessor communication overhead may lead to poor utilization of the architecture and hence poor overall system performance. Processor idle time arises due to the uneven distribution of the computational load among the processors of the multicomputer. Parallel algorithm design for multicomputers can be divided into two phases. The first phase is the decomposition of the problem into a set of interacting sequential sub-problems (or tasks) which can be executed in parallel. The second phase is mapping each one of these tasks to a processor of the parallel architecture in such a way that the total execution time is minimized. This mapping phase, called the mapping problem [4], is crucial in designing efficient parallel programs.
For a class of regular problems with regular interaction patterns, the mapping problem can be efficiently resolved by a judicious choice of the decomposition scheme. In such problems, the chosen decomposition scheme yields an interaction topology that can be directly embedded into the interconnection network topology of the multicomputer. Such approaches can be referred to as intuitive approaches. However, intuitive mapping approaches yield good results only for a restricted class of problems, under simplifying assumptions. The mapping problem is known to be NP-hard [13]. Hence, heuristics giving sub-optimal solutions are used to solve the problem [4, 13, 21]. Two distinct approaches have been considered in the context of mapping heuristics: one-phase approaches and two-phase approaches. One-phase approaches, referred to as many-to-one mapping, try to map the tasks of the parallel program directly onto the processors of the multicomputer. In two-phase approaches, a clustering phase is followed by a one-to-one mapping phase. In the clustering phase, the tasks of the parallel program are partitioned into as many equally weighted clusters as the number of processors of the multicomputer, while minimizing the total weight of the inter-cluster interactions [21]. In the one-to-one mapping phase, each cluster is assigned to an individual processor of the multicomputer such that the total inter-processor communication is minimized [21].
In two-phase approaches, the problem solved in the clustering phase is identical to the multi-way graph partitioning problem. Graph partitioning is the balanced partitioning of the vertices of a graph into a number of bins such that the total cost of the edges in the edge cut set is minimized. The Kernighan-Lin (KL) heuristic [10, 17] is an efficient heuristic, originally proposed for the graph bipartitioning problem, which can also be used for clustering [21]. The KL heuristic is a non-greedy, iterative improvement technique that can escape from local minima by testing the gains of a sequence of moves in the search space before performing them. A variant of the KL heuristic can be used for solving the one-to-one mapping problem encountered in the second phase [15].
Simulated Annealing (SA) can also be used as a one-phase heuristic for solving the many-to-one mapping problem [15, 28]. SA has been applied successfully to the mapping problem in various works [15, 28]. It has been observed that the solutions obtained using SA are of superior quality compared with the results of the other heuristics.
5.2 The Model of Mapping Problem
In various classes of problems, interaction pattern among the tasks is static. 
Hence, the decomposition of the algorithm can be represented by a static task 
graph. Vertices of this graph represent the atomic tasks and the edge set 
represents the interaction pattern among the tasks. Relative computational 
costs of atomic tasks can be known or estimated prior to the execution of the 
parallel program. Hence, weights can be associated with the vertices in order 
to denote the computational costs of the corresponding tasks.
Several models exist for representing the static task interaction pattern. One of them is the Task Interaction Graph (TIG) model. In the TIG model, interaction patterns are represented by undirected edges between vertices. In this model, each atomic task can be executed simultaneously and independently. Each edge denotes the need for bidirectional interaction between the corresponding pair of tasks at the completion of the execution of these tasks. Edges may be associated with weights which denote the amount of bidirectional information exchange involved between pairs of tasks. A TIG usually represents the repeated execution of the tasks with intervening task interactions denoted by the edges.
The TIG model may seem to be unrealistic for general applications since it
does not consider the temporal interaction dependencies among the tasks [26]. 
However, there are various classes of problems which can be successfully mod­
eled with the TIG model. For example, iterative solution of systems of equa­
tions arising in finite element applications [7, 26] and power system simula­
tions, and VLSI simulation programs [28] are represented by TIGs. In this 
work, problems which can be represented by the TIG model are addressed.
In order to solve the mapping problem, the parallel architecture must also be modeled in a way that represents its architectural features. Parallel architectures can easily be represented by a Processor Organization Graph (POG), where nodes represent the processors and edges represent the communication links.
In a multicomputer architecture, each adjacent pair of processors communicate with each other over the communication link connecting them. Such communications are referred to as single-hop communications. However, each non-adjacent pair of processors can also communicate with each other by means of software or hardware routing. Such communications are referred to as multi-hop communications. Multi-hop communications are usually routed in a static manner over the shortest path of links between the communicating pairs of processors. Communications between non-adjacent pairs of processors can be associated with relative unit communication costs. The unit communication cost between a pair of processors is a function of the shortest path between these processors and the routing scheme used for multi-hop communications. For example, in software routing, the unit communication cost is linearly proportional to the shortest path distance between the pair of communicating processors. Hence, the communication topology of the multicomputer can be modeled by an undirected complete graph, referred to here as the Processor Communication Graph (PCG). The nodes of the PCG represent the processors and the weights associated with the edges represent the unit communication costs between pairs of processors. As mentioned earlier, the PCG can easily be constructed using the topological properties of the POG and the routing scheme utilized for inter-processor communication.
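As an illustration of how a PCG may be derived from a POG, the sketch below runs a breadth-first search from every processor of a small mesh and takes the hop count as the unit communication cost. That choice corresponds to the software-routing case mentioned above with a proportionality constant of 1, which is an assumption; the helper names `mesh_pog` and `build_pcg` are hypothetical.

```python
from collections import deque


def mesh_pog(rows, cols):
    """Adjacency lists of a rows x cols mesh; nodes are (row, col) pairs."""
    pog = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            pog[(r, c)] = [(a, b) for a, b in candidates
                           if 0 <= a < rows and 0 <= b < cols]
    return pog


def build_pcg(pog):
    """Map every ordered processor pair (p, q), p != q, to its hop distance."""
    pcg = {}
    for src in pog:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search over the links of the POG
            u = queue.popleft()
            for v in pog[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst, hops in dist.items():
            if dst != src:
                pcg[(src, dst)] = hops
    return pcg


pcg = build_pcg(mesh_pog(2, 3))
```

For a mesh the resulting weights coincide with Manhattan distances; the same routine works for any connected POG, for instance a hypercube given as adjacency lists.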
The objective in mapping the TIG to the PCG is the minimization of the expected execution time of the parallel program on the target architecture. Thus, the mapping problem can be modeled as an optimization problem by associating the following quality measures with a good mapping: (i) interprocessor communication overhead should be minimized, (ii) computational load should be uniformly distributed among processors in order to minimize processor idle time.
A mapping problem instance can be formally represented with two undirected graphs: the Task Interaction Graph (TIG) and the Processor Communication Graph (PCG). The TIG $G_T(V, E)$ has $|V| = N$ vertices labeled as $(1, 2, \ldots, i, j, \ldots, N)$. Vertices of $G_T$ represent the atomic tasks of the parallel program. Vertex weight $w_i$ denotes the computational cost associated with task $i$ for $1 \le i \le N$. Edge weight $e_{ij}$ denotes the volume of interaction between tasks $i$ and $j$ connected by edge $(i, j) \in E$. The PCG $G_P(P, D)$ is a complete graph with $|P| = K$ nodes and $|D| = \binom{K}{2}$ edges. Nodes of $G_P$, labeled as $(1, 2, \ldots, p, q, \ldots, K)$, represent the processors of the target multicomputer. Edge weight $d_{pq}$, for $1 \le p, q \le K$ and $p \ne q$, denotes the unit communication cost between processors $p$ and $q$.
Given an instance of the mapping problem with the TIG $G_T(V, E)$ and the PCG $G_P(P, D)$, the question is to find a many-to-one mapping function $M : V \to P$, which assigns each vertex of the graph $G_T$ to a unique node of the graph $G_P$, and minimizes the total interprocessor communication cost ($CC$)

$$CC = \sum_{(i,j) \in E} e_{ij} \, d_{M(i)M(j)} \qquad (5.1)$$

while maintaining the computational load ($CL_p$: computational load of processor $p$)

$$CL_p = \sum_{i \in V,\; M(i) = p} w_i, \qquad 1 \le p \le K \qquad (5.2)$$

of each processor balanced. Here, $M(i) = p$ denotes the label ($p$) of the processor that task $i$ is mapped to. In Eq. (5.1), each edge $(i, j)$ of $G_T$ contributes to the communication cost ($CC$) only if vertices $i$ and $j$ are mapped to two different nodes of $G_P$, i.e., $M(i) \ne M(j)$. The amount of contribution is equal to the product of the volume of interaction $e_{ij}$ between these two tasks and the unit communication cost $d_{pq}$ between processors $p$ and $q$, where $p = M(i)$ and $q = M(j)$. The computational load of a processor is the summation of the weights of the tasks assigned to that processor. Perfect load balance is achieved if $CL_p = (\sum_{i=1}^{N} w_i)/K$ for each $p$, $1 \le p \le K$. Computational load balance of the processors can be explicitly included in the cost function using a term which is minimized when all processor loads are equal. Another scheme is to include the load balance criterion implicitly in the algorithm.
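Equations (5.1) and (5.2) can be evaluated directly for a given mapping. The following minimal sketch uses a hypothetical four-task, two-processor instance; the data layout (dictionaries keyed by task and processor labels) and the convention that the cost of a processor to itself is 0 are illustrative assumptions.

```python
def communication_cost(edges, d, mapping):
    """Eq. (5.1): sum of e_ij * d_{M(i)M(j)} over the edges of the TIG."""
    return sum(e * d[mapping[i]][mapping[j]] for (i, j), e in edges.items())


def computational_loads(weights, mapping, processors):
    """Eq. (5.2): total weight of the tasks mapped to each processor."""
    loads = {p: 0 for p in processors}
    for task, w in weights.items():
        loads[mapping[task]] += w
    return loads


# Hypothetical instance: 4 tasks, 2 processors.
edges = {(1, 2): 3, (2, 3): 5, (3, 4): 2, (1, 4): 4}   # e_ij
weights = {1: 2, 2: 2, 3: 1, 4: 3}                     # w_i
d = {1: {1: 0, 2: 1}, 2: {1: 1, 2: 0}}                 # d_pq, with d_pp = 0
mapping = {1: 1, 2: 1, 3: 2, 4: 2}                     # M(i)

cc = communication_cost(edges, d, mapping)
loads = computational_loads(weights, mapping, processors=[1, 2])
perfect = sum(weights.values()) / len(loads)
```

In this instance only edges (2, 3) and (1, 4) cross processors, and both processors carry the perfectly balanced load.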
Figure 5.1 shows an example of the mapping problem. The TIG is shown in Fig. 5.1.a and a corresponding mapping instance in Fig. 5.1.b.

Figure 5.1. An example of mapping problem (annotation in the figure: Cutsize = 27)
Chapter 6
MFA SOLUTION FOR MAPPING
In this chapter, the general MFA formulation and a new efficient MFA formulation for the mapping problem in mesh and hypercube type multicomputers are proposed. The experimental results for randomly generated mapping instances and real problem instances are shown at the end of this chapter.
6.1 General MFA Formulation for Mapping Problem
The MFA algorithm is derived by analogy to the Ising and Potts models, which are used to estimate the state of a system of particles, called spins, in thermal equilibrium. In the Ising model, spins can be in one of two states represented by 0 and 1, whereas in the Potts model they can be in one of $K$ states. In this work we use the Potts model. In the $K$-state Potts model of $S$ spins, the states of the spins are represented using $S$ $K$-dimensional vectors

$$\mathbf{S}_i = [s_{i1}, \ldots, s_{ik}, \ldots, s_{iK}]^t \quad \text{for } i = 1, 2, \ldots, S,$$

where "t" denotes the transpose operation. The spin vector $\mathbf{S}_i$ is allowed to be equal to one of the principal unit vectors $\mathbf{e}_1, \ldots, \mathbf{e}_k, \ldots, \mathbf{e}_K$, and cannot take any other value. The principal unit vector $\mathbf{e}_k$ is defined to be the vector which has all its components equal to 0 except its $k$'th component, which is equal to 1. Spin $\mathbf{S}_i$ is said to be in state $k$ if $\mathbf{S}_i = \mathbf{e}_k$. Hence, a $K$-state Potts spin $\mathbf{S}_i$ is composed of $K$ two-state variables $\{s_{ik}\}_{k=1}^{K}$, where $s_{ik} \in \{0, 1\}$, with the following constraint

$$\sum_{k=1}^{K} s_{ik} = 1 \quad \text{for } i = 1, 2, \ldots, S. \qquad (6.1)$$
In the general encoding of the mapping problem, each spin vector corresponds to a vertex of the TIG $G(T, I)$. Hence, the number of spin vectors is $S = |T| = N$. The dimension $K$ of the spin vectors is equal to the number of processors. If a spin is in state $k$ (i.e., $s_{ik} = 1$), we say that the corresponding task is assigned to processor $k$.
In the MFA algorithm, the aim is to find the spin values minimizing the energy function of the system. In order to achieve this goal, the average (expected) value $\mathbf{V}_i = \langle \mathbf{S}_i \rangle$ of each spin vector $\mathbf{S}_i$ is computed and iteratively updated until the system stabilizes at some fixed point. Hence, we define

$$\mathbf{V}_i = [v_{i1}, \ldots, v_{ik}, \ldots, v_{iK}]^t = \langle \mathbf{S}_i \rangle = [\langle s_{i1} \rangle, \ldots, \langle s_{ik} \rangle, \ldots, \langle s_{iK} \rangle]^t \qquad (6.2)$$

That is, $v_{ik} = \langle s_{ik} \rangle$ for $i = 1, 2, \ldots, S$ and $k = 1, 2, \ldots, K$. Note that $s_{ik} \in \{0, 1\}$, i.e., $s_{ik}$ can take only the two values 0 and 1, whereas $v_{ik} \in [0, 1]$, i.e., $v_{ik}$ can take any real value between 0 and 1. As the system is a Potts glass, we have the following constraint similar to Eq. (6.1)

$$\sum_{k=1}^{K} v_{ik} = 1 \quad \text{for } i = 1, 2, \ldots, N. \qquad (6.3)$$

This constraint guarantees that each Potts spin $\mathbf{S}_i$ is in one of the $K$ states at a time, and each task is mapped to only one processor. In order to construct an energy function, it is helpful to associate the following meaning with the values $v_{ik}$: $v_{ik} = \mathcal{P}(\text{task } i \text{ is mapped to processor } k)$ for $i = 1, 2, \ldots, N$ and $k = 1, 2, \ldots, K$. That is, $v_{ik}$ is the probability of finding spin $i$ in state $k$. If $v_{ik} = 1$, then spin $i$ is in state $k$ and the corresponding configuration is $\mathbf{S}_i = \mathbf{V}_i$.
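Now, we formulate the communication cost of the mapping problem as an energy term

$$E^C(\mathbf{V}) = \sum_{(i,j) \in I} e_{ij} \sum_{k=1}^{K} \sum_{l \ne k} \mathcal{P}(\text{task } i \text{ is mapped to processor } k) \, \mathcal{P}(\text{task } j \text{ is mapped to processor } l) \, d_{kl}$$
$$= \frac{1}{2} \sum_{i=1}^{N} \sum_{j \in Adj(i)} \sum_{k=1}^{K} \sum_{l \ne k} e_{ij} \, v_{ik} \, v_{jl} \, d_{kl} \qquad (6.4)$$

where $\mathbf{V} = [\mathbf{V}_1, \ldots, \mathbf{V}_i, \ldots, \mathbf{V}_N]^t$ is the spin average matrix consisting of $N$ $K$-dimensional spin vectors as its rows. Here, $Adj(i)$ denotes the set of tasks connected to task $i$ in the given TIG. Minimization of $E^C$ corresponds to the minimization of the communication cost of the mapping problem. Another term of the energy function is the term for penalizing imbalanced mappings.

$$E^B(\mathbf{V}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i} w_i w_j \, \mathcal{P}(\text{tasks } i \text{ and } j \text{ are mapped to the same processor})$$
$$= \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i} \sum_{k=1}^{K} w_i w_j \, \mathcal{P}(\text{task } i \text{ is mapped to processor } k) \, \mathcal{P}(\text{task } j \text{ is mapped to processor } k)$$
$$= \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i} \sum_{k=1}^{K} w_i w_j \, v_{ik} \, v_{jk} \qquad (6.5)$$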
This triple summation term computes the summation of the inner products of the weights of the tasks assigned to individual processors. The global minimum of this term occurs when equal amounts of task weights are assigned to each processor. If there is an imbalance in the mapping, the $E^B$ term increases with the square of the amount of the imbalance, penalizing imbalanced mappings. The total energy function $E$ is defined in terms of $E^C$ and $E^B$ as

$$E(\mathbf{V}) = E^C(\mathbf{V}) + \beta E^B(\mathbf{V}) \qquad (6.6)$$

where the parameter $\beta$ is introduced to maintain a balance between the two optimization objectives of the mapping problem. The mean field theory equations needed to minimize the energy function $E$ can be derived as

$$\phi_{ik} = -\frac{\partial E(\mathbf{V})}{\partial v_{ik}} = -\sum_{j \in Adj(i)} \sum_{l \ne k} e_{ij} \, v_{jl} \, d_{kl} - \beta \sum_{j \ne i} w_i w_j \, v_{jk} \qquad (6.7)$$

The quantity $\phi_{ik}$ represents the $k$'th element of the mean field vector acting on spin $i$. Using the mean field values $\phi_{ik}$, the average spin values $v_{ik}$ can be updated using the Boltzmann distribution as

$$v_{ik} = \frac{e^{\phi_{ik}/T}}{\sum_{l=1}^{K} e^{\phi_{il}/T}} \quad \text{for } i = 1, 2, \ldots, N, \; k = 1, 2, \ldots, K \qquad (6.8)$$

where $T$ is the temperature parameter which is used to relax the system iteratively. Equation (6.8) handles the constraints given in Eq. (6.3), thus enforcing each Potts spin $\mathbf{S}_i$ to be in one of the $K$ states when it converges.
In Eq. (6.7), the first and second summation terms represent the increases in the total communication and imbalance costs, respectively, caused by mapping task $i$ to processor $k$. Hence, $-\phi_{ik}$ may be interpreted as the decrease in the overall solution quality caused by assigning task $i$ to processor $k$. Then, in Eq. (6.8), $v_{ik}$ is updated such that the probability of mapping task $i$ to processor $k$ increases with increasing mean field $\phi_{ik}$. After the mean field theory equations (Eq. (6.7), Eq. (6.8)) are derived, the MFA algorithm can be summarized as follows. First, an initial (high-temperature) spin average is assigned to each spin, and an initial temperature is chosen. At each temperature, starting with the initial spin averages, the mean field vector acting on a randomly selected spin is computed using Eq. (6.7). Then, the spin average vector is updated using Eq. (6.8). This process is repeated for a random sequence of spins until the system is stabilized for the current temperature. Then, $T$ is decreased according to the cooling schedule, and the iterative process is re-initiated. In [6] we have proposed an efficient implementation scheme which asymptotically reduces the complexity of an MFA iteration to $\Theta(d_{avg} K + K^2)$, where $d_{avg}$ denotes the average vertex degree in the TIG.
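The loop summarized above can be sketched in a few lines. This is a simplified illustration rather than the implementation used in this work: it evaluates the mean field naively instead of using the efficient scheme of [6], replaces the stabilization test with a fixed number of spin updates per temperature, and uses hypothetical parameter values (`beta`, `T0`, `alpha`, `T_min`, `sweeps`) and a hypothetical toy instance.

```python
import math
import random


def mean_field(i, k, v, adj, w, d, beta):
    """Mean field phi_ik: negated communication and imbalance increases."""
    K = len(d)
    comm = sum(e * v[j][l] * d[k][l]
               for j, e in adj[i].items() for l in range(K) if l != k)
    balance = sum(w[i] * w[j] * v[j][k] for j in range(len(w)) if j != i)
    return -comm - beta * balance


def update_spin(i, v, adj, w, d, beta, T):
    """Boltzmann (softmax) update of the spin average vector of task i."""
    phi = [mean_field(i, k, v, adj, w, d, beta) for k in range(len(d))]
    top = max(phi)  # subtracting the maximum avoids overflow in exp
    expo = [math.exp((p - top) / T) for p in phi]
    total = sum(expo)
    v[i] = [x / total for x in expo]


def mfa(adj, w, d, beta=1.0, T0=10.0, alpha=0.9, T_min=0.05, sweeps=20, seed=0):
    """Return a processor index for each task after annealing."""
    rng = random.Random(seed)
    N, K = len(w), len(d)
    v = []
    for _ in range(N):  # near-uniform averages with a small random perturbation
        row = [1.0 / K + rng.uniform(-0.01, 0.01) for _ in range(K)]
        total = sum(row)
        v.append([x / total for x in row])
    T = T0
    while T > T_min:
        for _ in range(sweeps * N):  # fixed count instead of a stabilization test
            update_spin(rng.randrange(N), v, adj, w, d, beta, T)
        T *= alpha  # geometric cooling (an assumption)
    return [max(range(K), key=lambda k: v[i][k]) for i in range(N)]


# Hypothetical instance: two pairs of communicating tasks, two processors.
adj = [{1: 2.0}, {0: 2.0}, {3: 2.0}, {2: 2.0}]   # adj[i][j] = e_ij
w = [1.0, 1.0, 1.0, 1.0]                         # task weights
d = [[0, 1], [1, 0]]                             # unit communication costs
mapping = mfa(adj, w, d)
```

Because every update normalizes the exponentials, each spin average vector keeps summing to 1, so the Potts constraint holds throughout the annealing.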
6.2 Interconnection-Topology Specific MFA Formulation for Mapping
In this section, we propose efficient Mean Field Annealing formulations for mesh-connected and hypercube-connected architectures.
6.2.1 MFA formulation for Mesh-Connected Architectures
Consider a $P$ by $Q$ two-dimensional mesh-connected architecture with $P$ rows and $Q$ columns. The encoding in the general MFA formulation summarized in Section 6.1 necessitates $N \times K = N \times P \times Q$ variables for the problem representation. In this section, we propose an MFA formulation for mesh-connected architectures which exploits the conventional routing scheme in mesh interconnection topologies to introduce a much more efficient encoding scheme. Note that the communication distance between any two processors is equal to the Manhattan distance between those two processors on the processor grid. Hence, the unit communication cost between any two processors can be expressed as the sum of two components: horizontal and vertical communication costs. Horizontal and vertical unit communication costs are equal to the column and row distances between the processor pairs, respectively. Thus, any edge $(i, j) \in I$ with weight $e_{ij}$ of the TIG will contribute

$$E^C_{ij} = E^H_{ij} + E^V_{ij} = e_{ij} \times |column(i) - column(j)| + e_{ij} \times |row(i) - row(j)| \qquad (6.9)$$

to the total communication cost, where $row(i)$ and $column(i)$ denote the row and column indices of the processor that task $i$ is mapped to and $|\cdot|$ denotes the absolute value function. Here, $E^H_{ij}$ and $E^V_{ij}$ denote the horizontal and vertical communication costs due to edge $(i, j) \in I$ of the TIG. Hence, the row and column mappings of each task are sufficient for efficient computation of the interprocessor communication cost in mesh-connected architectures.
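The decomposition in Eq. (6.9) can be illustrated with a small sketch. The function name and the representation of a processor as a (row, column) pair are hypothetical choices made only for this example.

```python
def edge_cost(e_ij, pos_i, pos_j):
    """Horizontal, vertical and total cost of one TIG edge on a mesh.

    pos_i and pos_j are the (row, column) indices of the processors
    that tasks i and j are mapped to.
    """
    (row_i, col_i), (row_j, col_j) = pos_i, pos_j
    horizontal = e_ij * abs(col_i - col_j)
    vertical = e_ij * abs(row_i - row_j)
    return horizontal, vertical, horizontal + vertical


# An edge of weight 3 between tasks placed at (0, 1) and (2, 4).
h_cost, v_cost, total_cost = edge_cost(3, (0, 1), (2, 4))
```

The total equals the edge weight times the Manhattan distance between the two processors, so the row and column assignments can be handled separately.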
Encoding
In the proposed encoding, we use two Potts spins of dimensions $P$ and $Q$ for each vertex (task) of the TIG. Spins of dimensions $P$ and $Q$ are used to encode the row and column mappings of the tasks, respectively. Note that this encoding also constructs a one-to-one mapping between the configuration space of the problem domain and the spin domain. However, it is much more efficient since it uses a total of $N \times (P + Q)$ two-state variables instead of the $N \times P \times Q$ two-state variables of the general encoding. Spins with dimensions $P$ and $Q$ are called row and column spins, which are labeled as $\mathbf{S}^r_i = [s^r_{i1}, \ldots, s^r_{ip}, \ldots, s^r_{iP}]^t$ and $\mathbf{S}^c_i = [s^c_{i1}, \ldots, s^c_{iq}, \ldots, s^c_{iQ}]^t$, respectively, for $i = 1, 2, \ldots, N$. If a row (column) spin is in state $p$ ($q$), we say that the corresponding task is mapped to row $p$ (column $q$). Hence, $s^r_{ip} = 1$ ($s^c_{iq} = 1$) means that task $i$ is mapped to row $p$ (column $q$) of the mesh. That is, if $s^r_{ip} = 1$ and $s^c_{iq} = 1$, task $i$ is mapped to processor $pq$ in the mesh. Here, processor $pq$ identifies the processor at row $p$ and column $q$ of the mesh.
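A short sketch may help to see the saving of the row/column encoding. It assumes 0-based indices and a row-major numbering of the processors, neither of which is prescribed by the text; all function names are hypothetical.

```python
def one_hot(state, size):
    """Two-state variables of a Potts spin that is in the given state."""
    return [1 if k == state else 0 for k in range(size)]


def encode_task(processor, P, Q):
    """Row spin and column spin of a task mapped to a processor index.

    Processors are numbered row by row from 0 (an assumption).
    """
    row, col = divmod(processor, Q)
    return one_hot(row, P), one_hot(col, Q)


def variable_counts(N, P, Q):
    """Two-state variables needed by the general and proposed encodings."""
    return N * P * Q, N * (P + Q)


row_spin, col_spin = encode_task(5, P=3, Q=4)
general, proposed = variable_counts(N=100, P=4, Q=8)
```

For 100 tasks on a 4 by 8 mesh the general encoding needs 3200 two-state variables, whereas the row/column encoding needs 1200.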
Energy Function Formulation
The following spin average vectors are defined for the sake of energy function 
formulation.
v; = [o',....o,',....,o'p|' = (sf) =
v; = (oi„...,or,,...,oj,j' = (si) = [«.),...,(4)....(»;,)]'
Note that, sjp, € {0 , 1 } ,  i.e., sjp and are discrete variables taking only two 
values 0 and 1 , whereas u[p, E [0 , 1 ], i.e., ujp and if, are continuous variables
CHAPTER 6. MFA SOLÂITION FOR MAFFIAC 44
taking any real value between 0 and 1 . As the system is a Potts glass we have 
the following constraints similar to Eq. (6.3)
p=l
(6. 10)
?=i
These constraints guarantee that each Potts spin $S_i^r$ ($S_i^c$) is in one of
the P (Q) states at a time, and each task is assigned to only one row (column)
for the proposed encoding. In order to construct an energy function it is
helpful to associate the following meanings with the $v_{ip}^r$ and $v_{iq}^c$
values:
$$v_{ip}^r = \mathcal{P}(\text{task } i \text{ is mapped to one of the processors in row } p),$$
$$v_{iq}^c = \mathcal{P}(\text{task } i \text{ is mapped to one of the processors in column } q) \tag{6.11}$$
for $i = 1, 2, \ldots, N$, $p = 1, 2, \ldots, P$ and $q = 1, 2, \ldots, Q$. That
is, $v_{ip}^r$ ($v_{iq}^c$) denotes the probability of finding row (column) spin
i in row p (column q). The formulation of the horizontal communication cost due
to edge (i, j) of the TIG as an energy term is:
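$$\begin{aligned}
E^{h}_{i,j}(V^c) &= e_{ij} \sum_{k=1}^{Q-1} \sum_{l=k+1}^{Q} (l-k) \times \{\mathcal{P}(\text{tasks } i \text{ and } j \text{ are mapped to columns } k \text{ and } l\text{, respectively}) \\
&\qquad\qquad + \mathcal{P}(\text{tasks } j \text{ and } i \text{ are mapped to columns } k \text{ and } l\text{, respectively})\} \\
&= e_{ij} \sum_{k=1}^{Q-1} \sum_{l=k+1}^{Q} (l-k)\,(v_{ik}^c v_{jl}^c + v_{jk}^c v_{il}^c)
\end{aligned} \tag{6.12}$$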
Similarly, the energy formulation for the vertical communication cost due to
edge (i, j) is
$$E^{v}_{i,j}(V^r) = e_{ij} \sum_{k=1}^{P-1} \sum_{l=k+1}^{P} (l-k)\,(v_{ik}^r v_{jl}^r + v_{jk}^r v_{il}^r) \tag{6.13}$$
The derivation of the mean field theory equations using the formulation of the
energy terms $E^{h}_{i,j}$ and $E^{v}_{i,j}$ given in Eqs. (6.12) and (6.13)
results in substantially complex expressions. Hence, we simplify the expressions
for $E^{h}_{i,j}$ and $E^{v}_{i,j}$ in order to get more suitable expressions for
the mean field theory equations. A close examination of Eqs. (6.12) and (6.13)
reveals the symmetry between the expressions for the $E^{h}_{i,j}$ and
$E^{v}_{i,j}$ terms, which can be obtained from each other by interchanging "r"
with "c" and "P" with "Q". Hence, algebraic simplifications will only be
discussed for the $E^{h}_{i,j}$ term. Similar steps can be followed for the
$E^{v}_{i,j}$ term.
We introduce the following notation for the sake of simplification of the
communication cost terms:
$$F_{ik}^c = \sum_{l=1}^{k} v_{il}^c, \qquad L_{ik}^c = \sum_{l=k}^{Q} v_{il}^c, \qquad F_{ik}^r = \sum_{l=1}^{k} v_{il}^r, \qquad L_{ik}^r = \sum_{l=k}^{P} v_{il}^r \tag{6.14}$$
Here, $F_{ik}^c$ and $L_{ik}^c$ denote the probabilities that task i is mapped to
one of the processors in the first k columns (i.e., columns $1, 2, \ldots, k$)
and the last $Q-k+1$ columns (i.e., columns $k, k+1, \ldots, Q$), respectively.
Similarly, $F_{ik}^r$ and $L_{ik}^r$ denote the probabilities that task i is
mapped to one of the processors in the first k rows and the last $P-k+1$ rows,
respectively. Using this notation and through some algebraic manipulations, the
expression for $E^{h}_{i,j}$ simplifies as:
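$$\begin{aligned}
E^{h}_{i,j}(V^c) &= e_{ij} \Big( \sum_{k=1}^{Q-1} \sum_{l=k+1}^{Q} (l-k)\, v_{ik}^c v_{jl}^c + \sum_{k=1}^{Q-1} \sum_{l=k+1}^{Q} (l-k)\, v_{jk}^c v_{il}^c \Big) \\
&= e_{ij} \Big( \sum_{k=1}^{Q-1} \sum_{l=1}^{k} \sum_{m=k+1}^{Q} v_{il}^c v_{jm}^c + \sum_{k=1}^{Q-1} \sum_{l=1}^{k} \sum_{m=k+1}^{Q} v_{jl}^c v_{im}^c \Big) \\
&= e_{ij} \Big( \sum_{k=1}^{Q-1} \sum_{l=1}^{k} v_{il}^c \sum_{m=k+1}^{Q} v_{jm}^c + \sum_{k=1}^{Q-1} \sum_{l=1}^{k} v_{jl}^c \sum_{m=k+1}^{Q} v_{im}^c \Big) \\
&= e_{ij} \Big( \sum_{k=1}^{Q-1} F_{ik}^c L_{j,k+1}^c + \sum_{k=1}^{Q-1} F_{jk}^c L_{i,k+1}^c \Big) \\
&= e_{ij} \sum_{k=1}^{Q-1} \big( F_{ik}^c L_{j,k+1}^c + F_{jk}^c L_{i,k+1}^c \big)
\end{aligned} \tag{6.15}$$
The second line follows because a pair of columns (l, m) with $l < m$ is counted
once for each k with $l \le k < m$, that is, $m - l$ times.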
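The equivalence between the double-sum form of Eq. (6.12) and the prefix/suffix form of Eq. (6.15) can be checked numerically. The following is a minimal sketch with $e_{ij} = 1$; the function names are hypothetical and the lists are 0-indexed, so `F[k]` here holds the sum of the first k+1 entries.

```python
def comm_cost_direct(vi, vj):
    """Double-sum form of Eq. (6.12) with e_ij = 1."""
    Q = len(vi)
    return sum((l - k) * (vi[k] * vj[l] + vj[k] * vi[l])
               for k in range(Q) for l in range(k + 1, Q))


def prefix_sums(v):
    """F[k] = v[0] + ... + v[k]."""
    out, acc = [], 0.0
    for x in v:
        acc += x
        out.append(acc)
    return out


def suffix_sums(v):
    """L[k] = v[k] + ... + v[Q-1]."""
    return prefix_sums(v[::-1])[::-1]


def comm_cost_prefix(vi, vj):
    """Simplified form of Eq. (6.15) with e_ij = 1."""
    Q = len(vi)
    Fi, Fj = prefix_sums(vi), prefix_sums(vj)
    Li, Lj = suffix_sums(vi), suffix_sums(vj)
    return sum(Fi[k] * Lj[k + 1] + Fj[k] * Li[k + 1] for k in range(Q - 1))


# Task i surely in the first column, task j surely in the fourth: distance 3.
one_hot_cost = comm_cost_prefix([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0])
```

The direct form costs a number of operations quadratic in Q per edge, while the prefix/suffix form is linear in Q once the F and L vectors are available.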
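Similarly, the expression for $E^{v}_{i,j}$ simplifies to
$$E^{v}_{i,j}(V^r) = e_{ij} \sum_{k=1}^{P-1} \big( F_{ik}^r L_{j,k+1}^r + F_{jk}^r L_{i,k+1}^r \big) \tag{6.16}$$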
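We formulate the energy term corresponding to the imbalance cost using
the same inner product approach adopted in the general formulation (Eq. (6.5))
as follows:
$$\begin{aligned}
E^{B}(V^r, V^c) &= \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i}^{N} w_i w_j\, \mathcal{P}(\text{tasks } i \text{ and } j \text{ are mapped to the same processor}) \\
&= \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i}^{N} w_i w_j \sum_{p=1}^{P} \sum_{q=1}^{Q} \mathcal{P}(\text{task } i \text{ is mapped to processor } pq)\, \mathcal{P}(\text{task } j \text{ is mapped to processor } pq) \\
&= \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i}^{N} \sum_{p=1}^{P} \sum_{q=1}^{Q} w_i w_j\, v_{ip}^r v_{iq}^c\, v_{jp}^r v_{jq}^c
\end{aligned} \tag{6.17}$$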
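The total energy term can be defined in terms of the communication cost terms
and the imbalance cost term as
$$H(V^r, V^c) = E^{h}(V^c) + E^{v}(V^r) + \beta E^{B}(V^r, V^c) \tag{6.18}$$
Here, $E^{h}(V^c)$ and $E^{v}(V^r)$ are the sums of the $E^{h}_{i,j}$ and
$E^{v}_{i,j}$ terms over the edges (i, j) of the TIG, and
$V^r = [V_1^r, \ldots, V_i^r, \ldots, V_N^r]^t$ and
$V^c = [V_1^c, \ldots, V_i^c, \ldots, V_N^c]^t$ denote the row and column
spin-average matrices consisting of N P-dimensional and N Q-dimensional vectors
as their rows, respectively.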
Derivation of the Mean Field Theory Equation
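The expected values $V_i^r$ and $V_i^c$ of each row and column spin $S_i^r$ and
$S_i^c$ are iteratively updated using the Boltzmann distribution as
$$\text{(a)}\;\; v_{ip}^r = \frac{e^{\phi_{ip}^r / T^r}}{\sum_{k=1}^{P} e^{\phi_{ik}^r / T^r}} \qquad \text{(b)}\;\; v_{iq}^c = \frac{e^{\phi_{iq}^c / T^c}}{\sum_{k=1}^{Q} e^{\phi_{ik}^c / T^c}} \tag{6.19}$$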
for $p = 1, 2, \ldots, P$ and $q = 1, 2, \ldots, Q$, respectively. Here, $T^r$
and $T^c$ denote the temperature parameters used for annealing the row and
column spin updates, respectively. Recall that the numbers of states of the row
and column spins are different (P and Q for row and column spins, respectively)
in the proposed encoding. As the convergence time and the temperature parameter
of the system depend on the number of states of the spins, we interpret the row
and column spins as different systems, i.e., the temperature parameters of the
row and column spins are different. Note that Eqs. (6.19.a) and (6.19.b) handle
the constraints given in Eq. (6.10), thus enforcing each row and column Potts
spin $S_i^r$ and $S_i^c$ to be in one of the P and Q states when they converge.
In the proposed MFA formulation, row and column spins are updated in an
alternating manner, i.e., each row spin update is followed by a column spin
update and vice versa. MFA iterations in which row and column spins are updated
will be referred to here as row and column iterations, respectively.
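The update in Eq. (6.19) is a normalized exponential of the mean field values. A minimal sketch, assuming the mean field vector is already available as a list; `potts_update` is a hypothetical name, and the subtraction of the maximum is a standard numerical-stability step that is not part of the original formulation (it leaves the result unchanged).

```python
import math


def potts_update(mean_field, temperature):
    """Return spin averages v_k = exp(phi_k / T) / sum_k exp(phi_k / T)."""
    m = max(mean_field)
    # Shifting by the maximum avoids overflow and does not change the ratios.
    weights = [math.exp((phi - m) / temperature) for phi in mean_field]
    total = sum(weights)
    return [wk / total for wk in weights]


v_mid = potts_update([1.0, 2.0, 0.5], 1.0)    # intermediate temperature
v_hot = potts_update([1.0, 2.0, 0.5], 1e9)    # close to uniform
v_cold = potts_update([1.0, 2.0, 0.5], 1e-3)  # close to one-hot
```

The result always sums to 1, so the constraint of Eq. (6.10) holds by construction. At high temperature the averages approach the uniform value, and at low temperature the state with the largest mean field dominates.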
In the proposed formulation, row and column mean field vectors $\Phi_i^r$
and $\Phi_i^c$ are to be computed in row and column iterations, respectively.
Each element $\phi_{ip}^r$ and $\phi_{iq}^c$ of the row and column mean field
vectors $\Phi_i^r = [\phi_{i1}^r, \ldots, \phi_{ip}^r, \ldots, \phi_{iP}^r]^t$
and $\Phi_i^c = [\phi_{i1}^c, \ldots, \phi_{iq}^c, \ldots, \phi_{iQ}^c]^t$
experienced by row and column Potts spins i denotes the decrease in the energy
function by assigning $S_i^r$ to $e_p$ and $S_i^c$ to $e_q$, respectively.
Hence, $-\phi_{ip}^r$ ($-\phi_{iq}^c$) may be interpreted as the decrease in the
overall solution quality by mapping task i to row p (column q). In other words,
$-\phi_{ip}^r$ ($-\phi_{iq}^c$) corresponds to the increase in the energy
function by mapping task i to row p (column q). Then, in Eq. (6.19.a)
(Eq. (6.19.b)), $v_{ip}^r$ ($v_{iq}^c$) is updated such that the probability of
mapping task i to row p (column q) increases with increasing mean field value
$\phi_{ip}^r$ ($\phi_{iq}^c$). Using the simplified expressions for the proposed
energy function in Eqs. (6.15), (6.16) and (6.17),
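$$\phi_{ip}^r = -\frac{\partial H(V^r, V^c)}{\partial v_{ip}^r} = \phi_{ip}^{r(C)} + \beta^r \phi_{ip}^{r(B)} = -\sum_{j \in Adj(i)} e_{ij} Z_{jp}^r - \beta^r w_i \sum_{j=1, j \ne i}^{N} w_j v_{jp}^r \sum_{q=1}^{Q} v_{iq}^c v_{jq}^c \tag{6.20}$$
$$\phi_{iq}^c = -\frac{\partial H(V^r, V^c)}{\partial v_{iq}^c} = \phi_{iq}^{c(C)} + \beta^c \phi_{iq}^{c(B)} = -\sum_{j \in Adj(i)} e_{ij} Z_{jq}^c - \beta^c w_i \sum_{j=1, j \ne i}^{N} w_j v_{jq}^c \sum_{p=1}^{P} v_{ip}^r v_{jp}^r \tag{6.21}$$
where
$$Z_{jp}^r = \sum_{k=1}^{p-1} F_{jk}^r + \sum_{k=p+1}^{P} L_{jk}^r \quad \text{and} \quad Z_{jq}^c = \sum_{k=1}^{q-1} F_{jk}^c + \sum_{k=q+1}^{Q} L_{jk}^c$$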
As seen in Eqs. (6.20) and (6.21), different balance parameters $\beta^r$ and
$\beta^c$ are used in the mean field computations of row and column iterations
since row and column spins are interpreted as different systems. Figure 6.1
illustrates the MFA algorithm proposed for the mapping problem for
mesh-connected architectures. Note that each iteration of the inner while-loop
(step 3.1) involves one row and one column iteration. Also note that a direct
computation of the energy differences $\Delta E^r$ and $\Delta E^c$ would
necessitate computing the total energy in Eq. (6.18) twice at each iteration of
the inner while-loop, which drastically increases the complexity of an MFA
iteration. Here, $\Delta E^r$ and $\Delta E^c$ represent the energy differences
due to the row and column spin updates, respectively. As is seen at Step 3.1.5,
we use the efficient energy difference computation scheme which we have
proposed for the general MFA formulation [6].
An Efficient Implementation Scheme
As mentioned earlier, the proposed MFA algorithm is an iterative process. The
complexity of a single MFA iteration is due mainly to the mean field
computations. As is seen in Eqs. (6.20) and (6.21), calculation of mean field
values is computationally very intensive. In this section, we propose an
efficient implementation scheme which reduces the complexity of mean field
computations, and hence the complexity of the MFA iteration, by asymptotic
factors. The mean field theory equations given in Section 6.2.1 reveal the
symmetry between the mean field vector computations in row and column
iterations. Hence, the proposed implementation scheme will only be discussed
for computing the mean field vector
$\Phi_i^r = [\phi_{i1}^r, \ldots, \phi_{ip}^r, \ldots, \phi_{iP}^r]^t$ in row
iterations. A similar discussion applies to the computation of the
$\Phi_i^c = [\phi_{i1}^c, \ldots, \phi_{iq}^c, \ldots, \phi_{iQ}^c]^t$ vector in
column iterations.

1. Get the initial temperatures $T_0^r$ and $T_0^c$, and set $T^r = T_0^r$ and $T^c = T_0^c$
2. Initialize the spin averages $V^r = [v_{11}^r, \ldots, v_{ip}^r, \ldots, v_{NP}^r]$
   and $V^c = [v_{11}^c, \ldots, v_{iq}^c, \ldots, v_{NQ}^c]$
3. WHILE temperatures $T^r$ and $T^c$ are in the cooling range DO
   3.1 WHILE $E^r$ and $E^c$ are decreasing DO
       3.1.1 Select tasks i and j at random for horizontal and vertical
             spins, respectively.
       3.1.2 Compute mean field vectors $\Phi_i^r$ and $\Phi_j^c$ experienced by
             row and column Potts spins i and j.
             $\phi_{ip}^r = -\sum_{h \in Adj(i)} e_{ih} Z_{hp}^r - \beta^r w_i \sum_{h=1, h \ne i}^{N} w_h v_{hp}^r \sum_{q=1}^{Q} v_{iq}^c v_{hq}^c$
             $\phi_{jq}^c = -\sum_{h \in Adj(j)} e_{jh} Z_{hq}^c - \beta^c w_j \sum_{h=1, h \ne j}^{N} w_h v_{hq}^c \sum_{p=1}^{P} v_{jp}^r v_{hp}^r$
       3.1.3 Compute the summations $\sum_{p=1}^{P} e^{\phi_{ip}^r / T^r}$ and $\sum_{q=1}^{Q} e^{\phi_{jq}^c / T^c}$
       3.1.4 Compute new row and column spin-average vectors $V_i^{r(new)}$ and $V_j^{c(new)}$
       3.1.5 Compute the energy changes $\Delta E^r$ and $\Delta E^c$
       3.1.6 Update row and column spin-average vectors:
             $V_i^r = V_i^{r(new)}$ and $V_j^c = V_j^{c(new)}$
   3.2 $T^r = \alpha \times T^r$ and $T^c = \alpha \times T^c$

Figure 6.1. The proposed efficient MFA algorithm for the mapping problem
for mesh-connected architectures.
Assume that row Potts spin i is selected at random in a row iteration
for updating its expected value vector $V_i^r$. We will first discuss the mean
field computations corresponding to the vertical communication cost. As
is seen in Eq. (6.20), these computations require the construction of the
$Z_j^r = [Z_{j1}^r, \ldots, Z_{jp}^r, \ldots, Z_{jP}^r]^t$ vector for each
vertex j adjacent to i in the TIG. The computation of an individual $Z_j^r$
vector necessitates the construction of the
$F_j^r = [F_{j1}^r, \ldots, F_{jp}^r, \ldots, F_{jP}^r]^t$ and
$L_j^r = [L_{j1}^r, \ldots, L_{jp}^r, \ldots, L_{jP}^r]^t$ vectors. These
two vectors can be constructed in $\Theta(P)$ time using the recursive equations
$$F_{jk}^r = F_{j,k-1}^r + v_{jk}^r \quad \text{for } k = 2, 3, \ldots, P, \quad \text{where } F_{j1}^r = v_{j1}^r \tag{6.22}$$
$$L_{jk}^r = L_{j,k+1}^r + v_{jk}^r \quad \text{for } k = P-1, P-2, \ldots, \quad \text{where } L_{jP}^r = v_{jP}^r \tag{6.23}$$
The direct computation of an individual $Z_{jp}^r$ value takes $\Theta(P)$ time.
Hence, the complexity of computing an individual $Z_j^r$ vector becomes
$\Theta(P^2)$. However, in the proposed scheme the elements of the $Z_j^r$
vector are computed in only $\Theta(P)$ time by exploiting the recursive
equation
$$Z_{j,p+1}^r = Z_{jp}^r + F_{jp}^r - L_{j,p+1}^r \quad \text{for } p = 1, 2, \ldots, P-1, \quad \text{where } Z_{j1}^r = \sum_{l=2}^{P} L_{jl}^r \tag{6.24}$$
Hence, the complexity of mean field computations corresponding to the vertical
communication cost term is $\Theta(d_i P)$ in a row iteration since the first
summation term in Eq. (6.20) requires the computation and weighted addition of
$d_i$ such $Z_j^r$ vectors. Here, $d_i$ denotes the degree of vertex i in the
TIG. Similarly, the complexity of mean field computations corresponding to the
horizontal communication cost term is $\Theta(d_i Q)$ when column spin i is
selected at random in a column iteration.
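The linear-time construction of the F, L and Z vectors described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the lists are 0-indexed, and the recurrence for Z is written in the form $Z_{p+1} = Z_p + F_p - L_{p+1}$, which follows from the definition of $Z_{jp}^r$ as a sum of F values below p and L values above p.

```python
def first_last_sums(v):
    """Return (F, L) with F[k] = v[0]+...+v[k] and L[k] = v[k]+...+v[P-1]."""
    P = len(v)
    F = [0.0] * P
    L = [0.0] * P
    F[0] = v[0]
    for k in range(1, P):
        F[k] = F[k - 1] + v[k]
    L[P - 1] = v[P - 1]
    for k in range(P - 2, -1, -1):
        L[k] = L[k + 1] + v[k]
    return F, L


def z_vector(v):
    """Z[p] = sum(F[:p]) + sum(L[p+1:]), computed with one pass."""
    P = len(v)
    F, L = first_last_sums(v)
    Z = [0.0] * P
    Z[0] = sum(L[1:])
    for p in range(1, P):
        Z[p] = Z[p - 1] + F[p - 1] - L[p]
    return Z


def z_vector_direct(v):
    """Quadratic-time reference implementation of the same definition."""
    F, L = first_last_sums(v)
    return [sum(F[:p]) + sum(L[p + 1:]) for p in range(len(v))]


# If task j sits surely in row index 2, Z[p] is the distance |p - 2|.
z_one_hot = z_vector([0.0, 0.0, 1.0, 0.0])
```

For a spin that has converged to a single row, the Z vector reduces to the row distance, which is why it appears weighted by $e_{ij}$ in the communication part of the mean field.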
As is seen in Eq. (6.20), the complexity of computing an individual mean
field value corresponding to the imbalance term is $\Theta(NQ)$. Since P such
values are computed in a row iteration, the total complexity of mean field
computations corresponding to the imbalance cost term becomes $\Theta(NPQ)$.
However, the complexity of these computations can be asymptotically reduced as
follows. The second summation term in Eq. (6.20) can be rewritten by
interchanging the order of summations as
$$w_i \sum_{j=1, j \ne i}^{N} w_j v_{jp}^r \sum_{q=1}^{Q} v_{iq}^c v_{jq}^c = w_i \sum_{q=1}^{Q} v_{iq}^c \sum_{j=1, j \ne i}^{N} w_j v_{jp}^r v_{jq}^c = w_i \sum_{q=1}^{Q} v_{iq}^c \big( W_{pq} - w_i v_{ip}^r v_{iq}^c \big) \tag{6.25}$$
$$\text{where} \quad W_{pq} = \sum_{j=1}^{N} w_j v_{jp}^r v_{jq}^c \tag{6.26}$$
Here, $W_{pq}$ denotes the total computational load of processor pq for the
current row and column spin values. In Eq. (6.25),
$W_{pq} - w_i v_{ip}^r v_{iq}^c$ denotes the weight of processor pq excluding
task i. Hence, Eq. (6.25) represents the increase in the imbalance cost term if
task i is assigned to row p (i.e., $v_{ip}^r$ is set to 1). In the proposed
implementation scheme, we maintain a P by Q processor weight matrix W consisting
of $W_{pq}$ values. The entries of this matrix are computed using Eq. (6.26)
only at the beginning of the algorithm. Then, while updating the expected value
vector $V_i^r$ of an individual Potts spin i, the W matrix is updated in
$\Theta(PQ)$ time using
$$W_{pq}^{(new)} = W_{pq}^{(old)} + w_i v_{iq}^c \big( v_{ip}^{r(new)} - v_{ip}^{r(old)} \big)$$
for $p = 1, 2, \ldots, P$ and $q = 1, 2, \ldots, Q$. Hence, computing Eq. (6.25)
for each $\phi_{ip}^{r(B)}$ value takes $\Theta(Q)$ time. Since P such values are
to be computed to construct the mean field vector, the total complexity of mean
field computations corresponding to the imbalance cost term reduces to
$\Theta(PQ)$ in a row iteration.
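The role of the weight matrix can be made concrete with a small sketch. It assumes the spin averages are stored as nested lists `Vr` (N by P) and `Vc` (N by Q); all function names are hypothetical. The incremental update shown is the one implied by the definition of $W_{pq}$ when only the row spin of task i changes.

```python
def weight_matrix(w, Vr, Vc):
    """W[p][q] = sum_j w[j] * Vr[j][p] * Vc[j][q], as in Eq. (6.26)."""
    N, P, Q = len(w), len(Vr[0]), len(Vc[0])
    return [[sum(w[j] * Vr[j][p] * Vc[j][q] for j in range(N))
             for q in range(Q)] for p in range(P)]


def imbalance_term(i, p, w, Vr, Vc, W):
    """Right-hand side of Eq. (6.25), using the maintained matrix W."""
    return w[i] * sum(Vc[i][q] * (W[p][q] - w[i] * Vr[i][p] * Vc[i][q])
                      for q in range(len(Vc[i])))


def imbalance_term_direct(i, p, w, Vr, Vc):
    """Left-hand side of Eq. (6.25), summing over all other tasks."""
    Q = len(Vc[i])
    return w[i] * sum(w[j] * Vr[j][p] * sum(Vc[i][q] * Vc[j][q] for q in range(Q))
                      for j in range(len(w)) if j != i)


def update_row_spin(i, new_row, w, Vr, Vc, W):
    """Replace the row spin averages of task i and patch W in place."""
    for p, new_v in enumerate(new_row):
        delta = new_v - Vr[i][p]
        for q, v_c in enumerate(Vc[i]):
            W[p][q] += w[i] * delta * v_c
    Vr[i] = list(new_row)


w = [2.0, 3.0, 1.0]
Vr = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
Vc = [[0.3, 0.7], [0.6, 0.4], [0.5, 0.5]]
W = weight_matrix(w, Vr, Vc)
terms_fast = [imbalance_term(0, p, w, Vr, Vc, W) for p in range(2)]
terms_slow = [imbalance_term_direct(0, p, w, Vr, Vc) for p in range(2)]
update_row_spin(0, [0.1, 0.9], w, Vr, Vc, W)
W_recomputed = weight_matrix(w, Vr, Vc)
```

After the in-place update, `W` matches a full recomputation, which is what allows the matrix to be built only once at the beginning of the algorithm.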
It should be noted here that column iterations also use and update the same
weight matrix W as is used and maintained in row iterations. The complexity
of mean field computations corresponding to the imbalance cost term is also
$\Theta(QP)$ in column iterations. Thus, the proposed scheme reduces the overall
complexity of mean field computations to $\Theta(d_{avg} P + PQ)$ and
$\Theta(d_{avg} Q + PQ)$ in row and column iterations, respectively. Here,
$d_{avg}$ denotes the average vertex degree in the TIG. After computing the mean
field vectors $\Phi_i^r$ and $\Phi_j^c$, the expected value vectors $V_i^r$ and
$V_j^c$ of row and column Potts spins i and j can be updated using Eq. (6.19.a)
and Eq. (6.19.b) in $\Theta(P)$ and $\Theta(Q)$ time in a row and column
iteration, respectively. The complexities of computing the energy differences
$\Delta E^r$ and $\Delta E^c$ as shown at step 3.1.5 of Fig. 6.1 are $\Theta(P)$
and $\Theta(Q)$ in a row and column iteration, respectively.
Therefore, the proposed implementation scheme reduces the complexity of
an individual row and column iteration to $\Theta(d_{avg} P + PQ)$ and
$\Theta(d_{avg} Q + PQ)$, respectively. Note that a row and a column iteration
pair corresponds to a single iteration of the general MFA formulation discussed
in Section 6.1. Hence, the proposed MFA scheme asymptotically reduces the
complexity of a single MFA iteration from $\Theta(d_{avg} PQ + (PQ)^2)$ of the
general MFA formulation to $\Theta(d_{avg}(P + Q) + PQ)$ for a P by Q mesh. For
a square mesh with K processors, this corresponds to an asymptotic complexity
reduction from $\Theta(d_{avg} K + K^2)$ to $\Theta(d_{avg} \sqrt{K} + K)$.
6.2.2 MFA Formulation For Hypercube Architecture
Consider an M-dimensional hypercube. The encoding in the general MFA formulation
summarized in Section 6.1 needs $N \times K$ variables for problem
representation. Here, N is the number of tasks and $M = \log(K)$, with the
logarithm taken to base 2. In this section, we propose a new MFA formulation for
hypercube multicomputers which necessitates $N \times \log(K)$ variables for
problem representation. For the sake of simplicity, some definitions about the
hypercube are given below. The communication distance between any two
processors is equal to the Hamming distance between those two processors. The
Hamming distance between two processors in a hypercube is defined as the number
of differing bits in the binary representations of the two processor ids.
Dimension i refers to the communication links between the processors whose ids
differ in the ith bit. An M-dimensional hypercube can be divided into two
(M-1)-dimensional subcubes along any dimension. Therefore, an M-dimensional
hypercube can be divided into two (M-1)-dimensional subcubes in M different
ways. We define two (M-1)-dimensional subcubes $H^i$ and $\bar{H}^i$, which are
constructed by dividing the M-dimensional hypercube along the ith dimension.
Subcube $H^i$ contains the processors whose ith bit is 1 and subcube
$\bar{H}^i$ contains the processors whose ith bit is 0. In Figure 6.2, the
3-dimensional hypercube is divided into two 2-dimensional subcubes in 3
different ways. In our new efficient formulation, each task is assigned to
subcubes instead of processors.
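The definitions above translate directly into bit operations on processor ids. A minimal sketch, assuming processors are numbered 0 to K-1 and dimensions are numbered from 0 at the least significant bit (a convention chosen for this illustration); the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which processor ids a and b differ."""
    return bin(a ^ b).count("1")


def split_along_dimension(M, i):
    """Split an M-dimensional hypercube along dimension i.

    Returns (ones, zeros): the processors whose bit i is 1 and the
    processors whose bit i is 0. Each list is an (M-1)-dimensional subcube.
    """
    ones = [p for p in range(2 ** M) if (p >> i) & 1]
    zeros = [p for p in range(2 ** M) if not (p >> i) & 1]
    return ones, zeros


distance = hamming_distance(0b101, 0b010)
ones, zeros = split_along_dimension(3, 0)
```

For the 3-dimensional hypercube, calling `split_along_dimension(3, i)` for i = 0, 1, 2 gives the three different divisions into two 2-dimensional subcubes.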
In hypercube topologies, using the Ising model is more suitable than the Potts
model, because in the Ising model spins can be in one of two states, represented
by 0 and 1. So, to encode the configuration space of the mapping problem, one
Ising spin is used for each of the M subcubes $H^m$, $m = 1, 2, \ldots, M$, of
the M-dimensional hypercube. In total, M Ising spins are used for each task i.
Here M is the number of dimensions of the hypercube, and if there are K
processors in the hypercube, then $M = \log(K)$.

Figure 6.2. Three different ways of dividing a 3-dimensional hypercube into two
2-dimensional subcubes
There will be a total of $N \times \log(K)$ Ising spins in the system for
encoding the configuration space of the problem. Note that this encoding
constructs a one-to-one mapping between the configuration space of the problem
domain and the spin domain. This encoding is much more efficient than the
general MFA encoding, which requires $N \times K$ spins.
The spin which is assigned to task i and corresponds to subcube $H^m$ of the
hypercube is labeled as $s_i^m$. If $s_i^m$ is 1, we say that the corresponding
task is mapped to one of the processors in the $H^m$ subcube.
The average $v_i^m = \langle s_i^m \rangle$ of each spin $s_i^m$ is computed and
iteratively updated until the system stabilizes at some fixed point. We define
$$v_i^m = \langle s_i^m \rangle \quad \text{where } m = 1, 2, \ldots, \log(K)$$
Here $s_i^m \in \{0, 1\}$, whereas $v_i^m \in [0, 1]$. In order to construct an
energy function, it is helpful to associate the following meaning with the
$v_i^m$ values:
$$v_i^m = \mathcal{P}\{\text{task } i \text{ is mapped to one of the processors in subcube } H^m\}$$
For simplicity, the energy computation is divided into two parts, the
interconnection communication energy term ($E_{com}$) and the imbalance energy
term ($E_{bal}$):
$$E = E_{com} + r \times E_{bal}$$
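We derive the interconnection communication energy function for the mapping
problem as follows.
$$E_{com} = \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i}^{N} e_{ij} \sum_{l=1}^{\log(K)} \mathcal{P}\{\text{task } i \text{ is mapped to one of the processors in } H^l\} \times \mathcal{P}\{\text{task } j \text{ is mapped to one of the processors in } \bar{H}^l\} \tag{6.27}$$
$$= \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i}^{N} e_{ij} \sum_{l=1}^{\log(K)} v_i^l \times (1 - v_j^l) \tag{6.28}$$
We consider the load imbalance term for each processor, so we formulate the
energy term corresponding to the imbalance cost as
$$\begin{aligned}
E_{bal} &= \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i}^{N} w_i w_j \sum_{p=1}^{K} \mathcal{P}\{\text{task } i \text{ is mapped to processor } p\} \times \mathcal{P}\{\text{task } j \text{ is mapped to processor } p\} \\
&= \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i}^{N} w_i w_j \sum_{p=1}^{K} \mathcal{S}_i^p \mathcal{S}_j^p
\end{aligned} \tag{6.29}$$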
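Here, $\mathcal{S}_i^p$ is the probability that task i is mapped to processor p.
For example, for a 4-dimensional hypercube, the probability that task i is
mapped to processor 9 is
$\mathcal{S}_i^9 = [s_i^4 \bar{s}_i^3 \bar{s}_i^2 s_i^1] = (s_i^4 \times (1 - s_i^3) \times (1 - s_i^2) \times s_i^1)$.
We define $\mathcal{S}_i^p$ as
$$\mathcal{S}_i^p = \prod_{l=1}^{\log(K)} z_i^l \quad \text{where } z_i^l = m s_i^l + \bar{m} (1 - s_i^l) \tag{6.30}$$
Here $z_i^l$ is $s_i^l$ or $(1 - s_i^l)$ according to the binary representation
of the processor number p. In equation (6.30), m is 1 or 0 if the l-th bit of
the processor number is 1 or 0. The total energy term can be defined in terms of
the communication cost term and the imbalance term as
$$E = E_{com} + r \times E_{bal} = \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i}^{N} e_{ij} \sum_{l=1}^{\log(K)} v_i^l (1 - v_j^l) + \frac{r}{2} \sum_{i=1}^{N} \sum_{j \ne i}^{N} w_i w_j \sum_{p=1}^{K} \mathcal{S}_i^p \mathcal{S}_j^p \tag{6.31}$$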
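The product in Eq. (6.30) can be sketched in a few lines. This illustration uses the spin averages (probabilities) in place of the spin values, stores them in a list indexed from the least significant bit, and numbers bits from 0; these conventions and the name `processor_probability` are assumptions of the sketch.

```python
def processor_probability(v_i, p):
    """Probability that a task is mapped to processor p.

    v_i[l] is the probability that bit l of the task's processor id is 1.
    Each factor is v_i[l] if bit l of p is 1, and 1 - v_i[l] otherwise.
    """
    prob = 1.0
    for l, v in enumerate(v_i):
        prob *= v if (p >> l) & 1 else 1.0 - v
    return prob


# Processor 9 is 1001 in binary: bits 0 and 3 are 1, bits 1 and 2 are 0.
certain = processor_probability([1.0, 0.0, 0.0, 1.0], 9)
uniform = processor_probability([0.5, 0.5, 0.5, 0.5], 9)
total = sum(processor_probability([0.2, 0.7, 0.5, 0.9], p) for p in range(16))
```

Summing the probabilities over all K processors gives 1, so the M Ising averages of a task define a proper distribution over processors.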
In the MFA algorithm, the expected values $v_i^m$ of each spin $s_i^m$ are
iteratively updated using the Boltzmann distribution as
$$v_i^m = \frac{1}{1 + e^{-\phi_i^m / T}} \tag{6.32}$$
Each $\phi_i^m$ denotes the decrease in the energy function. Hence, for the
formulation of the mapping problem for the hypercube, $-\phi_i^m$ may be
interpreted as the decrease in the overall solution quality by assigning task i
to one of the processors in subcube $H^m$. In this work the mean field values
are computed as
$$\phi_i^m = \phi_{com,i}^m + r \times \phi_{bal,i}^m$$
The mean field value coming from the communication energy term is calculated as
$$\phi_{com,i}^m = -\frac{\partial E_{com}}{\partial v_i^m} = \sum_{j \in Adj(i)} e_{ij} \Big( v_j^m - \frac{1}{2} \Big) \tag{6.33}$$
Here, if $\phi_{com,i}^m$ is positive then $v_i^m$ is attracted to 1. This means
that the probability that task i is mapped to one of the processors whose m-th
bit is 1 increases. Also, if $\phi_{com,i}^m$ is negative then $v_i^m$ is
attracted to 0. This means that the probability that task i is mapped to one of
the processors whose m-th bit is 1 decreases. The computation of the mean field
value for the communication cost takes $O(d_{avg})$ time, where $d_{avg}$ is the
average vertex degree of the TIG.
The second term of the mean field value, which comes from the imbalance energy
term, is calculated as
$$\phi_{bal,i}^m = -\frac{\partial E_{bal}}{\partial v_i^m} = -w_i \sum_{p=1}^{K} \sigma_p \Big( \prod_{l=1, l \ne m}^{\log(K)} z_i^l \Big) \sum_{j=1, j \ne i}^{N} w_j \mathcal{S}_j^p \tag{6.34}$$
Here $\sigma_p$ is 1 or -1 according to the m-th bit of the processor p. To
simplify equation (6.34), the product term is substituted by
$\mathcal{S}_i^p / z_i^m$ using equation (6.30):
$$\phi_{bal,i}^m = -w_i \sum_{p=1}^{K} \frac{\sigma_p \mathcal{S}_i^p}{z_i^m} \sum_{j=1, j \ne i}^{N} w_j \mathcal{S}_j^p \tag{6.35}$$
As seen in equation (6.35), the complexity of computing an individual mean field
value corresponding to the imbalance cost is $O(N \times K)$. However, the
complexity of the computation can be asymptotically reduced as follows.
$$\phi_{bal,i}^m = -w_i \sum_{p=1}^{K} \sigma_p \frac{\mathcal{S}_i^p}{z_i^m} \big( W^p - w_i \mathcal{S}_i^p \big) \tag{6.36}$$
$$\text{where} \quad W^p = \sum_{j=1}^{N} w_j \mathcal{S}_j^p \tag{6.37}$$
Here, $W^p$ denotes the weight of processor p for the current spin values. The
parenthesized term inside the summation (6.36) denotes the weight of processor
p excluding task i. Hence, (6.36) represents the increase in the imbalance cost
term if task i is assigned to processor p. The entries of the W vector are
computed using (6.37) at the beginning of the algorithm. Then, while updating
the expected value of an individual Ising spin i, the W vector is updated in
$O(K)$ time by using the iterative property of equation (6.37). If $s_i^m$ is
updated in an MFA iteration, then the W vector is updated as
$$W^{p(new)} = W^{p(old)} + w_i \big( \mathcal{S}_i^{p(new)} - \mathcal{S}_i^{p(old)} \big), \quad \text{where } \mathcal{S}_i^{p(new)} = \mathcal{S}_i^{p(old)} \times \frac{z_i^{m(new)}}{z_i^{m(old)}} \tag{6.38}$$
As each $\mathcal{S}_i^p$ value is updated in $O(1)$ time, updating the W vector
takes $O(K)$ time. Therefore, the total computation of the mean field value for
the imbalance cost term ($\phi_{bal,i}^m$) takes $O(K)$ time.
In Figure 6.3, another method is given for calculating the mean field value
for the imbalance cost term, which also takes $O(K)$ time.
If we add the mean field values from the communication cost term (6.33) and
the imbalance term (6.36), the mean field value for given spin i and subcube
$H^m$ is
$$\phi_i^m = \sum_{j \in Adj(i)} e_{ij} \Big( v_j^m - \frac{1}{2} \Big) - r\, w_i \sum_{p=1}^{K} \sigma_p \frac{\mathcal{S}_i^p}{z_i^m} \big( W^p - w_i \mathcal{S}_i^p \big) \tag{6.39}$$
As seen in (6.39), the total computation of the mean field value for given spin
i and dimension m takes $O(d_{avg} + K)$ time. The steps of the MFA algorithm
for hypercube topologies are very similar to those of the MFA algorithm for the
mesh. In this MFA algorithm one spin is selected randomly for each dimension.
Therefore, one MFA iteration requires $\log(K)$ mean field value computations.
So the complexity of one MFA iteration is
$O(d_{avg} \times \log K + K \times \log K)$, instead of
$O(d_{avg} \times K + K^2)$ in the traditional MFA algorithm.
sum = 0;
for k = 0 to $(K / 2^{m+1}) - 1$ do
    for l = 0 to $2^m - 1$ do
        $p = k \times 2^{m+1} + l$;
        $q = p + 2^m$;
        $VW^p = W^p - w_i \mathcal{S}_i^p$;
        $VW^q = W^q - w_i \mathcal{S}_i^q$;
        sum = sum + $\mathcal{S}_i^q (VW^q - VW^p)$;
    endfor
endfor
$\phi_{bal,i}^m = -w_i \times (\text{sum} / s_i^m)$

Figure 6.3. The mean field value calculation of given spin i of subcube $H^m$
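The pairwise scheme of Figure 6.3 can be checked against the direct sum over all processors. The sketch below is an illustration under stated assumptions: spin averages `V[i][m]` are used in place of the spin values, bits are numbered from 0, the pair (p, q) differs only in bit m with q holding the 1, and `V[i][m]` must be nonzero for the final division. All function names are hypothetical.

```python
def proc_prob(v_i, p):
    """Probability that a task with bit probabilities v_i sits on processor p."""
    prob = 1.0
    for l, v in enumerate(v_i):
        prob *= v if (p >> l) & 1 else 1.0 - v
    return prob


def balance_field_direct(i, m, w, V, W):
    """Sum over all K processors with sign +1/-1 by bit m of p."""
    K = 1 << len(V[i])
    total = 0.0
    for p in range(K):
        sigma = 1.0 if (p >> m) & 1 else -1.0
        rest = 1.0  # product of the factors for all bits except m
        for l, v in enumerate(V[i]):
            if l != m:
                rest *= v if (p >> l) & 1 else 1.0 - v
        total += sigma * rest * (W[p] - w[i] * proc_prob(V[i], p))
    return -w[i] * total


def balance_field_pairs(i, m, w, V, W):
    """Pairwise loop in the style of Figure 6.3."""
    K = 1 << len(V[i])
    acc = 0.0
    for k in range(K >> (m + 1)):
        for l in range(1 << m):
            p = (k << (m + 1)) + l   # bit m of p is 0
            q = p + (1 << m)         # same id with bit m set to 1
            s_q = proc_prob(V[i], q)
            vw_p = W[p] - w[i] * proc_prob(V[i], p)
            vw_q = W[q] - w[i] * s_q
            acc += s_q * (vw_q - vw_p)
    return -w[i] * (acc / V[i][m])


w = [1.0, 2.0, 3.0]
V = [[0.2, 0.7, 0.5], [0.9, 0.4, 0.6], [0.3, 0.3, 0.8]]
W = [sum(w[j] * proc_prob(V[j], p) for j in range(3)) for p in range(8)]
pairs = [balance_field_pairs(0, m, w, V, W) for m in range(3)]
direct = [balance_field_direct(0, m, w, V, W) for m in range(3)]
```

Both functions visit each processor once, so each runs in time linear in K for a given weight vector W.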
6.3 Performance Evaluation
This section presents the performance evaluation of the efficient MFA
formulation proposed for the mapping problem for mesh-connected architectures
in comparison with well-known mapping heuristics: simulated annealing (SA),
Kernighan-Lin (KL) and the general MFA formulation. Each algorithm is tested
using randomly generated mapping problem instances for mesh-connected
architectures. The following paragraphs briefly present the implementation
details of these algorithms.
The MFA algorithm proposed for the mapping problem for mesh topology is
implemented efficiently as described in Section 6.2.1. At the very beginning
of the algorithm, row and column spin averages are initialized to 1/P and 1/Q
plus a random disturbance term, so that the initial spin averages are uniformly
distributed in the ranges
$$0.9 \times \frac{1}{P} \le v_{ip}^{r(initial)} \le 1.1 \times \frac{1}{P} \quad \text{for } i = 1, 2, \ldots, N,\; p = 1, 2, \ldots, P$$
$$0.9 \times \frac{1}{Q} \le v_{iq}^{c(initial)} \le 1.1 \times \frac{1}{Q} \quad \text{for } i = 1, 2, \ldots, N,\; q = 1, 2, \ldots, Q$$
respectively. Note that $\lim_{T^r \to \infty} v_{ip}^r = 1/P$ and
$\lim_{T^c \to \infty} v_{iq}^c = 1/Q$. The initial temperatures and balance
parameters used in the mean field computation of the row and column iterations
are estimated using these initial random spin average values. Recall that, in
the mean field computations (Eqs. (6.20) and (6.21)) of row and column
iterations, the parameters $\beta^r$ and $\beta^c$ determine a balance between
the terms $\phi_{ip}^{r(C)}$ and $\phi_{ip}^{r(B)}$, and $\phi_{iq}^{c(C)}$ and
$\phi_{iq}^{c(B)}$, respectively. We compute the row mean field averages
$\langle \phi_{ip}^{r(C)} \rangle = (\sum_{i=1}^{N} \sum_{p=1}^{P} \phi_{ip}^{r(C)}) / NP$
and
$\langle \phi_{ip}^{r(B)} \rangle = (\sum_{i=1}^{N} \sum_{p=1}^{P} \phi_{ip}^{r(B)}) / NP$
using the initial $v_{ip}^r$ values. The column mean field averages
$\langle \phi_{iq}^{c(C)} \rangle$ and $\langle \phi_{iq}^{c(B)} \rangle$ are
computed similarly using the initial $v_{iq}^c$ values. Then, balance
parameters are computed as
$\beta^r = C_B \langle \phi_{ip}^{r(C)} \rangle / \langle \phi_{ip}^{r(B)} \rangle$
and
$\beta^c = C_B \langle \phi_{iq}^{c(C)} \rangle / \langle \phi_{iq}^{c(B)} \rangle$,
where $C_B$ is chosen as 5.6. Our experiments show that computing $\beta^r$ and
$\beta^c$ using this method is sufficient for obtaining balanced partitions.
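The initialization of the spin averages can be sketched as below. The function name `init_spin_averages` and the use of Python's `random.Random` generator are assumptions of this illustration; as in the description above, each entry is drawn independently, so a row is only approximately normalized before the first update.

```python
import random


def init_spin_averages(n_tasks, n_states, rng):
    """Draw each spin average uniformly from [0.9/n_states, 1.1/n_states]."""
    return [[rng.uniform(0.9, 1.1) / n_states for _ in range(n_states)]
            for _ in range(n_tasks)]


rng = random.Random(1994)                 # fixed seed, for reproducibility only
Vr = init_spin_averages(400, 4, rng)      # row spins, P = 4
Vc = init_spin_averages(400, 8, rng)      # column spins, Q = 8
```

The first application of the Boltzmann update restores exact normalization of every spin-average vector.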
Selection of the initial temperature parameters $T_0^r$ and $T_0^c$ is crucial
for obtaining good quality solutions. In previous applications of MFA [18, 22],
it is experimentally observed that spin averages tend to converge at a critical
temperature. Although there are some methods proposed for the estimation of the
critical temperature, we prefer an experimental way for computing $T_0^r$ and
$T_0^c$ which is easy to implement and successful, as the results of the
experiments indicate. After the balance parameters $\beta^r$ and $\beta^c$ are
fixed, average row and column mean fields are computed as
$\langle \phi_{ip}^r \rangle = (\sum_{i=1}^{N} \sum_{p=1}^{P} \phi_{ip}^r) / NP$
and
$\langle \phi_{iq}^c \rangle = (\sum_{i=1}^{N} \sum_{q=1}^{Q} \phi_{iq}^c) / NQ$.
Then $T_0^r$ and $T_0^c$ are computed using
$T_0^r = C_T \langle \phi_{ip}^r \rangle / P$ and
$T_0^c = C_T \langle \phi_{iq}^c \rangle / Q$, where $C_T$ is chosen as 20. Note
that both $T_0^r$ and $T_0^c$ are inversely proportional to the dimensions of
the row and column Potts spins, respectively, which is also observed for the
critical temperature formulations presented in other MFA implementations
[18, 26].
The same cooling schedule is adopted for row and column iterations as follows.
At each temperature, row and column iterations proceed in an alternating
manner for randomly selected unconverged row and column spin updates until
$\Delta E^r < \epsilon$ and $\Delta E^c < \epsilon$ for M consecutive
iterations, respectively, where M = N initially and $\epsilon = 0.05$. Average
spin values are tested for convergence after each update. If one of the terms
$v_{ik}$ of a row or column spin average vector is detected to be greater than
0.95, that spin is assumed to converge to state k. The cooling process is
realized in two phases, slow cooling followed by fast cooling, similar to the
cooling schedules used for SA [22]. In the slow cooling phase, row and column
temperatures are decreased using $\alpha = 0.9$ until $T < T_0 / 1.5$ for both
row and column iterations. Then, in the fast cooling phase, M is set to M/4,
$\alpha$ is set to 0.7, and cooling for row and column iterations is continued
until 90% of the row and column spins converge, respectively. At the end of this
cooling process, the maximum element in each unconverged spin average vector is
set to 1 and all other elements in that vector are set to 0. Then, the result is
decoded as described in Section 6.2, and the resulting mapping is found. Note
that all parameters used in this implementation are either constants or found
automatically. Hence, there is no parameter setting problem for different
mapping instances.
The general MFA formulation summarized in Section 6.1 is implemented
efficiently as described in [6]. The initialization of spin averages, the
selection of the balance parameter $\beta$ and the initial temperature $T_0$ are
performed as described for the mesh-specific MFA implementation. The
expressions used for these computations can be found by replacing P and Q with
$K = P \times Q$ in the expressions described for the mesh-specific MFA
implementation. The parameters $C_T$ and $C_B$ are chosen as 0.5. The same
cooling schedule described for the mesh-specific MFA implementation is used in
the implementation of the general MFA formulation.
The two-phase approach is used to apply KL to the mapping problem. The KL
heuristic is implemented efficiently as described by Fiduccia and Mattheyses
(FM) [6] for the clustering phase. The recursive bisection scheme implemented
for the first phase recursively partitions the initial TIG into two clusters
until $K = P \times Q$ clusters are obtained. Here, K is assumed to be a power
of two. In the KLFM heuristic, computational load balance among clusters is
maintained implicitly by the algorithm. Vertex moves causing intolerable load
imbalance are not considered. The one-to-one mapping heuristic used in the
second phase is a variant of the KL heuristic. In this heuristic, communication
cost is minimized by performing a sequence of cluster swaps between the
processor pairs after an initial random mapping of K clusters [21].
The SA algorithm implemented in this work implicitly achieves load balance
among processors by defining the neighborhood of a configuration as all
configurations which result from moving one task from the processor with
maximum load to any other processor. Randomly selected possible moves which
decrease the communication cost are realized. Acceptance probabilities of
randomly selected moves that increase the communication cost are controlled
with a temperature parameter T, which is decreased using an automatic annealing
schedule [22]. Hence, as the annealing proceeds, acceptance probabilities of
uphill moves decrease.
Table 6.1. Total communication cost averages, normalized with respect to the
mesh-specific MFA, of the solutions found by SA, KL, general MFA and
mesh-specific MFA for randomly generated mapping problem instances for various
mesh sizes

       Problem Size          Average Communication Cost
     TIG          Mesh                         MFA
    N    davg     P x Q       KL      SA     Gen.    Mesh
   400    2       4 x 4      1.20    0.83    1.16    1.00
   400    2       4 x 8      2.62    0.76    1.09    1.00
   400    3       4 x 4      1.14    1.01    1.13    1.00
   400    3       4 x 8      1.96    0.94    1.07    1.00
   400    4       4 x 4      1.31    1.03    1.09    1.00
   400    4       4 x 8      1.92    0.97    1.08    1.00
   800    2       4 x 8      1.73    0.89    1.10    1.00
   800    2       8 x 8      2.61    0.88    1.30    1.00
   800    3       4 x 8      2.20    1.13    1.41    1.00
   800    3       8 x 8      2.88    1.06    1.00    1.00
   800    4       4 x 8      1.65    1.14    1.13    1.00
   800    4       8 x 8      2.55    1.17    1.20    1.00
  1600    2       8 x 8      1.61    0.99    0.93    1.00
  1600    2       8 x 16     2.89    1.05    1.15    1.00
  1600    3       8 x 8      1.57    0.99    0.96    1.00
  1600    3       8 x 16     2.47    1.00    1.13    1.00
  1600    4       8 x 8      2.03    1.17    1.31    1.00
  1600    4       8 x 16     3.39    0.93    1.26    1.00
6.4 Experimental Results
The mapping heuristics are evaluated by mapping randomly generated test TIGs
onto meshes of various sizes. Random TIGs are generated using the following
parameters: number of vertices (N), average vertex degree ($d_{avg}$), maximum
vertex weight ($w_{max}$) and maximum edge weight ($e_{max}$). In a random graph
$G_{N,p}$ with N vertices, each pair of vertices constitutes an edge with
probability p. Since $G_{N,p}$ has $pC(N, 2)$ edges in expectation, the expected
sum of the degrees of the vertices of $G_{N,p}$ is equal to $2pC(N, 2)$. Then,
the expected average vertex degree of $G_{N,p}$ is
$d_{avg} = 2pC(N, 2)/N = p(N - 1)$. Thus, the parameter p is selected as
$p = d_{avg}/(N - 1)$ to generate a random TIG with N vertices and expected
vertex degree $d_{avg}$. Then, the edge set is created by flipping a coin with
probability p for all $N(N - 1)/2$ potential edges. Each vertex or edge is
weighted randomly by choosing a number between 1 and $w_{max}$ or 1 and
$e_{max}$, respectively. Nine test TIGs are generated with N = 400, 800, 1600,
$d_{avg}$ = 2, 3, 4, $w_{max}$ = 5 and $e_{max}$ = 10 using this random graph
generation algorithm. These test TIGs are mapped to 4 x 4, 4 x 8, 8 x 8 and
8 x 16 two-dimensional meshes.
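The random TIG generation procedure described above can be sketched as follows. The function name `random_tig`, the dictionary representation of the edge set and the choice of integer weights are assumptions of this illustration, since the text does not specify the data structures or whether weights are integral.

```python
import random


def random_tig(n, d_avg, w_max, e_max, rng):
    """Generate a random TIG with expected average vertex degree d_avg.

    Returns (vertex_weights, edges) where edges maps a pair (i, j) with
    i < j to its edge weight. Integer weights are an assumption.
    """
    p = d_avg / (n - 1)  # edge probability giving expected degree d_avg
    vertex_weights = [rng.randint(1, w_max) for _ in range(n)]
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # coin flip for each potential edge
                edges[(i, j)] = rng.randint(1, e_max)
    return vertex_weights, edges


rng = random.Random(0)
vertex_weights, edges = random_tig(400, 3, 5, 10, rng)
average_degree = 2 * len(edges) / 400
```

The realized average degree fluctuates around $d_{avg}$ from instance to instance, because only its expectation is fixed by the choice of p.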
Table 6.2. Percent computational load imbalance averages of the solutions found
by SA, KL, general MFA and mesh-specific MFA for randomly generated mapping
problem instances for various mesh sizes

       Problem Size          Average Percent Imbalance
     TIG          Mesh                         MFA
    N    davg     P x Q       KL      SA     Gen.    Mesh
   400    2       4 x 4       9.1     2.1     8.6     7.8
   400    2       4 x 8      14.5     6.5    11.1     8.3
   400    3       4 x 4      11.4     4.4     8.6     4.5
   400    3       4 x 8      15.5     5.5     9.7     8.3
   400    4       4 x 4      11.9     4.0     5.1     7.9
   400    4       4 x 8      16.1     7.8    12.7     6.3
   800    2       4 x 8      12.0     5.8    16.2     7.8
   800    2       8 x 8      16.7     8.4    12.7     8.7
   800    3       4 x 8      15.6     3.5     8.7     5.2
   800    3       8 x 8      19.7     9.6    16.0     8.2
   800    4       4 x 8      16.5    13.8     7.9    14.2
   800    4       8 x 8      19.0     6.6     6.2     6.9
  1600    2       8 x 8      13.8     9.3    12.7     8.2
  1600    2       8 x 16     21.0     9.4    13.9     7.9
  1600    3       8 x 8      15.3    14.3    16.6    10.3
  1600    3       8 x 16     19.7    10.9    13.0    11.7
  1600    4       8 x 8      15.6     9.4    14.9     8.9
  1600    4       8 x 16     21.9     7.3    11.2     9.4
Table 6.3. Execution time averages of the solutions found by SA, KL, general
MFA and mesh-specific MFA for randomly generated mapping problem instances for
various mesh sizes

       Problem Size          Average Execution Time (sec)
     TIG          Mesh                           MFA
    N    davg     P x Q       KL        SA      Gen.    Mesh
   400    2       4 x 4       1.1      99.4     11.7     2.8
   400    2       4 x 8       1.1      99.4     11.7     2.8
   400    3       4 x 4       0.9      44.0      3.1     0.9
   400    3       4 x 8       1.4      96.4      5.6     1.8
   400    4       4 x 4       1.0      48.8      2.7     1.4
   400    4       4 x 8       1.5      80.0      9.7     3.5
   800    2       4 x 8       1.7     248.9     15.8     5.3
   800    2       8 x 8       3.2     522.8     53.8     6.8
   800    3       4 x 8       2.2     256.0     13.0     4.2
   800    3       8 x 8       4.4     550.2     44.7     8.6
   800    4       4 x 8       2.9     240.2     55.1     8.7
   800    4       8 x 8       5.5     545.7     87.6     9.9
  1600    2       8 x 8       5.4    1983.6    230.6    13.5
  1600    2       8 x 16     15.6   16793.4   1081.5    39.5
  1600    3       8 x 8       8.9    1826.5    157.2    18.2
  1600    3       8 x 16     24.1    4946.0    515.0    40.6
  1600    4       8 x 8      11.3    3095.6    206.2    15.1
  1600    4       8 x 16     51.0    5345.7    495.4    49.9
Table 6.4. Average performance measures of the solutions found by SA, KL,
general MFA and mesh-specific MFA for randomly generated mapping problem
instances
                                            MFA
                      KL       SA       Gen.    Mesh
 Comm. Cost          2.10     1.00      1.13    1.00
 Load Imbalance      2.01     0.91      1.49    1.00
 Execution Time      0.67    93.20      8.17    1.00
Tables 6.1, 6.2 and 6.3 illustrate the performance results of the KL, SA, general
and mesh-specific MFA heuristics for the generated mapping problem instances.
In these tables, "Gen" and "Mesh" denote the general and mesh-specific MFA
formulations, respectively, discussed in this work. Each algorithm is executed 5
times for each problem instance starting from different, randomly chosen initial
configurations. Total communication cost averages of the solutions in Table 6.1
are normalized with respect to the results of the mesh-specific MFA heuristic
developed in this work. Percent computational load imbalance averages of the
solutions displayed in Table 6.2 are computed using
$100 \times (CL_{max} - CL_{min}) / CL_{avg}$.
Here, $CL_{max}$ and $CL_{min}$ denote the maximum and minimum processor loads,
and $CL_{avg}$ denotes the computational load of each processor under perfect load
balance conditions. Execution time averages are measured on a DEC Alpha
workstation in seconds for randomly generated mapping problem instances.
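
As a small illustration of the load imbalance measure defined above, the sketch below computes it for a given task-to-processor mapping. The function name and the list-based data layout are assumptions made for the example, not taken from the thesis implementation.

```python
def percent_load_imbalance(vertex_weights, mapping, num_processors):
    """Compute 100 * (CL_max - CL_min) / CL_avg for a mapping.

    vertex_weights[i] is the computational weight of task i and
    mapping[i] is the index of the processor that task i is assigned to.
    """
    loads = [0] * num_processors
    for task, processor in enumerate(mapping):
        loads[processor] += vertex_weights[task]
    # CL_avg is the per-processor load under perfect load balance.
    cl_avg = sum(vertex_weights) / num_processors
    return 100.0 * (max(loads) - min(loads)) / cl_avg


# Four tasks on two processors: loads are 6 and 2, the ideal load is 4.
imbalance = percent_load_imbalance([3, 1, 2, 2], [0, 0, 0, 1], 2)
```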
Table 6.4 is constructed for a better illustration of the overall relative
performances of the heuristics. Percent load imbalance averages and execution
time averages of the solutions are also normalized with respect to the results of
the mesh-specific MFA heuristic. Then, the overall averages of the normalized
averages of Tables 6.1, 6.2 and 6.3 are displayed in Table 6.4.
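
The normalization and averaging step used to build such a summary table can be sketched as below. The function name and the dictionary layout are assumptions for illustration; the two sample rows are the N = 400, d_avg = 3 execution times from Table 6.3, so the output covers only those two instances and is not an entry of Table 6.4.

```python
def normalized_overall_average(rows, baseline):
    """Normalize each row by its baseline entry, then average per heuristic.

    rows is a list of dicts mapping a heuristic name to its raw average
    for one problem instance.
    """
    heuristics = list(rows[0].keys())
    totals = {h: 0.0 for h in heuristics}
    for row in rows:
        for h in heuristics:
            totals[h] += row[h] / row[baseline]
    return {h: totals[h] / len(rows) for h in heuristics}


# Two rows of Table 6.3 (execution times in seconds).
rows = [
    {"KL": 0.9, "SA": 44.0, "Gen": 3.1, "Mesh": 0.9},
    {"KL": 1.4, "SA": 96.4, "Gen": 5.6, "Mesh": 1.8},
]
summary = normalized_overall_average(rows, baseline="Mesh")
```

Averaging the normalized ratios, rather than normalizing the averaged raw values, keeps large instances from dominating the summary.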
These four tables confirm the expectation that the mesh-specific MFA formulation
is significantly faster (8.17 times on the average) than the general MFA
formulation while producing solutions of considerably better quality for
randomly generated problem instances. As seen in these tables, the mesh-specific
MFA heuristic produces significantly better solutions than the KL heuristic,
whereas it is only slightly slower (1.49 times on the average).
The quality of the solutions obtained by the mesh-specific MFA heuristic is
comparable with that of the SA heuristic. However, the mesh-specific MFA
heuristic is almost two orders of magnitude faster (93.2 times on the average).
Hence, the proposed MFA heuristic approaches the speed of the fast KL
heuristic while approaching the solution quality of the powerful SA heuristic.
Table 6.5. The benchmark sparse matrix data used in the experiments
The test TIGs correspond to the undirected sparse graphs associated with the
symmetric sparse matrices selected from the Harwell-Boeing sparse matrix test
collection [12]. Weights of the vertices are assumed to be equal to their degrees.
These test TIGs are mapped to 8 x 8, 8 x 16 and 16 x 16 2D meshes. The
properties of the test TIGs are shown in Table 6.5.
Tables 6.6, 6.7 and 6.8 illustrate the performance results of the KL, SA, general
and mesh-specific MFA heuristics for the mapping problem instances obtained
from the test TIGs. Each algorithm is executed 5 times for each problem instance
starting from different, randomly chosen initial configurations. Total
communication cost averages of the solutions in Table 6.6 are normalized with
respect to the results of the mesh-specific MFA heuristic developed in this work.
Execution time averages are measured on a SUN SPARC 10 workstation and, in
Table 6.8, are normalized with respect to those of the mesh-specific MFA
heuristic. Table 6.9 is constructed for a better illustration of the overall
relative performances of the heuristics. Percent load imbalance averages of the
solutions are also normalized with respect to the results of the mesh-specific
MFA heuristic. Then, the overall averages of the normalized averages of
Tables 6.6, 6.7 and 6.8 are displayed in Table 6.9. Tables 6.6, 6.7, 6.8 and 6.9
confirm the expectation that the mesh-specific MFA formulation is significantly
faster (7.26 times on the average) than the general MFA formulation while
producing solutions of considerably better quality for the test TIGs. As seen in
these tables, the mesh-specific MFA heuristic produces significantly better
solutions than the KL heuristic, whereas it is slightly slower. The quality of
the solutions obtained by the mesh-specific MFA heuristic is comparable with
that of the SA heuristic. However, the mesh-specific MFA heuristic is faster
Table 6.6. Total communication cost averages, normalized with respect
to mesh-specific MFA, of the solutions found by SA, KL, general MFA and
mesh-specific MFA for some benchmark mapping problem instances for various
mesh sizes
Communication Cost
Circuit Par MFA SA GenMFA KL
16 1.00 0.82 1.39 0.95
32 1.00 1.11 1.89 1.61
DWT-492 64 1.00 0.97 1.74 1.98
128 1.00 1.13 2.52 2.33
256 1.00 1.10 2.62 1.90
16 1.00 0.83 1.48 0.74
32 1.00 0.95 1.98 1.17
DWT-758 64 1.00 0.95 2.02 1.79
128 1.00 1.10 2.75 2.85
256 1.00 1.38 4.03 3.34
16 1.00 0.85 1.18 0.99
32 1.00 0.95 1.71 1.25
DWT-1242 64 1.00 1.00 2.01 1.42
128 1.00 1.05 2.62 2.53
256 1.00 1.08 2.94 2.91
16 1.00 0.89 1.12 0.89
32 1.00 0.93 1.30 0.99
JAGMESH2 64 1.00 0.90 2.04 1.91
128 1.00 1.11 3.35 3.06
256 1.00 1.19 3.73 3.44
16 1.00 0.56 0.92 0.69
32 1.00 0.87 1.43 1.14
JAGMESH6 64 1.00 0.91 1.78 1.23
128 1.00 1.13 3.59 2.48
256 1.00 1.08 3.82 3.43
16 1.00 0.78 1.12 0.83
32 1.00 0.86 1.26 1.21
JAGMESH7 64 1.00 0.95 1.89 1.40
128 1.00 1.06 3.25 2.74
256 1.00 1.20 3.77 3.48
16 1.00 0.67 2.14 1.47
32 1.00 0.98 3.25 2.33
BCSPWR06 64 1.00 0.93 2.80 2.18
128 1.00 1.12 3.35 2.90
256 1.00 1.23 3.45 3.80
16 1.00 0.51 1.36 1.11
32 1.00 0.89 2.74 1.88
BCSPWR09 64 1.00 0.90 2.43 1.87
128 1.00 1.01 3.13 2.33
256 1.00 1.80 5.06 4.75
16 1.00 0.84 1.02 1.09
32 1.00 0.89 1.29 1.31
LSHP2233 64 1.00 0.81 1.88 1.37
128 1.00 0.97 3.63 2.20
256 1.00 1.12 2.68 3.31
16 1.00 0.65 1.05 0.37
32 1.00 0.66 1.23 0.43
LSHP3466 64 1.00 0.68 1.91 0.52
128 1.00 0.68 3.48 0.68
256 1.00 0.87 2.10 1.07
Table 6.7. Percent load imbalance averages of the solutions found by SA, KL,
general MFA and mesh-specific MFA for some benchmark mapping problem
instances for various mesh sizes
Load Imbalance
Circuit Par MFA SA GenMFA KL
16 2.41 2.41 4.34 5.42
32 3.01 3.61 7.47 7.35
DWT-492 64 6.10 7.32 8.54 9.76
128 11.00 15.00 15.50 17.00
256 19.00 35.00 26.00 28.00
16 1.62 0.92 3.79 6.45
32 2.45 2.15 5.52 9.45
DWT-758 64 4.20 5.25 5.68 9.38
128 7.75 14.37 9.25 12.25
256 9.00 26.25 15.00 16.50
16 1.13 0.57 3.55 7.86
32 1.60 1.48 4.60 8.08
DWT-1242 64 2.66 3.85 6.22 8.88
128 5.35 5.28 8.17 12.11
256 9.43 11.43 10.29 16.29
16 1.58 0.82 2.51 4.29
32 0.87 1.64 3.55 5.96
JAGMESH2 64 2.64 4.12 5.60 8.13
128 2.89 6.67 5.56 10.89
256 6.82 15.91 12.73 18.18
16 1.03 0.84 3.95 4.41
32 1.60 0.84 8.32 6.34
JAGMESH6 64 2.10 2.52 7.06 7.39
128 2.54 4.24 5.25 12.03
256 7.93 12.07 10.69 13.45
16 1.29 0.82 2.89 4.64
32 1.68 1.27 5.18 6.60
JAGMESH7 64 2.86 2.81 6.33 8.06
128 4.49 7.65 5.92 11.02
256 9.17 18.75 12.50 13.75
16 1.13 0.31 2.92 4.05
32 2.67 0.63 5.42 5.50
BCSPWR06 64 3.33 0.83 8.00 10.54
128 5.00 1.67 7.67 12.43
256 8.00 5.00 11.33 17.22
16 1.84 0.33 2.31 4.05
32 2.55 0.67 5.44 5.50
BCSPWR09 64 4.19 1.35 7.97 10.54
128 4.05 2.70 10.81 12.43
256 4.05 5.56 18.89 17.22
16 0.88 0.31 1.31 5.16
32 1.52 0.98 2.44 6.36
LSHP2233 64 2.30 1.23 5.39 8.04
128 2.45 2.94 3.92 9.41
256 3.73 7.84 12.07 10.20
16 0.51 0.31 1.21 4.05
32 2.02 0.98 1.87 5.50
LSHP3466 64 1.50 1.23 4.48 10.54
128 1.51 2.94 4.47 12.43
256 4.18 7.84 12.07 17.22
Table 6.8. Total execution time averages, normalized with respect to
mesh-specific MFA, of the solutions found by SA, KL, general MFA and
mesh-specific MFA for some benchmark mapping problem instances for various
mesh sizes
Execution Time
Circuit Par MFA SA GenMFA KL
16 1.00 54.70 3.09 0.24
32 1.00 16.73 2.78 0.12
DWT-492 64 1.00 17.56 4.27 0.29
128 1.00 4.64 1.70 0.33
256 1.00 3.91 2.45 2.28
16 1.00 63.29 2.48 0.19
32 1.00 24.00 2.17 0.11
DWT-758 64 1.00 15.98 3.34 0.15
128 1.00 5.70 1.63 0.23
256 1.00 5.39 2.65 1.69
16 1.00 89.19 6.10 0.18
32 1.00 27.50 5.01 0.08
DWT-1242 64 1.00 25.33 7.74 0.13
128 1.00 8.72 2.67 0.19
256 1.00 7.02 3.79 0.75
16 1.00 61.11 8.62 0.12
32 1.00 24.16 7.69 0.08
JAGMESH2 64 1.00 16.43 10.81 0.11
128 1.00 8.53 4.14 0.24
256 1.00 8.21 5.27 1.16
16 1.00 112.12 10.72 0.16
32 1.00 45.16 11.93 0.09
JAGMESH6 64 1.00 30.02 15.45 0.13
128 1.00 13.01 6.60 0.18
256 1.00 10.98 6.25 0.81
16 1.00 78.00 7.75 0.15
32 1.00 32.29 10.98 0.09
JAGMESH7 64 1.00 26.58 19.41 0.14
128 1.00 11.01 4.22 0.20
256 1.00 9.58 6.77 1.10
16 1.00 213.22 2.14 0.30
32 1.00 66.53 1.74 0.13
BCSPWR06 64 1.00 55.05 4.01 0.20
128 1.00 18.43 4.80 0.26
256 1.00 14.24 5.88 0.87
16 1.00 261.90 3.54 0.24
32 1.00 76.14 3.81 0.10
BCSPWR09 64 1.00 59.62 8.27 0.15
128 1.00 23.50 6.56 0.20
256 1.00 32.09 14.88 1.30
16 1.00 104.60 7.72 0.09
32 1.00 44.17 10.05 0.06
LSHP2233 64 1.00 34.47 17.28 0.09
128 1.00 17.48 7.22 0.13
256 1.00 13.95 2.19 0.57
16 1.00 53.11 11.11 0.03
32 1.00 22.63 12.44 0.02
LSHP3466 64 1.00 15.81 13.36 0.02
128 1.00 8.53 11.62 0.04
256 1.00 8.48 2.19 0.20
Table 6.9. Average performance measures of the solutions found by SA, KL,
general MFA and mesh-specific MFA for mapping problem instances.
                                             MFA
                         KL       SA      Gen.    Mesh
 Communication Cost     2.55     1.08     2.94    1.00
 Load Imbalance         2.34     1.5      1.85    1.00
 Execution Time         0.5     19.7      7.26    1.00
Table 6.10. Total communication cost averages, normalized with respect to
hypercube-specific MFA, of the solutions found by SA, KL, general MFA and
hypercube-specific MFA for randomly generated mapping problem instances
for various hypercube sizes
Problem Size Average Communication Cost
TIG Hypercube MFA
N d_avg K KL SA Gen. Hypercube
3 8 1.41 0.96 1.12 1.00
3 16 2.45 1.02 0.69 1.00
400 4 16 2.43 1.32 1.74 1.00
4 32 1.48 1.21 1.25 1.00
8 32 1.35 1.18 1.25 1.00
8 64 1.25 1.18 1.08 1.00
3 8 1.39 0.87 1.23 1.00
3 16 1.47 1.34 1.30 1.00
800 4 16 1.73 1.13 1.26 1.00
4 32 1.83 0.88 0.93 1.00
8 32 1.55 0.99 1.16 1.00
8 64 1.42 1.03 1.13 1.00
3 8 1.37 0.92 0.84 1.00
3 16 0.98 0.74 0.88 1.00
1600 4 16 0.86 0.74 1.14 1.00
4 32 1.56 0.87 1.26 1.00
8 32 1.26 0.98 1.00
8 64 1.68 1.14 1.36 1.00
(19.7 times on the average). Hence, the proposed MFA heuristic approaches 
the speed performance of the fast KL heuristic while approaching the solution 
quality of the powerful SA heuristic.
Tables 6.10, 6.11 and 6.12 illustrate the performance results of the KL, SA,
general and hypercube-specific MFA heuristics for the generated mapping
problem instances. In these tables, "Gen" and "Hypercube" denote the general and
hypercube-specific MFA formulations, respectively. Each algorithm is executed
10 times for each problem instance starting from different, randomly
chosen initial configurations. Total communication cost averages of the
solutions in Table 6.10 are normalized with respect to the results of the
hypercube-specific MFA heuristic developed in this work. Percent computational
load imbalance averages of the solutions displayed in Table 6.11 are computed using
Table 6.11. Percent computational load imbalance averages of the solutions
found by SA, KL, general MFA and hypercube-specific MFA for randomly
generated mapping problem instances for various hypercube sizes
      Problem Size            Average Percent Imbalance
      TIG      Hypercube                          MFA
    N   d_avg      K         KL       SA      Gen.   Hypercube
  400     3        8        12.22     7.50     9.17     2.78
          3       16        15.56     8.33    18.46     6.67
          4       16        14.44     9.33    16.43    10.05
          4       32        21.43    15.29    23.33    23.81
          8       32        15.48    12.60    30.71     8.33
          8       64        23.81    21.15    24.29    21.49
  800     3        8        10.28     2.50     9.17     6.39
          3       16        13.89     5.50    13.33     6.75
          4       16        15.05     5.65     9.32     3.06
          4       32        20.15    10.33    15.80    11.11
          8       32        18.89     5.50    17.60    13.60
          8       64        22.22    13.14    20.65    19.05
 1600     3        8         8.20     2.02     4.85     3.63
          3       16        11.83     3.66     9.95     5.65
          4       16        12.82     3.82     6.97     3.79
          4       32        16.67     6.91    11.29     8.60
          8       32        15.87     7.68    12.58     8.58
          8       64        25.56     7.11    15.33     9.88
Table 6.12. Execution time averages of the solutions found by SA, KL, general
MFA and hypercube-specific MFA for randomly generated mapping problem
instances for various hypercube sizes
      Problem Size            Average Execution Time (sec)
      TIG      Hypercube                           MFA
    N   d_avg      K         KL        SA      Gen.   Hypercube
  400     3        8         0.77     41.27     8.55     0.81
          3       16         1.13     64.57    18.75     2.35
          4       16         1.23     62.49     7.41     1.97
          4       32         2.17    106.25    10.48     6.77
          8       32         1.52     79.87     6.18     3.00
          8       64         2.58    124.63     8.58     4.63
  800     3        8         1.26    123.65     7.78     1.49
          3       16         1.91    147.90    15.07     3.99
          4       16         2.15    156.51     7.53     3.20
          4       32         2.95    252.31    15.65     7.19
          8       32         4.37    410.88    15.85     5.45
          8       64        13.62    707.90    44.46    13.26
 1600     3        8         2.42    209.69    22.64     2.64
          3       16         0.31    329.72    29.66     7.06
          4       16         3.69    432.32     9.96     5.29
          4       32         5.68    712.89    47.81    17.42
          8       32         8.59    749.02    96.08    14.84
          8       64        16.59   2462.81   241.73    45.38
$100 \times (CL_{max} - CL_{min}) / CL_{avg}$. Here, $CL_{max}$ and $CL_{min}$ denote the
maximum and minimum processor loads, and $CL_{avg}$ denotes the computational load
of each processor under perfect load balance conditions. Execution time averages
are measured on a DEC Alpha workstation in seconds for randomly generated
mapping problem instances.
Chapter 7
CONCLUSION
In this thesis, we have addressed two combinatorial optimization problems, the
global routing problem in the design automation of FPGAs and the domain mapping
problem in parallel processing, by using the Mean Field Annealing (MFA) method.
First, a static RAM based Field Programmable Gate Array (FPGA) is modeled
as a 2-dimensional mesh graph. Then, we have proposed an order-independent
global routing algorithm for FPGAs based on Mean Field Annealing.
The performance of the proposed global routing algorithm is evaluated in
comparison with the LocusRoute global router for ACM/SIGDA benchmark
circuits. Initial experimental results indicate that the proposed MFA heuristic
performs better than LocusRoute.
We proposed an encoding scheme to apply MFA to the global routing problem
for FPGAs. Our aim is to minimize the energy function of our spin (particle)
system. This corresponds to minimizing our objective function, that is,
finding the most uniform distribution of the routes of the nets (balanced
routing). We expected that a more uniform distribution of routes would let the
subsequent detailed routing perform well (a decrease in the total number of
segments used, in the channel width, and in the average delay of the nets).
Experimental results confirm this expectation: the MFA algorithm finds a more
uniformly distributed routing than the LocusRoute algorithm. Therefore, the
performance of the detailed routing for 100% routing is better with
MFA than with LocusRoute for many benchmark circuits.
We encountered some difficulties in the MFA formulation. In this formulation,
Potts spins have different numbers of states for the first time. In previous MFA
formulations for various combinatorial optimization problems, all Potts spins
have the same number of states, so the effect of the spin values on the problem
is the same for every spin. Here, since the Potts spin vectors have different
dimensions, the effects of the spins on the problem differ. This may cause
problems; therefore, a normalization function that keeps the effects of the
spins equal has to be found.
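
The issue can be illustrated with the standard mean field (softmax) update for a single Potts spin. The sketch below is only illustrative: the function name is hypothetical, and treating the mean field values as costs (lower is preferred) is an assumed sign convention, not necessarily the one used in the formulation of this thesis.

```python
import math


def mean_field_update(phi, temperature):
    """Return the Potts spin vector for mean field values phi at temperature T.

    Assumption: phi[k] is the cost of putting the spin into state k, so
    states with lower phi receive larger spin values.
    """
    lowest = min(phi)  # shift for numerical stability; the result is unchanged
    weights = [math.exp(-(value - lowest) / temperature) for value in phi]
    total = sum(weights)
    return [w / total for w in weights]


# With equal mean field values, every component equals 1 / (number of states),
# so a 2-state spin and a 5-state spin contribute differently sized values.
two_state_spin = mean_field_update([0.0, 0.0], temperature=1.0)
five_state_spin = mean_field_update([0.0] * 5, temperature=1.0)
```

Each spin vector sums to one regardless of its dimension, which is why the individual components of spins with many states are smaller than those of spins with few states.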
Also, if a better cooling schedule can be found, then better results than those
reported here may be obtained. In particular, the critical temperature is very
important: if the system is initialized at too low a temperature, then MFA
settles in a local minimum instead of the global minimum.
In the second part of this thesis, we have proposed an efficient mapping
heuristic for mesh- and hypercube-connected parallel architectures based on
Mean Field Annealing (MFA). We have also developed an efficient implementation
scheme for the proposed mapping formulation. The proposed MFA
scheme asymptotically reduces the complexity of a single MFA iteration from
$\Theta(d_{avg} PQ + (PQ)^2)$ for the general MFA formulation to
$\Theta(d_{avg}(P+Q) + PQ)$ for a $P$ by $Q$ mesh. For a square mesh with $K$
processors, this corresponds to an asymptotic complexity reduction from
$\Theta(d_{avg} K + K^2)$ to $\Theta(d_{avg}\sqrt{K} + K)$.
For hypercube architectures, the complexity of one MFA iteration is
$O(d_{avg} \log K + K \log K)$ instead of $O(d_{avg} K + K^2)$ in the traditional
MFA algorithm.
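
To give a feel for the size of this reduction, the following sketch evaluates the leading terms of the expressions above for one configuration. Constant factors are ignored, the function names are hypothetical, and the base-2 logarithm in the hypercube term is an assumption, so the numbers indicate relative growth only and are not operation counts from the thesis.

```python
import math


def general_cost(d_avg, p, q):
    """Leading terms of one general MFA iteration: d_avg*PQ + (PQ)^2."""
    return d_avg * p * q + (p * q) ** 2


def mesh_cost(d_avg, p, q):
    """Leading terms of one mesh-specific MFA iteration: d_avg*(P+Q) + PQ."""
    return d_avg * (p + q) + p * q


def hypercube_cost(d_avg, k):
    """Leading terms of one hypercube-specific iteration: (d_avg + K) * log K."""
    return d_avg * math.log2(k) + k * math.log2(k)


# An 8 x 8 mesh (K = 64 processors) with d_avg = 3.
general = general_cost(3, 8, 8)    # 3*64 + 64^2
mesh = mesh_cost(3, 8, 8)          # 3*16 + 64
hypercube = hypercube_cost(3, 64)  # 3*6 + 64*6
reduction = general / mesh
```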
The performance of the proposed mapping heuristic is evaluated in comparison
with the well-known Kernighan-Lin (KL) and Simulated Annealing (SA) heuristics
and the general MFA formulation for a number of randomly generated mapping
problem instances and for instances from the Harwell-Boeing sparse matrix test
collection. The proposed topology-specific MFA formulation is found to be
significantly faster than the general MFA formulation, as expected. The proposed
MFA heuristic is slightly slower than the fast KL heuristic. However, it always
produces significantly better solutions than the KL heuristic. The quality of the
solutions obtained by the proposed MFA heuristic is comparable to that of the
powerful SA heuristic. However, the proposed MFA heuristic is orders of magnitude
faster than the SA heuristic. With a good cooling schedule and initial
temperature parameter, better results may be obtained. We conclude
that, for the mapping problem, MFA lies between KL and SA in the trade-off
between execution time and solution quality.
Bibliography
[1] Fundamental of Placement and Routing. Xilinx Company, San Jose,
California, 1990.
[2] The Programmable Gate Array Data Book. Xilinx Company, San Jose,
California, 1992.
[3] S. Brown, B. Tseng, and J. Rose. Using architectural and CAD interactions
to improve FPGA routing architecture. In First International Workshop on
Field Programmable Gate Arrays, pages 2-7. ACM, 1992.
[4] S. H. Bokhari. On the mapping problem. IEEE Transactions on
Computers, 30(3):207-214, 1981.
[5] T. Bultan. Parallel mapping and circuit partitioning heuristic on mean
field annealing. PhD thesis.
[6] T. Bultan and C. Aykanat. A new mapping heuristic based on mean field
annealing. Journal of Parallel and Distributed Computing, 16:292-305,
1992.
[7] F. Ercal, C. Aykanat, F. Ozguner, and P. Sadayappan. Iterative algorithms
for solution of large sparse systems of linear equations on hypercubes.
IEEE Transactions on Computers, 37:1554-1567, 1988.
[8] D. E. Van den Bout and T. K. Miller. Improving the performance of the
Hopfield-Tank neural network through normalization and annealing.
Biological Cybernetics, 62:129-139, 1989.
[9] D. E. Van den Bout and T. K. Miller. Graph partitioning using annealing
neural networks. IEEE Transactions on Neural Networks, 1(2):192-203,
1990.
[10] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for
improving network partitions. In Proceedings of the 19th ACM/IEEE Design
Automation Conference, pages 175-181, 1982.
[11] R. Francis, J. Rose, and Z. Vranesic. Chortle-crt: Fast technology
mapping for lookup table-based FPGAs. In Proceedings of the 28th
ACM/IEEE Design Automation Conference, pages 227-233, 1991.
[12] J. Lewis, I. Duff, and R. Grimes. Sparse matrix test problems. ACM
Transactions on Mathematical Software, 15(1):1-14, March 1989.
[13] B. Indurkhya and H. S. Stone. Optimal partitioning of randomly
generated distributed programs. IEEE Transactions on Software Engineering,
12(3):453-495, 1986.
[14] S. Kaptanoglu, J. Greene, V. Roychowdhury, and A. El Gamal. Segmented
channel routing. In International Conference on Computer Aided Design,
pages 567-572. IEEE, 1990.
[15] F. Ercal, P. Sadayappan, and J. Ramanujam. Task allocation by simulated
annealing. In Proceedings of International Conference on Supercomputing,
pages 475-497, Boston, MA, May 1988.
[16] A. El Gamal, J. Rose, and A. Sangiovanni-Vincentelli. Architecture of
field-programmable gate-array. Proceedings of IEEE, 81(7):1013-1029,
July 1993.
[17] B. W. Kernighan and S. Lin. An efficient heuristic procedure for
partitioning graphs. The Bell System Technical Journal, 49(2):291-307,
February 1970.
[18] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. Optimization by
simulated annealing. Science, 220(4598):671-680, May 1983.
[19] T. Lengauer. Combinatorial Algorithms for Integrated Circuit Layout.
John Wiley and Sons, Inc., Chichester, West Sussex, England, 1990.
[20] S. Brown, M. Khellah, and Z. Vranesic. Minimizing interconnection delays
in array-based FPGAs. In Proceedings of Canadian Conference on VLSI,
1994.
[21] F. Ercal, P. Sadayappan, and J. Ramanujam. Cluster partitioning
approaches to mapping parallel programs onto hypercube. Parallel Computing,
13:1-16, 1990.
[22] C. Peterson and B. Soderberg. A new method for mapping optimization
problems onto neural networks. International Journal of Neural Systems,
3(1):3-22, 1989.
[23] B. Fallah and J. Rose. Timing-driven routing segment assignment in FPGAs.
In Proceedings of Canadian Conference on VLSI, pages 1-7, 1992.
[24] J. Rose. Parallel global routing for standard cells. IEEE Transactions on
Computer-Aided Design, 9(10):1085-1095, October 1990.
[25] Z. Vranesic, S. Brown, and J. Rose. A detailed router for
field-programmable gate arrays. In International Conference on Computer
Aided Design, pages 382-385. IEEE, 1990.
[26] P. Sadayappan and F. Ercal. Nearest-neighbour mapping of finite element
graphs onto processor meshes. IEEE Transactions on Computers,
36(12):1408-1424, 1987.
[27] N. Sherwani. Algorithms for VLSI Physical Design Automation. Kluwer
Academic Publishers, 1993.
[28] J. Shield. Partitioning concurrent VLSI simulation programs onto a
multiprocessor by simulated annealing. IEE Proceedings Part-G, 134(1):24-28,
1987.
[29] B. A. Hendrickson, W. Camp, S. J. Plimpton, and R. W. Leland. Massively
parallel methods for engineering and science problems. Communications
of the ACM, 37(4):31-41, April 1994.
