A new mapping heuristic based on mean field annealing by Bultan, T. & Aykanat, C.
A NEW MAPPING
HEURISTIC BASED ON
MEAN FIELD ANNEALING
1
Tevk Bultan and Cevdet Aykanat
Department of Computer Engineering and Information Science,
Bilkent University, 06533 Bilkent, Ankara, Turkey
1
This work is partially supported by Intel Supercomputer Systems Division under Grant SSD100791-2 and
Turkish Science and Research Council under Grant EEEAG-5.
1
A NEW MAPPING HEURISTIC BASED ON MFA
Abstract
A new mapping heuristic is developed, based on the recently proposed Mean Field
Annealing (MFA) algorithm. An ecient implementation scheme, which decreases
the complexity of the proposed algorithm by asymptotical factors, is also given. Per-
formance of the proposed MFA algorithm is evaluated in comparison with two well-
known heuristics; Simulated Annealing and Kernighan-Lin. Results of the exper-
iments indicate that MFA can be used as an alternative heuristic for solving the
mapping problem. Inherent parallelism of MFA is exploited by designing an ecient
parallel algorithm for the proposed MFA heuristic.
2
1 Introduction
Today, with the aid of VLSI technology, parallel computers not only exist in research labora-
tories, but are also available on the market as powerful, general purpose computers. Wide use
of parallel computers in various compute intensive applications makes the problem of mapping
parallel programs to parallel computers more crucial. The mapping problem arises while devel-
oping parallel programs for distributed-memory, message-passing parallel computers which are
usually named as multicomputers. In multicomputers, processors have neither shared memory
nor shared address space. Each processor can only access its local memory. Synchronization
and coordination among processors are achieved through explicit message passing. Processors
of a multicomputer are usually connected by utilizing one of the well-known direct intercon-
nection network topologies such as ring, mesh, hypercube, etc. These architectures have the
nice scalability feature due to the lack of shared resources and the increasing communication
bandwidth with the increasing number of processors. However, designing ecient parallel al-
gorithms for such architectures is not straightforward. An ecient parallel algorithm should
exploit the full potential power of the architecture. Processor idle time and the interprocessor
communication overhead may lead to poor utilization of the architecture, hence poor overall
system performance.
Parallel algorithm design for multicomputers can be divided into two steps. First step is
the decomposition of the problem into a set of interacting sequential sub-problems (or tasks)
which can be executed in parallel. Second step is mapping each one of these tasks to an
individual processor of the parallel architecture in such a way that the total execution time is
minimized. The second step, named as the mapping problem [4], is very crucial in designing
ecient parallel programs. In general, the mapping problem is known to be NP-hard [12, 13].
Hence, heuristics giving sub-optimal solutions are used to solve the problem [1, 4, 7, 12, 13,
21]. Two distinct approaches have been considered in the context of mapping heuristics; one
phase and two phase [7]. In one phase approaches, referred as many-to-one mapping, tasks of
the parallel program are directly mapped onto the processors of the multicomputer. In two
phase approaches, clustering phase is followed by one-to-one mapping phase. In the clustering
phase, tasks of the parallel program are partitioned into as many equal weighted clusters as the
number of processors of the multicomputer, while minimizing the total weight of the interactions
3
among clusters [21]. The problem solved in the clustering phase is identical to the multi-way
graph partitioning problem. In the one-to-one mapping phase, each cluster is assigned to an
individual processor of the multicomputer such that the total inter-processor communication is
minimized [21]. Kernighan-Lin (KL) [8, 14] and Simulated Annealing (SA) [15] heuristics are
two attractive algorithms widely used for solving the mapping problem [7, 19, 21, 22].
Heuristics proposed to solve the mapping problem are compute intensive. Solving the mapping
problem can be considered as a preprocessing performed before the execution of the parallel
program on the parallel computer. Sequential execution of the mapping heuristic may introduce
unacceptable preprocessing overhead, limiting the eciency of the parallel implementation.
Ecient parallel mapping heuristics are needed in such cases. The KL and SA heuristics are
inherently sequential, hence hard to parallelize. Ecient parallelizations of these algorithms
remain as important issues in parallel processing research.
In this work, a recently proposed algorithm, called Mean Field Annealing (MFA) [18, 24, 25] is
formulated for the many-to-one mapping problem. MFA combines the collective computation
property of Hopeld Neural Networks (HNN) with the annealing notion of SA. It is originally
proposed for solving traveling salesperson problem, as a working alternative to HNN [23].
MFA is also a general strategy as SA, and can be applied to dierent problems with suitable
formulations. Previous works on MFA [5, 6, 17, 18, 24, 25] show that it can be successfully
applied to various combinatorial optimization problems. MFA has the inherent parallelism that
exists in most of the neural network algorithms.
Section 2 presents a formal denition of the mapping problem by modeling parallel program
design process. In Section 3, general formulation of the MFA heuristic is presented. Section 4
presents the proposed formulation of the MFA algorithm for the mapping problem. An ecient
implementation scheme for the proposed algorithm is also described in this section. Section 5
presents the performance evaluation of the MFA algorithm for the mapping problem in com-
parison with two well known mapping heuristics; SA and KL. Finally, ecient parallelization
of the MFA algorithm for the mapping problem is proposed in Section 6.
4
2 The Mapping Problem
In various classes of problems, interaction pattern among the tasks is static. Hence, the de-
composition of the algorithm can be represented by a static task graph. Vertices of this graph
represent the atomic tasks and the edge set represents the interaction pattern among the tasks.
Relative computational costs of atomic tasks can be known or estimated prior to the execution
of the parallel program. Hence, weights can be associated with the vertices in order to denote
the computational costs of the corresponding tasks.
Two dierent models, Task Precedence Graph (TPG) and Task Interaction Graph (TIG), are
used for modeling static task interaction patterns[13, 20]. TPG is a directed graph where
directed edges represent execution dependencies. Each edge denotes a pair of tasks; source and
destination. The destination task can only be executed after the completion of the execution
of the source task. In general, only the subsets of tasks which are unreachable from each other
in TPG can be executed independently.
In the TIG model, interaction patterns are represented by undirected edges between vertices.
In this model, each atomic task can be executed simultaneously and independently. Each edge
denotes the need for the bidirectional interaction between corresponding pair of tasks at the
completion of the execution of these tasks. Edges may be associated with weights which denote
the amount of bidirectional information exchange involved between pairs of tasks. TIG usually
represents the repeated execution of the tasks with intervening task interactions denoted by
the edges.
The TIG model may seem to be unrealistic for general applications since it does not consider
the temporal interaction dependencies among the tasks [20]. However, there are various classes
of problems which can be successfully modeled with the TIG model. For example, iterative
solution of systems of equations arising in nite element applications [2, 20] and power system
simulations [3, 16], and VLSI simulation programs [22] are represented by TIGs. In this paper,
problems which can be represented by the TIG model are addressed.
In order to solve the mapping problem, parallel architecture must also be modeled in a way
that represents its architectural features. Parallel architectures can easily be represented by a
5
Processor Organization Graph (POG), where nodes represent the processors and edges repre-
sent the communication links. In fact, POG is a graphical representation of the interconnection
topology utilized for the organization of the processors of the parallel architecture. In general,
nodes and edges of a POG are not associated with weights since most of the commercially
available multicomputer architectures are homogeneous with identical processors and commu-
nication links.
In a multicomputer architecture, each adjacent pair of processors communicate with each other
over the communication link connecting them. Such communications are referred as single-hop
communications. However, each non-adjacent pair of processors can also communicate with
each other bymeans of software or hardware routing. Such communications are referred asmulti-
hop communications. Multi-hop communications are usually routed in a static manner over the
shortest path of links between the communicating pairs of processors. Communications between
non-adjacent pairs of processors can be associated with relative unit communication costs.
Unit communication cost is dened as the communication cost per unit of information. Unit
communication cost between a pair of processors will be a function of the shortest path between
these processors and the routing scheme used for multi-hop communications. For example, in
software routing, the unit communication cost is linearly proportional to the shortest path
distance between the pair of communicating processors. Hence, the communication topology of
the multicomputer can be modeled by an undirected complete graph, referred here as Processor
Communication Graph (PCG). The nodes of PCG represent the processors and the weights
associated with the edges represent the unit communication costs between pairs of processors.
As is mentioned earlier, PCG can easily be constructed using the topological properties of POG
and the routing scheme utilized for inter-processor communication.
The objective in mapping TIG to PCG is the minimization of the expected execution time of
the parallel program on the target architecture. Thus, the mapping problem can be modeled as
an optimization problem by associating the following quality measures with a good mapping :
(i) interprocessor communication overhead should be minimized, (ii) computational load should
be uniformly distributed among processors in order to minimize processor idle time.
A mapping problem instance can be formally represented with two undirected graphs, Task
Interaction Graph (TIG) and Processor Communication Graph (PCG). The TIG G
T
(V;E),
6
has jV j = N vertices labeled as (1; 2; : : : ; i; j; : : : ; N). Vertices of the G
T
represent the atomic
tasks of the parallel program. Vertex weight w
i
denotes the computational cost associated with
task i for 1  i  N . Edge weight e
ij
denotes the volume of interaction between tasks i and j
connected by edge (i; j) 2 E. The PCG G
P
(P;D), is a complete graph with jP j = K nodes and
jDj = (
K
2
) edges. Nodes of the G
P
, labeled as (1; 2; : : : ; p; q; : : : ;K), represent the processors
of the target multicomputer. Edge weight d
pq
, for 1  p; q  N and p 6= q, denotes the unit
communication cost between processors p and q.
Given an instance of the mapping problem with the TIG G
T
(V;E) and the PCG G
P
(P;D),
the question is to nd a many-to-one mapping function M : V ! P , which assigns each vertex
of the graph G
T
to a unique node of the graph G
P
, and minimizes the total interprocessor
communication cost (CC)
CC =
X
(i;j)2E;M(i)6=M(j)
e
ij
d
M(i)M(j)
(1)
while maintaining the computational load (CL
p
: computational load of processors p)
CL
p
=
X
i2V;M(i)=p
w
i
; 1  p  K (2)
of each processor balanced. Here,M(i) = p denotes the label (p) of the the processor that task i
is mapped to. In Eq. (1), each edge (i; j) of theG
T
contributes to the communication cost (CC),
only if vertices i and j are mapped to two dierent nodes of the G
P
, i.e. M(i) 6= M(j). The
amount of contribution is equal to the product of the volume of interaction e
ij
between these
two tasks and the unit communication cost d
pq
between processors p and q where p = M(i) and
q =M(j). The computational load of a processor is the summation of the weights of the tasks
assigned to that processor. Perfect load balance is achieved if CL
p
= (
P
N
i=1
w
i
)=K for each p,
1  p  K. Computational load balance of the processors can be explicitly included in the cost
function using a term which is minimized when all processor loads are equal. Another scheme
is to include load balance criteria implicitly in the algorithm. Figure 1 illustrates a sample
mapping problem instance. Figure 1(a) shows a TIG with N = 8 tasks. Figure 1(b) shows POG
of a 2-dimensional hypercube with K = 4 processors, and Figure 1(c) shows the corresponding
PCG. In Figure 1, numbers inside the circles denote the vertex labels, and numbers within the
parenthesis denote the vertex or edge weights. Binary labeling of the 2-dimensional hypercube
is also given in Figure 1(b). Note that, unit communication cost assignment to edges of PCG
7
1 2
3
6 7
54
8
4
1 2
3 4
1 2
3
(1) (2) (1)
(2)
(2) (1)
(2)
(1)
(2)
(1)
(1)
(3)(2)
(2)
(1)
(3)
(1)
(1) (1)
(1)
(2)
(2)
10 11
00 01
(a)
(b)
(c)
Figure 1: A mapping problem instance, with (a) TIG, (b) POG (which represents a 2-
dimensional hypercube) and (c) PCG.
is performed assuming software routing protocol for multi-hop communications. A solution to
the mapping problem instance shown in Figure 1 is
i 1 2 3 4 5 6 7 8
M(i) 4 4 2 1 3 2 1 3
Communication cost of this solution can be calculated as CC = 8. Computational loads of
the processors are CL
p
= 3 for 1  p  4. Hence, perfect load balance is achieved, since
(
P
8
i=1
w
i
)=4 = 3.
3 Mean Field Annealing
Mean Field Annealing (MFA) merges collective computation and annealing properties of Hop-
eld Neural Networks (HNN) [9, 10, 11] and Simulated Annealing (SA) [15], respectively, to
obtain a general algorithm for solving combinatorial optimization problems. HNN is used for
8
solving various optimization problems and reasonable results are obtained for small size prob-
lems [9]. However, simulations of this network reveals the fact that it is hard to obtain feasible
solutions for large problem sizes. Hence, the algorithm does not have a good scaling property,
which is a very important performance criterion for heuristic optimization algorithms. MFA is
proposed as a successful alternative to HNN [18, 23, 24, 25]. In the MFA algorithm, problem
representation is identical to HNN [9, 23, 24], but iterative scheme used to relax the system
is dierent. MFA can be used for solving a combinatorial optimization problem by choosing a
representation scheme in which the nal states of the spins can be decoded as a solution to the
target problem. Then, an energy function is constructed whose global minimum value corre-
sponds to the best solution of the problem to be solved. MFA is expected to compute the best
solution to the target problem, starting from a randomly chosen initial state, by minimizing
this energy function.
The MFA algorithm is derived by making an analogy to Ising spin model which is used to
estimate the state of a system of particles or spins in thermal equilibrium. This method was
rst proposed for solving the traveling salesperson problem [23] and then it is applied to the
graph partitioning problem [5, 6, 17, 25]. Here, general formulation of the MFA algorithm [25]
is given for the sake of completeness. In the Ising spin model, the energy of a system with S
spins has the following form:
H(s) =
1
2
S
X
k=1
X
l6=k

kl
s
k
s
l
+
S
X
k=1
h
k
s
k
(3)
Here, 
kl
indicates the level of interaction between spins k and l, and s
k
2 f0; 1g is the value
of spin k. It is assumed that 
kl
= 
lk
and 
kk
= 0 for 1  k; l; S. At thermal equilibrium,
spin average hs
k
i of spin k can be calculated using Boltzmann distribution as follows [23]
hs
k
i =
1
1 + e
 
k
=T
(4)
Here, 
k
= hH(s)ij
s
k
=0
  hH(s)ij
s
k
=1
represents the mean eld eecting on spin k, where the
energy average hH(s)i of the system is
hH(s)i =
S
X
k=1
X
l6=k

kl
hs
k
s
l
i+
S
X
k=1
h
k
hs
k
i (5)
The complexity of computing 
k
using Eq. (5) is exponential [25]. However, for large number
9
1. Get the initial temperature T
0
, and set T = T
0
2. Initialize the spin averages hsi = [hs
1
i; : : : ; hs
k
i; : : : ; hs
S
i]
3. While temperature T is in the cooling range DO
3.1 While system is not stabilized for current temperature DO
3.1.1 Select a spin k at random.
3.1.2 Compute 
k
, 
k
=  
P
l6=k

kl
hs
l
i   h
k
3.1.3 Update hs
k
i, hs
k
i = f1 + e
 
k
=T
g
 1
3.2 Update T according to the cooling schedule
Figure 2: The Mean Field Annealing algorithm.
of spins, mean eld approximation can be used to compute the energy average as
hH(s)i =
1
2
S
X
k=1
X
l6=k

kl
hs
k
ihs
l
i +
S
X
k=1
h
k
hs
k
i (6)
Since hH(s)i is linear in hs
k
i, mean eld 
k
can be computed using the following equation.

k
= hH(s)ij
s
k
=0
  hH(s)ij
s
k
=1
=  
@hH(s)i
@hs
i
i
=  
0
@
X
l6=k

kl
hs
l
i + h
k
1
A
(7)
Thus, the complexity of computing 
k
reduces to O(S).
At each temperature, starting with initial spin averages, the mean eld eecting on a randomly
selected spin is computed using Eq. (7). Then, spin average is updated using Eq. (4). This
process is repeated for a random sequence of spins until the system is stabilized for the current
temperature. The general form of the MFA algorithm derived from this iterative relaxation
scheme is shown in Figure (2). The MFA algorithm is used to nd the equilibrium point of a
system of S spins using an annealing process similar to SA.
HNN and SA have a major dierence; SA is an algorithm implemented in software, whereas
HNN is derived with a possible hardware implementation in mind. MFA is somewhere in
between, it is an algorithm implemented in software, having potential for hardware realiza-
tion [24, 25]. In this work MFA is treated as a software algorithm. Performance of MFA is
comparable to other software algorithms as SA and KL, conforming this point of view.
10
4 Mean Field Annealing for the Mapping Problem
In this section, we propose a formulation of the Mean Field Annealing (MFA) algorithm for
the mapping problem. The TIG and PCG models described in Section 2 are used to represent
the mapping problem. The formulation is rst presented for problem instances modeled by
dense TIGs. The modications in the formulation for the mapping problem instances that can
be modeled by sparse TIGs are presented later. In this section, we also present an ecient
implementation scheme for the proposed formulation.
4.1 Formulation
A spin matrix, which consists of N task-rows and K processor-columns, is used as the repre-
sentation scheme. That is, N K spins are used to encode the solution. The output s
ip
of a
spin (i; p) denotes the probability of mapping task i to processor p. Here, s
ip
is a continuous
variable in the range 0  s
ip
 1. When the MFA algorithm reaches to a solution, spin values
converge to either 1 or 0 indicating the result. If s
ip
converges to 1, this means that task i
is mapped to processor p. For example, a solution to the mapping problem instance given in
Figure 1 can be represented by the following N K spin matrix.
K Processors
z }| {
1 2 3 4
N Tasks
8
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
:
1 0 0 0 1
2 0 0 0 1
3 0 1 0 0
4 1 0 0 0
5 0 0 1 0
6 0 1 0 0
7 1 0 0 0
8 0 0 1 0
Note that, this solution is identical to the solution given at the end of Section 2.
Following energy (i.e., cost) function is proposed for the mapping problem
H(s) =
1
2
N
X
i=1
X
j 6=i
K
X
p=1
X
q 6=p
e
ij
s
ip
s
jq
d
pq
+
r
2
N
X
i=1
X
j 6=i
K
X
p=1
s
ip
s
jp
w
i
w
j
(8)
11
Here, e
ij
denotes the edge weight between the pair of tasks i and j, and w
i
denotes the weight
of task i in TIG. Edge weight between processors p and q in PCG is represented by d
pq
. Under
the mean eld approximation, the expression hH(s)i for the expected value of the cost function
will be similar to the expression given for H(s) in Eq. (8). However, in this case, s
ip
, s
iq
and
s
jp
should be replaced with hs
ip
i, hs
iq
i and hs
jp
i respectively. For the sake of simplicity, s
ip
is
used to denote the expected value of spin (i; p) (i.e., spin average hs
ip
i).
In Eq. (8), the term s
ip
 s
jq
denotes the probability that task i and task j are mapped to
two dierent processors p and q, respectively. Hence, the term e
ij
 s
ip
 s
jq
 d
pq
represents
the weighted interprocessor communication overhead introduced due to the mapping of tasks i
and j to dierent processors. Note that, in Eq. (8), the rst quadruple summation term
covers all processor pairs in PCG for each edge pair in TIG. Hence, this term denotes the total
interprocessor communication cost for a mapping represented by an instance of the spin matrix.
Then, minimization of the rst quadruple summation term corresponds to the minimization of
the interprocessor communication overhead.
Second triple summation term in Eq. (8) computes the summation of the inner products of
the weights of the tasks mapped to individual processors. Global minimum of this term occurs
when equal amount of task weights are mapped to each processor. If there is an imbalance
in the mapping, second triple summation term increases with the square of the amount of
the imbalance, penalizing imbalanced mappings. The parameter r in Eq. (8) is introduced to
maintain a balance between the two optimization objectives of the mapping problem.
Using the mean eld approximation described in Eq. (7), the expression for the mean eld 
ip
experienced by spin (i; p) is

ip
=  
@H(s)
@s
ip
=  
N
X
j 6=i
K
X
q 6=p
e
i;j
s
jq
d
pq
  r
N
X
j 6=i
s
jp
w
i
w
j
(9)
In a feasible mapping, each task should be mapped exclusively to a single processor. However,
there exists no penalty term in Eq. (8) to handle this feasibility constraint. This constraint is
explicitly handled while updating the spin values. As is seen in Eq. (4), individual spin average
s
ip
is proportional to e

ip
=T
, i.e. s
ip
 e

ip
=T
. Then, s
ip
can be normalized as
s
ip
=
e

ip
=T
P
K
q=1
e

iq
=T
(10)
12
This normalization enforces the summation of each row of the spin matrix to be equal to unity.
Hence, it is guaranteed that all rows of the spin matrix will have only one spin with output
value 1 when the system is stabilized.
Eq. (9) can be interpreted in the context of the mapping problem as follows. First double sum-
mation term represents the increase in the total interprocessor communication cost by mapping
task i to processor p. Second summation term represents the increase in the computational load
balance cost associated with processors p by mapping task i to processor p. Hence,  
ip
may
be interpreted as the decrease in the overall solution quality by mapping task i to processor p.
Then, in Eq. (10), s
ip
is updated such that the probability of task i being mapped to processor p
increases with increasing mean eld 
ip
experienced by spin (i; p). Hence, the MFA heuristic
can be considered as a gradient-descent type algorithm in this context. However, it is also a
stochastic algorithm, similar to SA, due to the random spin update scheme and the annealing
process.
In the general MFA algorithm given in Figure 2, a randomly chosen spin is updated at a time.
However, in the proposed formulation of MFA for the mapping problem, K spins of a randomly
chosen row of the spin matrix are updated at a time. Mean elds 
ip
, (1  p  K) experienced
by the spins at the i-th row of the spin matrix are computed using Eq. (9) for p = 1; 2; : : : ;K.
Then, the spin averages s
ip
; 1  p  K are updated using Eq. (10) for p = 1; 2; : : : ;K. Each
row update of the spin matrix is referred as a single iteration of the algorithm.
The system is observed after each spin-row update in order to detect the convergence to an
equilibrium state for a given temperature [24]. If energy function H does not decrease after a
certain number of consecutive spin-row updates, this means that the system is stabilized for
that temperature [24]. Then, T is decreased according to the cooling schedule, and iteration
process is re-initiated. Note that, the computation of the energy dierence H necessitates the
computation of H (Eq. (8)) at each iteration. The complexity of computing H is O(N
2
K
2
),
which drastically increases the complexity of one iteration of MFA. Here, we propose an ecient
scheme which reduces the complexity of energy dierence computation by an asymptotical
factor.
The incremental energy change H
ip
due to the incremental change s
ip
in the value of an
13
individual spin (i; p) is
H = H
ip
= 
ip
s
ip
(11)
from Eq. (7). Since, H(s) is linear in s
ip
(see Eq. (8)), above equation is valid for any amount
of change s
ip
in the value of spin (i; p), that is
H = H
ip
= 
ip
s
ip
(12)
At each iteration of the MFA algorithm, K spin values are updated in a synchronous manner.
Hence, Eq. (12) is valid for all spin updates performed in a particular iteration. Thus, energy
dierence due to the spin-row update operation in a particular iteration can be computed as
H =
K
X
p=1

ip
s
ip
(13)
where s
ip
= s
new
ip
  s
old
ip
. The complexity of computing Eq. (13) is only O(K) since mean eld
(
ip
) values are already computed for the spin updates.
The formulation of the MFA algorithm for the mapping problem instances with sparse TIGs is
as follows. The expression given for 
ip
(Eq. (9)) can be modied for sparse TIGs as

ip
=  
X
j2Adj(i)
K
X
q 6=p
e
i;j
s
jq
d
pq
  r
N
X
j 6=i
s
jp
w
i
w
j
(14)
Here, Adj(i) denotes the set of tasks connected to task i in the given TIG. Note that, sparsity
of TIG can only be exploited in the mean eld computations since spin update operations given
in Eq. (10) are dense operations which are not eected by the sparsity of TIG.
Figure 3 illustrates the MFA algorithm proposed for solving the mapping problem. Complex-
ity of computing rst double summation terms in Eq. (9) and Eq. (14) are O(N  K) and
O(d
avg
K) for dense and sparse TIGs respectively. Here, d
avg
denotes the average vertex
degree in the sparse TIG. Second summation operations in Eq. (9) and Eq. (14) are both
O(N) for dense and sparse TIGs. Then, complexity of a single mean eld computation is
O(N K) and O(d
avg
K + N) for dense (Eq. (9)) and sparse (Eq. (14)) TIGs respectively.
Hence, complexity of mean eld computations for a spin row is O(N  K
2
) for dense TIGs,
and O(d
avg
K
2
+N K) for sparse TIGs (step 3.1.2 in Figure 3). Spin update computations
(steps 3.1.3, 3.1.4 and 3.1.6) and energy dierence computation (step 3.1.5) are both O(K)
operations. Hence, the overall complexity of a single MFA iteration is O(N  K
2
) for dense
TIGs, and O(d
avg
K
2
+N K) for sparse TIGs.
14
1. Get the initial temperature T
0
, and set T = T
0
2. Initialize the spin averages s = [s
11
; : : : ; s
ip
; : : : ; s
NK
]
3. While temperature T is in the cooling range DO
3.1 While H is decreasing DO
3.1.1 Select a task i at random.
3.1.2 Compute mean elds of the spins at the i-th row

ip
=  
P
N
j 6=i
P
K
q 6=p
e
i;j
s
jq
d
pq
  r
P
N
j 6=i
s
jp
w
i
w
j
for 1  p  K
3.1.3 Compute the summation
P
K
p=1
e

ip
=T
3.1.4 Compute new spin values at the i-th row
s
new
ip
= e

ip
=T
=
P
K
p=1
e

ip
=T
for 1  p  K
3.1.5 Compute the energy change due to these spin updates
H =
P
K
p=1

ip
(s
new
ip
  s
ip
)
3.1.6 Update the spin values at the i-th row
s
ip
= s
new
ip
for 1  p  K
3.2 T =  T
Figure 3: The proposed MFA algorithm for the mapping problem.
4.2 An Ecient Implementation Scheme
As is mentioned earlier, the MFA algorithm proposed for the mapping problem is an iterative
process. The complexity of a single MFA iteration is mainly due to the mean eld computations.
In this section, we propose an ecient implementation scheme which reduces the complexity of
the mean eld computations, and hence the complexity of the MFA iteration, by asymptotical
factors.
Assume that, i-th spin-row is selected at random for update in a particular iteration. The
expression given for 
ip
(Eq. (9)) can be rewritten by changing the order of the rst double
15
summation as

ip
=  
K
X
q 6=p
d
pq
N
X
j 6=i
e
i;j
s
jq
  r
N
X
j 6=i
s
jp
w
i
w
j
=  
K
X
q 6=p
d
pq

iq
  r 
ip
(15)
where

iq
=
N
X
j 6=i
e
i;j
s
jq
(16)
 
ip
=
N
X
j 6=i
s
jp
w
i
w
j
(17)
Here, 
iq
represents the increase in the interprocessor communication by mapping task i to
a processor other then q (for the current mapping on processor q), assuming uniform unit
communication cost between all pairs of processors in PCG. Similarly, 
ip
represents the increase
in the computational load balance cost associated with processor p, by mapping task i to
processors p (for the current mapping on processor p).
For an ecient implementation, the overall mean eld computations involved in a single itera-
tion can be computed using the following matrix equation

i
=  D
i
  r	
i
=  
i
  r	
i
(18)
Here, D is a K K adjacency matrix representing PCG (i.e. D
pq
= d
pq
), and 
i
, 
i
	
i
and

i
= D
i
are column vectors with K elements, where

i
= [
i1
; : : : ; 
ip
; : : : ; 
iK
]
T

i
= [
i1
; : : : ; 
ip
; : : : ; 
iK
]
T
	
i
= [ 
i1
; : : : ;  
ip
; : : : ;  
iK
]
T

i
= [
i1
; : : : ; 
ip
; : : : ; 
iK
]
T
(19)
The complexity analysis of the proposed implementation scheme for dense TIGs is as follows.
Complexity of computing 
iq
and  
ip
are both O(N). Complexity of constructing 
i
and 	
i
vectors are both O(NK), since both vectors containK such entries. Complexity of computing
the matrix-vector product required in Eq. (18) is O(K
2
). Hence, the overall complexity of
computing the 
i
vector (Eq. (18)) reduces to O(N K +K
2
) = O(N K), since N  K
in general. The complexity of K spin updates and the computation of H are both O(K).
Thus, the proposed scheme reduces the computational complexity of a single MFA iteration to
O(N K) for dense TIGs with N  K.
16
The complexity analysis of the proposed implementation for sparse TIGs is as follows. Note
that, the sparsity of TIG can only be exploited in the computation of 
iq
values since

iq
=
N
X
j2Adj(i)
e
i;j
s
jq
(20)
for sparse TIGs. Hence, the complexity of computing an individual 
iq
is only O(d
avg
). Thus,
the complexity of constructing the 
i
vector reduces to O(d
avg
 K). The complexity of
computing the 
i
vector in Eq. (18) reduces to O(d
avg
K +K
2
). However, the complexity of
constructing the	
i
vector required in Eq. (18) is O(NK), dominating the overall complexity
of the mean eld computations. The complexity of computing the 	
i
vector can be reduced if
the computation of  
ip
in Eq. (17) is re-formulated as
 
ip
=
N
X
j 6=i
s
jp
w
i
w
j
= w
i
N
X
j 6=i
w
j
s
jp
= w
i
(
N
X
j=1
w
j
s
jp
  w
i
s
ip
)
 
ip
= w
i
(
p
  w
i
s
ip
) (21)
where 
p
=
P
N
j=1
w
j
s
jp
. Here, 
p
represents the computational load of processor p, for the
current mapping on processor p. Note that, computationally, 
p
represents the weighted sum
of spin values of the p-th column of the spin matrix. At the beginning of the MFA algorithm,
initial 
p
value for each column p (1  p  K) can be computed for the initial spin values.
Then, 
p
values can be updated at the end of each iteration (i.e. after spin updates) using

new
p
= 
old
p
  w
i
s
old
ip
+ w
i
s
new
ip
for 1  p  K (22)
The computation of initial 
p
values can be excluded from the complexity analysis since they are
computed only once at the very beginning of the algorithm. In this scheme, the computation
of an individual  
ip
using Eq. (21) is an O(1) operation. Hence, the construction of the 	
i
vector required in Eq. (18) becomes an O(K) operation. Thus, the complexity of mean eld
computations involved in a single iteration reduces to O(d
avg
K+K
2
). Note that, the update
of an individual 
p
value (using Eq. (22)) at the end of each iteration is an O(1) operation.
Hence, the overall complexity of 
p
updates is O(K) since K weighted column sums should
be updated at each iteration. Complexity of spin updates and energy dierence computation
are also O(K) for sparse TIGs. Hence, the implementation scheme proposed for sparse TIGs
reduces the complexity of a single MFA iteration to O(d
avg
K +K
2
).
17
5 Performance of Mean Field Annealing Algorithm
This section presents the performance evaluation of the Mean Field Annealing (MFA) algorithm
for the mapping problem, in comparison with two well-known mapping heuristics; Simulated
Annealing (SA) and Kernighan-Lin (KL). Each algorithm is tested using randomly generated
mapping problem instances. Following sections briey present the implementation details of
these algorithms.
5.1 MFA Implementation
The MFA algorithm (Figure 3) described in Section 4 is implemented in order to evaluate its
performance. Cooling process is started from an initial temperature which is found experi-
mentally. It is not feasible to search for an initial temperature for each problem instance, as
this process may take more time than solving the original problem. In order to avoid this, we
performed experiments for only a small number of instances and chose an initial temperature
which works for each one. For the mapping problem instances used in these experiments, initial
temperature was found to be T
0
= 5:0. This value for T
0
is used for all 26 mapping problem
instances involved in the experiments.
Coecient r, which determines the balance between two optimization criteria of the mapping
problem, is computed at the beginning of the MFA algorithm. After the spins are initialized
randomly, r is computed using these initial spin values as
r =
P
N
i=1
P
j 6=i
P
K
p=1
P
q 6=p
e
ij
s
ip
s
jq
d
pq
K 
P
N
i=1
P
j 6=i
P
K
p=1
s
ip
s
jp
w
i
w
j
(23)
As is seen from the equation, r is used for balancing of the two summation terms in the cost
function. Note that, r is inversely proportional to the number of processors.
At each temperature, iterations continue until H <  for L consecutive iterations where
L = N initially. Parameter  is chosen to be 0:5. Cooling process is realized in two phases;
slow cooling followed by fast cooling, similar to the cooling schedules used for SA [18]. In the
slow cooling phase, temperature is decreased using  = 0:9 until T is less than T
0
=1:5. Then,
in the fast cooling phase, L is set to L=4 and  is set to 0:5 and cooling is continued until T is
less then T
0
=5:0. At the end of this cooling process, maximum spin values at each row are set
18
to 1 and all other spin values are set to 0. Then the result is decoded as described in Section 4,
and the resulting mapping is found. Note that, all parameters used in this implementation are
either constants or found automatically. Hence, there is no parameter setting problem.
5.2 Kernighan-Lin Implementation
Kernighan-Lin heuristic is not directly applicable to the mapping problem since it was originally
proposed for graph bipartitioning. The two phase approach is used to apply the KL heuristic to
the mapping problem. In the rst phase, TIG is partitioned into K clusters, where K is equal
to the number of processors. These K clusters are then mapped to PCG using a one-to-one
mapping heuristic in the second phase. One-to-one mapping heuristic used in this work is a
variant of the KL heuristic.
For the clustering phase, Kernighan-Lin heuristic is implemented eciently as described by
Fiduccia and Mattheyses [8]. Two dierent schemes are utilized to apply KL to K-way graph
partitioning. First scheme, partitioning by recursive bisection (KL-RB), recursively partitions
the initial graph into two partitions untilK partitions are obtained. Other scheme, partitioning
by pairwise min-cut (KL-PM), starts with an initial K-way partitioning and then iteratively
minimizes the cutsizes between each pair of partitions until no improvement can be achieved.
In the KL heuristic, computational load balance is maintained implicitly by the algorithm.
Vertex (task) moves causing intolerable load imbalances are not considered.
In the beginning of the second phase, K clusters formed in the rst phase are mapped to the
K processors of the multicomputer randomly. After this initial mapping, communication cost
is minimized by performing a sequence of cluster swaps between processor pairs.
5.3 Simulated Annealing Implementation
The SA algorithm, implemented for solving the mapping problem, uses the one phase approach
to map TIG onto PCG. In simulated annealing, starting from a randomly chosen initial cong-
uration, conguration space is searched for the best solution using a probabilistic hill climbing
algorithm. A conguration of the mapping problem is a mapping between TIG and PCG,
which assigns each task in TIG to a processor in PCG. In order the search the conguration
19
space, neighborhood of a conguration must be dened. For the implementation in this work,
neighborhood of a conguration consists of all congurations which results with moving one
vertex (task) of TIG from the maximum loaded node (processor) of PCG to any other node
of PCG. At each iteration of the simulated annealing algorithm, one of the possible moves is
chosen randomly as a candidate move. Then, the resulting decrease in the total communica-
tion cost caused by the candidate move is calculated without changing the conguration. If
the candidate move decreases the cutsize, it is realized. If it increases the cutsize, then it is
realized with a probability which decreases with the amount of increase in the total cutsize.
Acceptance probabilities of the moves that increase the cost are controlled with a temperature
parameter T which is decreased using an annealing schedule. Hence, as the annealing proceeds
acceptance probabilities of uphill moves decrease. An automatic cooling schedule is used in the
implementation of the SA algorithm [18].
5.4 Experimental Results
In this section, performance of the MFA algorithm is discussed in comparison with the SA and
KL algorithms. These heuristics are experimented by mapping randomly generated TIGs onto
mesh and hypercube connected multicomputers.
Six test TIGs are generated with N = 200 and 400 vertices. Vertices of these TIGs are weighted
by assigning a randomly chosen integer weight between 1 and 10 to each vertex (1  w
i
 10,
for 1  i  N). Interaction patterns among the vertices of these TIGs are constructed as
follows. A maximum vertex degree, d
max
, is selected for each test TIG (d
max
= 8; 16; 32), and
degree d
i
of each vertex i is randomly chosen between 1 and d
max
(i.e. 1  d
i
 d
max
, for
1  i  N). Then, each vertex i of TIG is connected to d
i
randomly chosen vertices. Resulting
edges are weighted randomly with integer values varying between 1 and 10. These TIGs are
mapped to 3-, 4-, 5-dimensional hypercubes and 4  4, 4  8 two dimensional meshes. PCGs
corresponding to these interconnection topologies are constructed assuming software routing as
is described in Section 2.
Tables 1, 2 and 3 illustrate the performance results of the KL-RB, KL-PM, SA and MFA
heuristics for the generated mapping problem instances. In these tables, N and jEj denote
the number of vertices and edges in the test TIGs, respectively, and K denotes the number
20
Table 1: Total communication cost averages (and standard deviations) of the solutions found
by the KL-RB, KL-PM, SA and MFA heuristics, for randomly generated mapping problem
instances.
PROBLEM SIZE AVERAGE COMMUNICATION COST
N jEj K T KL-RB KL-PM SA MFA
200 544 8 H 1807:4 (68:7) 1846:0 (56:2) 1595:1 (28:1) 1701:6 (45:7)
200 544 16 H 2819:9 (67:1) 2747:1 (83:5) 2180:0 (16:3) 2318:2 (35:3)
200 544 32 H 4098:8 (123:3) 4710:4 (102:0) 2881:1 (32:4) 2971:6 (43:4)
200 1120 8 H 5421:9 (56:2) 5494:7 (62:9) 4946:4 (34:7) 5215:8 (83:2)
200 1120 16 H 7742:4 (104:5) 7816:1 (86:4) 6699:1 (54:9) 7013:8 (25:5)
200 1120 32 H 10377:1 (136:7) 11280:2 (153:6) 8495:7 (99:0) 8893:1 (67:4)
200 2152 8 H 12721:6 (152:3) 12959:0 (101:0) 12018:5 (62:1) 12349:0 (208:0)
200 2152 16 H 17828:9 (142:5) 17959:9 (132:6) 16197:0 (79:6) 16519:4 (173:6)
200 2152 32 H 23127:6 (109:8) 24260:3 (131:3) 20393:7 (183:6) 20607:3 (220:1)
400 1227 8 H 4360:6 (69:9) 4444:5 (51:3) 3772:3 (21:9) 4526:9 (66:9)
400 1227 16 H 6096:0 (98:3) 6073:2 (55:8) 5086:4 (33:1) 6046:5 (101:2)
400 1227 32 H 8420:2 (109:1) 7999:9 (100:9) 6466:4 (47:6) 7641:3 (98:4)
400 2283 8 H 11247:1 (126:3) 11491:5 (129:8) 10152:1 (67:6) 10838:7 (60:2)
400 2283 16 H 15566:7 (142:1) 15896:9 (159:2) 13629:6 (46:9) 14591:6 (142:1)
400 2283 32 H 20543:8 (154:8) 20527:1 (203:1) 17199:1 (42:4) 18365:2 (99:5)
400 4298 8 H 25318:3 (164:9) 25832:1 (219:8) 23506:9 (82:9) 25052:7 (147:5)
400 4298 16 H 34590:6 (230:6) 35395:4 (173:9) 31417:7 (84:8) 33597:3 (215:4)
400 4298 32 H 45053:8 (286:2) 45098:1 (300:3) 39507:2 (105:3) 42249:0 (175:8)
200 544 16 M 3364:2 (122:0) 3318:7 (83:4) 2658:5 (53:2) 2726:6 (67:7)
200 544 32 M 5618:7 (217:8) 6822:5 (147:3) 4260:5 (34:4) 4134:3 (101:1)
200 1120 16 M 9234:2 (161:6) 9318:2 (176:1) 8432:3 (135:7) 7875:1 (65:6)
200 1120 32 M 14659:9 (163:4) 16485:4 (104:1) 13556:0 (221:0) 11710:6 (188:2)
400 1227 16 M 7341:4 (105:5) 7357:0 (174:6) 6295:3 (86:3) 7401:6 (131:2)
400 1227 32 M 12207:4 (246:6) 11758:6 (240:5) 9909:5 (80:0) 11619:0 (162:0)
400 2283 16 M 18670:9 (177:8) 19133:0 (200:6) 17484:4 (143:7) 16845:9 (48:9)
400 2283 32 M 29827:0 (375:9) 30156:3 (280:7) 28308:7 (251:5) 25208:3 (174:6)
of processors in the target PCG. Interconnection topology of the target POG is denoted by
T , where H denotes the hypercube interconnection topology and M denotes the mesh inter-
connection topology. Each algorithm is executed 10 times for each problem instance starting
from dierent, randomly chosen initial congurations. Averages and standard deviations of the
results are illustrated in Tables 1, 2 and 3.
Tables 1 and 2 illustrate the quality of the solutions obtained by the KL-RB, KL-PM, SA and
MFA heuristics. Total communication cost averages (and standard deviations) of the solutions
are displayed in Table 1, and percent computational load imbalance averages (and standard
deviations) are displayed in Table 2. Percent load imbalance for each solution is computed
21
Table 2: Percent computational load imbalance averages (and standard deviations) of the
solutions found by the KL-RB, KL-PM, SA, MFA heuristics, for randomly generated mapping
problem instances.
PROBLEM SIZE AVERAGE PERCENT LOAD IMBALANCE
N jEj K T KL-RB KL-PM SA MFA
200 544 8 H 10:2 (1:9) 8:4 (0:8) 2:7 (1:3) 4:5 (1:5)
200 544 16 H 15:1 (1:9) 8:4 (0:4) 7:5 (1:3) 9:4 (1:9)
200 544 32 H 18:8 (1:9) 8:9 (1:4) 16:2 (4:1) 18:8 (3:4)
200 1120 8 H 12:4 (1:4) 9:0 (0:3) 3:1 (1:0) 3:9 (0:5)
200 1120 16 H 16:1 (1:0) 8:4 (0:6) 7:9 (3:3) 9:2 (1:4)
200 1120 32 H 19:7 (2:7) 11:0 (4:6) 21:1 (4:5) 16:3 (3:8)
200 2152 8 H 13:3 (0:7) 9:2 (0:2) 3:7 (1:3) 5:9 (1:5)
200 2152 16 H 17:7 (0:5) 8:7 (0:0) 9:2 (1:7) 14:9 (2:3)
200 2152 32 H 22:5 (2:5) 8:7 (0:0) 18:3 (3:4) 28:5 (3:5)
400 1227 8 H 12:0 (1:2) 9:4 (0:3) 1:6 (0:4) 1:6 (0:5)
400 1227 16 H 14:1 (1:9) 9:6 (0:3) 3:7 (0:9) 2:5 (0:5)
400 1227 32 H 18:8 (1:0) 9:5 (0:3) 7:5 (1:3) 5:4 (0:9)
400 2283 8 H 13:0 (1:3) 9:5 (0:2) 1:7 (1:0) 2:2 (0:8)
400 2283 16 H 16:0 (1:6) 9:3 (0:3) 4:0 (0:9) 4:8 (1:1)
400 2283 32 H 20:4 (1:5) 8:6 (0:3) 8:5 (1:7) 7:8 (1:0)
400 4298 8 H 13:3 (1:2) 9:9 (0:2) 2:1 (0:8) 1:7 (0:4)
400 4298 16 H 16:4 (1:7) 9:4 (0:3) 4:8 (1:5) 4:0 (0:6)
400 4298 32 H 20:3 (2:2) 9:6 (0:2) 7:7 (2:5) 8:4 (1:3)
200 544 16 M 15:2 (1:2) 8:4 (0:4) 8:4 (2:2) 13:0 (2:1)
200 544 32 M 18:3 (2:0) 8:7 (0:0) 15:1 (2:7) 33:9 (3:9)
200 1120 16 M 16:3 (1:5) 8:5 (0:4) 8:5 (1:5) 16:6 (1:4)
200 1120 32 M 19:4 (2:0) 9:4 (2:2) 20:7 (4:2) 37:0 (2:4)
400 1227 16 M 16:1 (1:7) 9:6 (0:2) 3:4 (0:7) 3:4 (0:6)
400 1227 32 M 18:7 (1:7) 9:7 (0:2) 12:0 (4:1) 7:1 (0:7)
400 2283 16 M 16:0 (1:6) 9:4 (0:1) 5:3 (1:1) 10:6 (1:3)
400 2283 32 M 20:1 (1:6) 8:6 (0:2) 10:4 (1:5) 22:6 (1:9)
proportional to the computational load dierence between maximum and minimum loaded
processors. Table 3 displays the execution time averages of the KL-RB, KL-PM, SA and MFA
heuristics. Table 4 is constructed for a better illustration of the overall performance of the MFA
algorithm in comparison with the KL and SA heuristics. For each problem instance, results
given in Tables 1, 2 and 3 are normalized with respect to the results of the MFA algorithm. The
averages of the normalized results of Table 1, Table 2 and Table 3 constitute the rst, second
and fourth rows of Table 4, respectively. The average solution quality for each algorithm is
computed using
SOL'N QUALITY = 1 = (COMM. COST + LOAD IMBALANCE) (24)
Third row of Table 4, illustrates solution quality value of each algorithm normalized with respect
22
Table 3: Execution time averages (in seconds) of the KL-RB, KL-PM, SA and MFA heuristics,
for randomly generated mapping problem instances.
PROBLEM SIZE AVERAGE EXECUTION TIMES
N jEj K T KL-RB KL-PM SA MFA
200 544 8 H 1:1 5:7 67:3 2:3
200 544 16 H 1:5 13:7 127:2 5:6
200 544 32 H 3:3 29:6 245:1 15:6
200 1120 8 H 1:6 7:6 64:1 2:5
200 1120 16 H 2:2 14:6 144:0 6:6
200 1120 32 H 5:1 40:5 282:7 16:6
200 2152 8 H 2:5 10:9 64:2 2:2
200 2152 16 H 3:5 23:7 156:7 5:5
200 2152 32 H 7:6 45:4 373:9 14:0
400 1227 8 H 2:2 10:1 168:9 5:2
400 1227 16 H 3:0 29:7 310:7 12:0
400 1227 32 H 6:4 68:0 681:1 34:0
400 2283 8 H 3:3 16:0 167:1 5:6
400 2283 16 H 4:4 39:8 383:2 15:2
400 2283 32 H 8:6 88:9 632:8 42:0
400 4298 8 H 5:4 25:5 155:3 6:3
400 4298 16 H 7:1 64:9 403:0 15:4
400 4298 32 H 12:6 125:1 604:9 32:2
200 544 16 M 1:5 13:7 135:0 5:7
200 544 32 M 3:3 29:6 258:7 16:5
200 1120 16 M 2:3 14:8 124:2 5:7
200 1120 32 M 5:6 38:4 293:1 13:4
400 1227 16 M 3:1 26:7 280:5 10:9
400 1227 32 M 6:7 60:4 565:1 30:0
400 2283 16 M 4:4 41:7 363:8 13:3
400 2283 32 M 8:7 82:8 573:5 35:0
Table 4: Average performance measures of the KL-RB, KL-PM and SA heuristics normalized
with respect to the MFA heuristic.
KL-RB KL-PM SA MFA
COMM. COST 1:114 1:148 0:957 1:0
LOAD IMBALANCE 2:718 1:714 0:875 1:0
SOL'N QUALITY 0:522 0:699 1:092 1:0
EXECUTION TIME 0:407 2:800 23:308 1:0
23
to the MFA algorithm.
As is seen in Tables 1, 2 and 4, the quality of solutions obtained by the MFA and SA heuristics
are superior to those of the KL-RB and KL-PM heuristics. Solutions produced by SA are
slightly better compared with the solutions produced by MFA, whereas the MFA algorithm is
signicantly faster (23 times on the average). As is seen in Table 3 and 4, average execution
time of the MFA algorithm is comparable with that of the ecient KL heuristic. The MFA
algorithm is 2:8 times faster than the KL-PM heuristic and 2:5 times slower than the KL-RB
heuristic on the average. These results indicate that the proposed MFA algorithm is a promising
alternative heuristic for solving the mapping problem.
6 Parallelization of Mean Field Annealing Algorithm
As is mentioned earlier, heuristic used for solving the mapping problem is a preprocessing
overhead introduced for the ecient implementation of a given parallel program on the target
multicomputer. If the mapping heuristic is implemented sequentially, this preprocessing can be
considered in the serial portion of the parallel program which limits the maximum eciency of
the parallel program on the target machine. For a xed parallel program instance, the execution
time of the parallel program is expected to decrease with increasing number of processors in
the target multicomputer. However, as is seen in Table 3, for a xed TIG, the execution
time of all mapping heuristics increase with increasing number of processors in the target
multicomputer. Hence, the serial fraction of the parallel program will increase with increasing
number of processors. Thus, this preprocessing will begin to constitute a drastic limit on the
maximum eciency of the overall parallelization due to Amdahl's Law. Hence, parallelization
of these mapping heuristics on the target multicomputer is a crucial issue for ecient parallel
implementations.
Unfortunately, parallelization of the mapping heuristics introduces another mapping problem.
The computations of the mapping heuristics should be mapped to the processors of the same
target architecture. However, in this case, the parallel algorithm for the mapping heuristic
should be such that its mapping can be achieved intuitively. Furthermore, the intuitivemapping
should lead to an ecient parallel implementation of the mapping heuristic. For these reasons,
24
the target mapping heuristic to be parallelized should involve regular and inherently parallel
computations. The MFA algorithm proposed in Section 4 for the general mapping problem has
such nice properties for an ecient parallelization. Following paragraphs discuss the ecient
parallelization of the proposed mapping heuristic for multicomputers.
Assume that, the MFA algorithm is used to map a given parallel program represented with
a TIG having N vertices on a target multicomputer with K processors. The MFA algorithm
will use an N K spin matrix for the mapping operation. The question is to map the com-
putations of the MFA algorithm to the same target multicomputer (with the same number of
K processors). As is mentioned earlier, the MFA algorithm is an iterative algorithm. Hence,
the mapping scheme can be devised by analyzing the computations involved in a particular
iteration of the algorithm. Atomic task can be considered as the computations required for
updating an individual spin. Note that, K spin averages at a particular row of the spin matrix
are updated at each iteration. Hence, these K spin updates can be computed in parallel by
mapping each spin in a row of the spin matrix to a distinct processor of the target architecture.
Thus, the N K spin matrix is partitioned column-wise such that each processor is assigned
an individual column of the spin matrix. That is, column p of the spin matrix is mapped
to processor p of the target architecture. Each processor is responsible for maintaining and
updating the spin values in its local column. Assume that, task-i is selected at random in a
particular iteration. Then, each processor is responsible for updating the probability of task i
being mapped to itself.
A single iteration of the MFA algorithm can be considered as a three phase process, namely,
mean eld computation phase, spin update phase, and energy dierence computation phase.
Each processor p should compute its local mean eld value 
ip
(Eq. (9) or Eq. (14)) in the
rst phase, in order to update its local spin value s
ip
(Eq. (10)) in the second phase. As
is mentioned earlier, mean eld computation phase is the most time consuming phase of the
MFA algorithm. Fortunately, mean eld computations are inherently parallel since there are no
interactions among the mean eld computations involved in a particular iteration. However, a
close look to Eq. (9) reveals that each processor needs most recently updated values of all spins
except the ones in the i-th row in order to compute its local mean eld value. Recall that, each
processor maintains only a single column of updated spin values due to the proposed mapping
scheme. Hence, this computational interaction necessitates global interprocessor communication
25
just prior to the distributed mean eld computation at each iteration. The volume of global
interprocessor communication is proportional to O(N  K) for dense TIGs. As is seen in
Eq.(14), the volume of global interprocessor communication is proportional to O(d
avg
K) for
sparse TIGs. The volume of global interprocessor communication can be reduced to O(K) for
both dense and sparse TIGs by considering the parallelization of the matrix equation given in
Eq. (18).
Eq. (18) involves the following operations : construction of the 
i
and 	
i
vectors, dense matrix
vector product 
i
= D
i
and vector addition 
i
=  
i
  r	
i
. Note that, each processor
p only needs to compute the p-th entry 
ip
of the 
i
vector, and the p-th entry  
ip
of the 	
i
vector in order to compute its local mean eld value 
ip
in parallel. The matrix vector product
can be performed in parallel by employing the scalar accumulation (SA-MVP) scheme. In this
scheme, each processor needs only the p-th row d
p
of the dense D matrix and the whole column
vector 
i
.
Each processor p can concurrently compute the p-th entry 
ip
of the 
i
vector using Eq. (16)
or Eq. (20) without any interprocessor communication. Note that, q in these equations should
be replaced by p in these computations. Then, a global collect (GCOL) operation is required
for each processor to obtain a local copy of the 
i
vector. The GCOL operation is essentially
appending K local scalars, in order, into a vector of size K and then duplicating this vector
in the local memory of each processor. The GCOL operation requires global interprocessor
communication. Note that, only K local spin values should be collected globally thus reducing
the volume of communication during the GCOL operation by an asymptotical factor.
After the GCOL operation, each processor has a local copy of the global 
i
vector. Hence, each
processor p can concurrently compute its local 
ip
by performing the inner-product 
ip
= d
p

i
.
Then, each processor p should compute the p-th entry  
ip
of the 	
i
vector. Note that, each
processor p already maintains the 
old
p
value. Hence, each processor can concurrently compute
 
ip
using Eq. (21). Then, each processor p can concurrently compute its local mean eld value

ip
by performing the local computation 
ip
=  
ip
  r 
ip
. Note that, these computations are
completely local computations and involve no interprocessor communication.
The second phase of an individual MFA iteration is highly sequential since global interaction
exists among spin updates due to the normalization process indicated by Eq. (10). Fortunately,
26
this global interaction can be relieved by noting the independent exponentiation operations
involved in the numerator of Eq. (10). Hence, each processor p can concurrently compute its
local e

ip
=T
value. Then, a global sum (GSUM) operation is required for each processor to
obtain a local copy of the global sum of the local exponentiation results. The GSUM operation
requires global interprocessor communication. After the GSUM operation each processor p can
concurrently update its local spin value by computing Eq. (10). After computing s
new
ip
, each
processor p should concurrently update its local 
p
values according to Eq. (22) for the use in
the next iteration.
In the third phase, each processor should compute the same local copy of the global energy
dierence H for global termination detection. Each processor p can concurrently compute its
local energy dierence H
ip
= 
ip
s
ip
= 
ip
(s
new
ip
  s
old
ip
) due to its local spin update. Then,
a GSUM operation, which requires global interprocessor communication, is required for each
processor to compute a local copy of the global sum H =
P
K
p=1
H
ip
.
Hence, the proposed parallel MFA algorithm necessitates three global communication opera-
tions due to the GCOL operation involved during the rst phase and two GSUM operations
involved in the second and third phases. In ne grain multicomputers, the volume of interpro-
cessor communication is the important factor in predicting the complexity of the interprocessor
communication overhead. However, in medium grain multicomputers, the number of communi-
cations is also important since high set-up time overhead is associated with each communication
step. The set-up time is the dominating factor for short messages in such architectures. Note
that, only a single oating-point variable, representing the running sum, is communicated dur-
ing the GSUM operations involved in the last two phases of the parallel MFA algorithm.
Reducing the number of GSUM operations required in the MFA algorithm will be a valuable
asset in achieving ecient implementations on medium grain multicomputers. As seen in
Eq. (13), there is an execution dependency between the computation of the energy dierence
H and spin-row updates. This execution dependency between the second and the third phase
computations can be relieved by rewriting the expression for H as follows
27
H =
K
X
p=1

ip
(s
new
ip
  s
old
ip
) =
K
X
p=1

ip
s
new
ip
 
K
X
p=1

ip
s
old
ip
=
K
X
p=1

ip
e

ip
=T
P
K
q=1
e

iq
=T
 
K
X
p=1

ip
s
old
ip
=
1
A
i
K
X
p=1

ip
e

ip
=T
  C
i
=
B
i
A
i
  C
i
(25)
where A
i
=
P
K
p=1
e

ip
=T
=
P
K
p=1
a
ip
, B
i
=
P
K
p=1

ip
e

ip
=T
=
P
K
p=1
b
ip
and C
i
=
P
K
p=1

ip
s
old
ip
=
P
K
p=1
c
ip
. Hence, after each processor p computes its local a
ip
= e

ip
=T
, b
ip
= 
ip
e

ip
=T
and
c
ip
= 
ip
s
old
ip
values, three global summations A
i
=
P
K
p=1
a
ip
, B
i
=
P
K
p=1
b
ip
and C
i
=
P
K
p=1
c
ip
can be accumulated in a single GSUM operation. After this single GSUM operation, each
processor p can concurrently update its local spin value and compute the same local copy of
the global energy dierence as s
ip
= a
ip
=Ai and H = B
i
=A
i
 C
i
, respectively. Note that, this
scheme reduces the number of GSUM operation from two to one. Three oating point variables,
representing the running sums A
i
, B
i
, and C
i
, are communicated during the communications
involved in the GSUM operation.
The node program (of processor p, for 1  p  K) for a single iteration of the parallel MFA
algorithm proposed for solving the mapping problem is given in Figure 4. Note that, variables
with \ip" and \p" subscripts denote the local variables. Variables with \i" subscripts denote
the global variables which are constructed and duplicated at the local memory of each processor
after performing the indicated global operations. As is seen in Figure 4, the proposed parallel
MFA algorithm exhibits very regular computational structure even for mapping arbitrarily
irregular TIGs. The communication structure is also very regular since it necessitates only
GSUM and GCOL operations. Hence, the proposed parallel MFA algorithm can easily be
implemented on both MIMD and SIMD types of multicomputers.
The parallel communication complexity of a single MFA iteration can be analyzed as follows.
The interconnection schemes used in the processor organization of the multicomputers are usu-
ally symmetric in nature (i.e. POG is symmetric). Hence, GSUM and GCOL type global
operations in such architectures are usually performed by a sequence of concurrent communi-
cation steps. Each communication step, involves concurrent single-hop communications. The
number of concurrent single-hop communications is proportional to the diameter of POG for
both GSUM and GCOL operations. For example, diameters of hypercube and mesh POGs are
log
2
K and K
1=2
, respectively. The overall volume of concurrent interprocessor communications
28
1. Select a task i at random.
2. Compute 
ip
=
P
j2Adj(i)
e
ij
s
jp
3. Perform GCOL operation to obtain a local copy of

i
= [
i1
; : : : ; 
ip
; : : : ; 
iK
]
T
4. Compute the inner product 
ip
= d
p
T
 
i
5. Compute  
ip
= w
i
(
p
  w
i
s
ip
)
6. Compute the local mean eld value 
ip
=  
ip
  r 
ip
7. Compute a
ip
= e

ip
=T
, b
ip
= 
ip
e

ip
=T
and c
ip
= 
ip
s
ip
8. Perform GSUM to compute the local copies of
A
i
=
P
K
p=1
a
ip
B
i
=
P
K
p=1
b
ip
and C
i
=
P
K
p=1
c
ip
9. Compute s
new
ip
= a
ip
=A
i
and then s
ip
= s
new
ip
  s
ip
10. Compute H = B
i
=A
i
  C
i
11.Update 
p
= 
p
+ w
i
s
ip
12.Update s
ip
= s
new
ip
Figure 4: Node program (of processor p, for 1  p  K) for one iteration of the parallel MFA
algorithm for the mapping problem.
is proportional to the diameter and the number of processors (K) of POG for GSUM and
GCOL operations, respectively.
As is seen in Figure 4, the proposed parallel MFA algorithm achieves perfect load balance.
The parallel computational complexity of a single MFA iteration can be obtained as follows.
During the parallel computation of 
ip
values (step 2) each processor performs N   1 (d
i
)
multiplication/addition operations for dense (sparse) TIGs. Here, d
i
denotes the degree of
vertex i in TIG. During the parallel SA-MVP computation (step 4), each processor performs
K multiplication/addition operations for both dense and sparse TIGs since the D matrix is a
dense matrix. Each processor performs the same constant amount of arithmetic operations in
the remaining steps (steps 5-7 and steps 9-12). Hence, the parallel computational complexity of
29
the proposed algorithm is O(N +K) and O(d
avg
+K) for dense and sparse TIGs respectively.
Hence, linear speed-up can easily be achieved if communication overhead remains negligible.
Note that, the number of concurrent communications increases with the diameter of POG (e.g.
log
2
K, K
1=2
), whereas, computational granularity per processor increases with the number of
processors (K) of POG. Hence, percent communication overhead will reduce with increasing
number of processors. Thus, the proposed parallel algorithm is expected to scale even on
medium-to-coarse grain multicomputers.
7 Conclusion
In this paper, recently proposed Mean Field Annealing (MFA) algorithm is formulated for
the mapping problem. An ecient implementation scheme is also developed for the proposed
algorithm. The performance of the proposed algorithm is evaluated in comparison with two
well known heuristics (Simulated Annealing (SA) and Kernighan-Lin (KL)) for a number of
randomly generated mapping problem instances. The qualities of the solutions obtained by the
MFA and SA heuristics are found to be superior to the qualities of the solutions obtained by
the KL heuristic. Execution time of the MFA algorithm is comparable to that of the ecient
KL heuristic. The SA heuristic produces slightly better solutions than the MFA algorithm,
whereas MFA is signicantly faster. An ecient parallel algorithm is also developed for the
proposed MFA heuristic.
References
[1] Arora, R. K., and Rana, S. P., Heuristic algorithms for process assignment in distributed
computing systems. Information Processing Letters, vol. 11, no. 4-5, pp. 199-203, 1980.
[2] Aykanat, C.,

Ozguner, F., Ercal, F., and Sadayappan, P. Iterative algorithms for solution of
large sparse systems of linear equations on hypercubes. IEEE Transactions on Computers,
vol. 37, no. 12, pp. 1554-1567, 1988.
[3] Behnam-Guilani, K. Fast decoupled load ow: the hybrid model. IEEE Transactions on
Power Systems, vol. 3, no. 2, pp. 734-739, 1988.
30
[4] Bokhari, S. H. On the mapping problem. IEEE Trans. Comput., vol. 30, no. 3, pp. 207-214,
1981.
[5] Bultan, T., and Aykanat, C. Parallel mean eld algorithms for the solution of combinatorial
optimization problems. Proc. ICANN-91, vol. 1, pp. 591-596, 1991.
[6] Bultan, T., and Aykanat, C. Circuit Partitioning Using Parallel Mean Field Annealing
Algorithms. Proc. 3rd IEEE Symposium on Parallel Processing, 1991.
[7] Ercal, F., Ramanujam, J., and Sadayappan, P. Task allocation onto a hypercube by recur-
sive mincut bipartitioning. J. Parallel Distrib. Comput. vol. 10, pp. 35-44, 1990.
[8] Fiduccia, C. M., and Mattheyses, R. M. A linear heuristic for improving network partitions.
Proc. Design Automat. Conf., pp. 175-181, 1982.
[9] Hopeld, J. J., and Tank, D. W. `Neural' Computation of Decisions in Optimization Prob-
lems. Biolog. Cybern., vol. 52, pp. 141-152, 1985.
[10] Hopeld, J. J., and Tank, D. W. Computing with neural circuits: a model. Science, Vol.
233, pp. 625-633, August 1986.
[11] Hopeld, J. J., and Tank, D. W. Collective computation in neuronlike circuits, Scientic
American, vol. 257, no. 6, pp. 104-114, 1987.
[12] Indurkhya, B., Stone H. S., and Xi-Cheng, L. Optimal partitioning of randomly generated
distributed programs. IEEE Trans. Software Engrg., vol. 12, no. 3, pp. 483-495, 1986.
[13] Kasahara, H., and Narita, S. Practical multiprocessor scheduling algorithms for ecient
parallel processing. IEEE Trans. Comput., vol. 33, no. 11, pp. 1023-1029, 1984.
[14] Kernighan, B. W., and Lin, S. An ecient heuristic procedure for partitioning graphs. Bell
Syst. Tech. J., vol. 49, pp. 291-307, 1970.
[15] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. Optimization by simulated annealing.
Science, vol. 220, pp. 671-680, 1983.
[16] Lee, S. Y., Chiang, H. D., Lee, K. G., and Ku, B. Y. Parallel power system transient
stability analysis on hypercube multiprocessors. IEEE Transactions on Power Systems,
vol. 6, no. 3, pp. 1337-1343.
31
[17] Peterson, C., and Anderson, J. R. Neural networks and NP-complete optimization prob-
lems; a performance study on the graph bisection problem. Complex Syst. vol. 2, pp. 59-89,
1988.
[18] Peterson, C., and Soderberg, B. A new method for mapping optimization problems onto
neural networks. Int. J. Neural Syst., vol. 1, no. 3, 1989.
[19] Ramanujam, J., Ercal, F., and Sadayappan, P. Task allocation by simulated annealing.
Proc. International Conference on Supercomputing. Boston, MA, May 1988, vol. III, Hard-
ware & Software, pp. 475-497.
[20] Sadayappan, P., and Ercal, F.Nearest-neighbour mapping of nite element graphs onto
processor meshes. IEEE Trans. Comput. vol. 36, no. 12, pp. 1408-1424, 1987.
[21] Sadayappan, P.,Ercal, F., and Ramanujam, J. Cluster partitioning approaches to mapping
parallel programs onto a hypercube. Parallel Computing. vol. 13, pp. 1-16, 1990.
[22] Shield, J. Partitioning concurrent VLSI simulation programs onto a multiprocessor by
simulated annealing. IEEE Proc. Part G, vol. 134, no. 1, pp. 24-28, 1987.
[23] Van den Bout, D. E., and Miller, T. K. A Traveling Salesman Objective Function That
Works. IEEE Int. Conf. Neural Nets, vol. 2, pp. 299-303, 1988.
[24] Van den Bout, D. E., and Miller, T. K. Improving the performance of the Hopeld-Tank
neural network through normalization and annealing. Biolog. Cybern., vol. 62, pp. 129-139,
1989.
[25] Van den Bout, D. E., and Miller, T. K. Graph partitioning using annealed neural networks.
IEEE Trans. Neural Networks, vol. 1, no. 2, pp. 192-203, 1990.
32
