A new mapping heuristic is developed, based on the recently proposed Mean Field Annealing (MFA) algorithm. An efficient implementation scheme, which decreases the complexity of the proposed algorithm by asymptotical factors, is also given. The performance of the proposed MFA algorithm is evaluated in comparison with two well-known heuristics: Simulated Annealing and Kernighan-Lin. Results of the experiments indicate that MFA can be used as an alternative heuristic for solving the mapping problem. The inherent parallelism of MFA is exploited by designing an efficient parallel algorithm for the proposed MFA heuristic.
1 Introduction
[...] among clusters [21]. The problem solved in the clustering phase is identical to the multi-way graph partitioning problem. In the one-to-one mapping phase, each cluster is assigned to an individual processor of the multicomputer such that the total inter-processor communication is minimized [21]. The Kernighan-Lin (KL) [8, 14] and Simulated Annealing (SA) [15] heuristics are two attractive algorithms widely used for solving the mapping problem [7, 19, 21, 22].
Heuristics proposed to solve the mapping problem are compute intensive. Solving the mapping problem can be considered as a preprocessing step performed before the execution of the parallel program on the parallel computer. Sequential execution of the mapping heuristic may introduce unacceptable preprocessing overhead, limiting the efficiency of the parallel implementation. Efficient parallel mapping heuristics are needed in such cases. The KL and SA heuristics are inherently sequential, hence hard to parallelize. Efficient parallelizations of these algorithms remain important issues in parallel processing research.
In this work, a recently proposed algorithm, called Mean Field Annealing (MFA) [18, 24, 25], is formulated for the many-to-one mapping problem. MFA combines the collective computation property of Hopfield Neural Networks (HNN) with the annealing notion of SA. It was originally proposed for solving the traveling salesperson problem, as a working alternative to HNN [23]. Like SA, MFA is a general strategy and can be applied to different problems with suitable formulations. Previous works on MFA [5, 6, 17, 18, 24, 25] show that it can be successfully applied to various combinatorial optimization problems. MFA has the inherent parallelism that exists in most neural network algorithms.
Section 2 presents a formal definition of the mapping problem by modeling the parallel program design process. In Section 3, the general formulation of the MFA heuristic is presented. Section 4 presents the proposed formulation of the MFA algorithm for the mapping problem. An efficient implementation scheme for the proposed algorithm is also described in this section. Section 5 presents the performance evaluation of the MFA algorithm for the mapping problem in comparison with two well-known mapping heuristics, SA and KL. Finally, an efficient parallelization of the MFA algorithm for the mapping problem is proposed in Section 6.
The Mapping Problem
In various classes of problems, the interaction pattern among the tasks is static. Hence, the decomposition of the algorithm can be represented by a static task graph. Vertices of this graph represent the atomic tasks, and the edge set represents the interaction pattern among the tasks. Relative computational costs of atomic tasks can be known or estimated prior to the execution of the parallel program. Hence, weights can be associated with the vertices in order to denote the computational costs of the corresponding tasks.
Two different models, the Task Precedence Graph (TPG) and the Task Interaction Graph (TIG), are used for modeling static task interaction patterns [13, 20]. A TPG is a directed graph where directed edges represent execution dependencies. Each edge denotes a pair of tasks: source and destination. The destination task can only be executed after the completion of the execution of the source task. In general, only the subsets of tasks which are unreachable from each other in the TPG can be executed independently.
In the TIG model, interaction patterns are represented by undirected edges between vertices. In this model, each atomic task can be executed simultaneously and independently. Each edge denotes the need for bidirectional interaction between the corresponding pair of tasks at the completion of the execution of these tasks. Edges may be associated with weights which denote the amount of bidirectional information exchange involved between pairs of tasks. A TIG usually represents the repeated execution of the tasks with intervening task interactions denoted by the edges.
The TIG model may seem to be unrealistic for general applications, since it does not consider the temporal interaction dependencies among the tasks [20]. However, there are various classes of problems which can be successfully modeled with the TIG model. For example, iterative solution of systems of equations arising in finite element applications [2, 20] and power system simulations [3, 16], and VLSI simulation programs [22] are represented by TIGs. In this paper, problems which can be represented by the TIG model are addressed.
In order to solve the mapping problem, the parallel architecture must also be modeled in a way that represents its architectural features. Parallel architectures can easily be represented by a Processor Organization Graph (POG), where nodes represent the processors and edges represent the communication links. In fact, the POG is a graphical representation of the interconnection topology utilized for the organization of the processors of the parallel architecture. In general, nodes and edges of a POG are not associated with weights, since most of the commercially available multicomputer architectures are homogeneous with identical processors and communication links.
[...]
$$CC = \sum_{(i,j) \in G_T,\; M(i) \neq M(j)} e_{ij}\, d_{M(i)M(j)} \qquad (1)$$
while maintaining the computational load ($CL_p$: computational load of processor $p$) of each processor balanced. Here, $M(i) = p$ denotes the label ($p$) of the processor that task $i$ is mapped to. In Eq. (1), each edge $(i, j)$ of $G_T$ contributes to the communication cost (CC) only if vertices $i$ and $j$ are mapped to two different nodes of $G_P$, i.e., $M(i) \neq M(j)$. The amount of contribution is equal to the product of the volume of interaction $e_{ij}$ between these two tasks and the unit communication cost $d_{pq}$ between processors $p$ and $q$, where $p = M(i)$ and $q = M(j)$. The computational load of a processor is the summation of the weights of the tasks assigned to that processor. Perfect load balance is achieved if $CL_p = (\sum_{i=1}^{N} w_i)/K$ for each $p$, $1 \le p \le K$. Computational load balance of the processors can be explicitly included in the cost function using a term which is minimized when all processor loads are equal. Another scheme is to include load balance criteria implicitly in the algorithm.
[...] In Figure 1, numbers inside the circles denote the vertex labels, and numbers within the parentheses denote the vertex or edge weights. The binary labeling of the 2-dimensional hypercube is also given in Figure 1.
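The cost definitions above can be illustrated with a minimal Python sketch. It evaluates the communication cost described for Eq. (1) and the per-processor computational loads for a given mapping. The function names, the dictionary representation of the TIG edges, and the use of Hamming distance as the unit communication cost $d_{pq}$ on a hypercube are assumptions made for this illustration, not details taken from the paper.

```python
def hypercube_distance(p, q):
    """Assumed d_pq: number of differing bits in the binary processor labels."""
    return bin(p ^ q).count("1")


def communication_cost(edges, mapping, dist):
    """Sum e_ij * d_{M(i)M(j)} over TIG edges whose endpoints sit on different processors.

    edges:   dict {(i, j): e_ij}, one entry per undirected TIG edge
    mapping: list where mapping[i] is the processor M(i) of task i
    dist:    K x K matrix of unit communication costs d_pq
    """
    return sum(weight * dist[mapping[i]][mapping[j]]
               for (i, j), weight in edges.items()
               if mapping[i] != mapping[j])


def processor_loads(weights, mapping, num_procs):
    """CL_p: sum of the weights of the tasks assigned to each processor p."""
    loads = [0] * num_procs
    for task, proc in enumerate(mapping):
        loads[proc] += weights[task]
    return loads


# Hypothetical 4-task instance mapped onto a 2-dimensional hypercube (K = 4).
dist = [[hypercube_distance(p, q) for q in range(4)] for p in range(4)]
edges = {(0, 1): 2, (1, 2): 1, (2, 3): 1, (0, 3): 1}
mapping = [0, 0, 3, 3]
print(communication_cost(edges, mapping, dist))
print(processor_loads([1, 2, 1, 2], mapping, 4))
```

Edges whose endpoints share a processor contribute nothing, which is why clustering heavily interacting tasks reduces CC at the possible expense of load balance.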
Mean Field Annealing
Mean Field Annealing (MFA) merges the collective computation and annealing properties of Hopfield Neural Networks (HNN) [9, 10, 11] and Simulated Annealing (SA) [15], respectively, to obtain a general algorithm for solving combinatorial optimization problems. HNN has been used for solving various optimization problems, and reasonable results are obtained for small problem sizes [9]. However, simulations of this network reveal that it is hard to obtain feasible solutions for large problem sizes. Hence, the algorithm does not have a good scaling property, which is a very important performance criterion for heuristic optimization algorithms. MFA is proposed as a successful alternative to HNN [18, 23, 24, 25]. In the MFA algorithm, the problem representation is identical to that of HNN [9, 23, 24], but the iterative scheme used to relax the system is different. MFA can be used for solving a combinatorial optimization problem by choosing a representation scheme in which the final states of the spins can be decoded as a solution to the target problem. Then, an energy function is constructed whose global minimum value corresponds to the best solution of the problem to be solved. MFA is expected to compute the best solution to the target problem, starting from a randomly chosen initial state, by minimizing this energy function.
The MFA algorithm is derived by making an analogy to the Ising spin model, which is used to estimate the state of a system of particles or spins in thermal equilibrium. This method was first proposed for solving the traveling salesperson problem [23], and it was then applied to the graph partitioning problem [5, 6, 17, 25]. Here, the general formulation of the MFA algorithm [25] is given for the sake of completeness. In the Ising spin model, the energy of a system with $S$ spins has the following form:
[...]
Here, $\beta_{kl}$ indicates the level of interaction between spins $k$ and $l$, and $s_k \in \{0, 1\}$ is the value of spin $k$. It is assumed that $\beta_{kl} = \beta_{lk}$ and $\beta_{kk} = 0$ for $1 \le k, l \le S$. At thermal equilibrium, the spin average $\langle s_k \rangle$ of spin $k$ can be calculated using the Boltzmann distribution as follows [23]:
$$\langle s_k \rangle = \frac{1}{1 + e^{-\phi_k / T}} \qquad (4)$$
Here, $\phi_k = \langle H(s) \rangle|_{s_k = 0} - \langle H(s) \rangle|_{s_k = 1}$ represents the mean field acting on spin $k$, where the energy average $\langle H(s) \rangle$ of the system is [...] The complexity of computing $\phi_k$ using Eq. (5) [...]
Thus, the complexity of computing $\phi_k$ reduces to $O(S)$.
At each temperature, starting with the initial spin averages, the mean field acting on a randomly selected spin is computed using Eq. (7). Then, the spin average is updated using Eq. (4). This process is repeated for a random sequence of spins until the system is stabilized for the current temperature. The general form of the MFA algorithm derived from this iterative relaxation scheme is shown in Figure 2. The MFA algorithm is used to find the equilibrium point of a system of $S$ spins using an annealing process similar to SA.
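The relaxation scheme just described can be sketched in Python as follows. Since the energy expression and the mean field equation are not legible in this copy, the sketch assumes the standard quadratic Ising form $H(s) = \frac{1}{2}\sum_k \sum_{l \neq k} \beta_{kl} s_k s_l + \sum_k h_k s_k$, for which the mean field is $\phi_k = -(\sum_{l \neq k} \beta_{kl} s_l + h_k)$. The bias term `h`, the geometric cooling factor `alpha`, the fixed number of updates per temperature, and all function names are illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + e^{-x})."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mean_field(beta, h, s, k):
    """phi_k for the assumed quadratic energy; costs O(S) per spin."""
    return -(sum(beta[k][l] * s[l] for l in range(len(s)) if l != k) + h[k])


def mfa(beta, h, t0=1.0, t_min=0.01, alpha=0.9, sweeps=50, seed=0):
    """Anneal the spin averages of S spins; returns the final averages."""
    rng = random.Random(seed)
    n = len(h)
    # Start near 0.5 with a small random perturbation.
    s = [0.5 + 0.01 * (rng.random() - 0.5) for _ in range(n)]
    t = t0
    while t > t_min:
        # Simplification: a fixed number of random spin updates stands in
        # for the stabilization test used at each temperature.
        for _ in range(sweeps * n):
            k = rng.randrange(n)
            s[k] = sigmoid(mean_field(beta, h, s, k) / t)  # Eq. (4)
        t *= alpha
    return s


# Two mutually repelling spins; the bias favors turning spin 0 on.
print(mfa(beta=[[0.0, 2.0], [2.0, 0.0]], h=[-1.0, -0.5]))
```

At high temperature the spin averages stay close to 0.5; as the temperature falls, the sigmoid sharpens and the averages drift toward 0 or 1, which is what allows the final state to be decoded as a discrete solution.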
HNN and SA have a major difference: SA is an algorithm implemented in software, whereas HNN is derived with a possible hardware implementation in mind. MFA is somewhere in between: it is an algorithm implemented in software that has potential for hardware realization [24, 25]. In this work, MFA is treated as a software algorithm. The performance of MFA is comparable to that of other software algorithms such as SA and KL, confirming this point of view.
Mean Field Annealing for the Mapping Problem
In this section, we propose a formulation of the Mean Field Annealing (MFA) algorithm for the mapping problem. The TIG and PCG models described in Section 2 are used to represent the mapping problem. The formulation is first presented for problem instances modeled by dense TIGs. The modifications in the formulation for the mapping problem instances that can be modeled by sparse TIGs are presented later. In this section, we also present an efficient implementation scheme for the proposed formulation.
Formulation
A spin matrix, which consists of $N$ task-rows and $K$ processor-columns, is used as the representation scheme. [...] Figure 1 can be represented by the following $N \times K$ spin matrix. [...] The following energy (i.e., cost) function is proposed for the mapping problem:
$$H(s) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j \neq i} \sum_{p=1}^{K} \sum_{q \neq p} e_{ij}\, s_{ip}\, s_{jq}\, d_{pq} + \frac{r}{2} \sum_{i=1}^{N} \sum_{j \neq i} \sum_{p=1}^{K} s_{ip}\, s_{jp}\, w_i\, w_j \qquad (8)$$
Here, $e_{ij}$ denotes the edge weight between the pair of tasks $i$ and $j$, and $w_i$ denotes the weight of task $i$ in the TIG. The edge weight between processors $p$ and $q$ in the PCG is represented by $d_{pq}$. Under the mean field approximation, the expression $\langle H(s) \rangle$ for the expected value of the cost function will be similar to the expression given for $H(s)$ in Eq. (8). However, in this case, $s_{ip}$, $s_{jq}$ and $s_{jp}$ should be replaced with $\langle s_{ip} \rangle$, $\langle s_{jq} \rangle$ and $\langle s_{jp} \rangle$, respectively. For the sake of simplicity, $s_{ip}$ is used to denote the expected value of spin $(i, p)$ (i.e., the spin average $\langle s_{ip} \rangle$).
In Eq. (8), the term $s_{ip} s_{jq}$ denotes the probability that task $i$ and task $j$ are mapped to two different processors $p$ and $q$, respectively. Hence, the term $e_{ij} s_{ip} s_{jq} d_{pq}$ represents the weighted interprocessor communication overhead introduced due to the mapping of tasks $i$ and $j$ to different processors. Note that, in Eq. (8) [...] Using the mean field approximation described in Eq. (7) [...] In a feasible mapping, each task should be mapped exclusively to a single processor. However, there exists no penalty term in Eq. (8) [...] (10)
This normalization enforces the summation of each row of the spin matrix to be equal to unity. Hence, it is guaranteed that all rows of the spin matrix will have only one spin with output value 1 when the system is stabilized.
Eq. (9) can be interpreted in the context of the mapping problem as follows. The first double summation term represents the increase in the total interprocessor communication cost by mapping task $i$ to processor $p$. The second summation term represents the increase in the computational load balance cost associated with processor $p$ by mapping task $i$ to processor $p$. Hence, $-\phi_{ip}$ may be interpreted as the decrease in the overall solution quality by mapping task $i$ to processor $p$. Then, in Eq. (10), $s_{ip}$ is updated such that the probability of task $i$ being mapped to processor $p$ increases with increasing mean field $\phi_{ip}$ experienced by spin $(i, p)$. Hence, the MFA heuristic can be considered as a gradient-descent type algorithm in this context. However, it is also a stochastic algorithm, similar to SA, due to the random spin update scheme and the annealing process.
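The interpretation above can be made concrete with a short Python sketch of one spin-row update. Because Eqs. (9) and (10) are not legible in this copy, the sketch assumes the forms implied by the surrounding description: the mean field is taken as $\phi_{ip} = -(\sum_{j \neq i} \sum_{q \neq p} e_{ij} s_{jq} d_{pq} + r \sum_{j \neq i} s_{jp} w_i w_j)$, and the normalized update as $s_{ip} = e^{\phi_{ip}/T} / \sum_{q} e^{\phi_{iq}/T}$. Both forms, the function names, and the max-shift used for numerical stability are assumptions.

```python
import math


def mean_fields(i, s, e, d, w, r):
    """Assumed form of Eq. (9): phi_ip for every processor p of task row i."""
    n, k = len(s), len(s[0])
    phi = []
    for p in range(k):
        # Increase in communication cost if task i is placed on processor p.
        comm = sum(e[i][j] * s[j][q] * d[p][q]
                   for j in range(n) if j != i
                   for q in range(k) if q != p)
        # Increase in load balance cost on processor p.
        load = sum(s[j][p] * w[i] * w[j] for j in range(n) if j != i)
        phi.append(-(comm + r * load))
    return phi


def update_row(phi, temperature):
    """Assumed form of Eq. (10): normalized exponentials, so the row sums to 1."""
    shift = max(phi)  # subtracting the maximum leaves the ratios unchanged
    exps = [math.exp((x - shift) / temperature) for x in phi]
    total = sum(exps)
    return [x / total for x in exps]


# Hypothetical instance: task 1 is fixed on processor 0, task 0 is undecided.
s = [[0.5, 0.5], [1.0, 0.0]]
e = [[0, 3], [3, 0]]
d = [[0, 1], [1, 0]]
phi = mean_fields(0, s, e, d, w=[1, 1], r=1.0)
print(phi, update_row(phi, temperature=1.0))
```

In this instance the heavy edge between the two tasks outweighs the load penalty, so the update shifts probability toward placing task 0 on the same processor as task 1; a larger `r` would tip the balance the other way.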
In the general MFA algorithm given in Figure 2, a randomly chosen spin is updated at a time. However, in the proposed formulation of MFA for the mapping problem, the $K$ spins of a randomly chosen row of the spin matrix are updated at a time. The mean fields $\phi_{ip}$ $(1 \le p \le K)$ experienced by the spins at the $i$-th row of the spin matrix are computed using Eq. (9) for $p = 1, 2, \ldots, K$. Then, the spin averages $s_{ip}$, $1 \le p \le K$, are updated using Eq. (10) for $p = 1, 2, \ldots, K$. Each row update of the spin matrix is referred to as a single iteration of the algorithm.
The system is observed after each spin-row update in order to detect the convergence to an equilibrium state for a given temperature [24]. If the energy function $H$ does not decrease after a certain number of consecutive spin-row updates, this means that the system is stabilized for that temperature [24]. Then, $T$ is decreased according to the cooling schedule, and the iteration process is re-initiated. Note that the computation of the energy difference $\Delta H$ necessitates the computation of $H$ (Eq. (8)), which drastically increases the complexity of one iteration of MFA. Here, we propose an efficient scheme which reduces the complexity of the energy difference computation by an asymptotical factor. The incremental energy change $\delta H_{ip}$ due to the incremental change $\delta s_{ip}$ in the value of an individual spin $(i, p)$ is
$$\delta H = \delta H_{ip} = \phi_{ip}\, \delta s_{ip} \qquad (11)$$
from Eq. (7). Since $H(s)$ is linear in $s_{ip}$ (see Eq. (8)) [...] (12)
At each iteration of the MFA algorithm, $K$ spin values are updated in a synchronous manner. Hence, Eq. (12) is valid for all spin updates performed in a particular iteration. Thus, the energy difference due to the spin-row update operation in a particular iteration can be computed as
$$\Delta H = \sum_{p=1}^{K} \phi_{ip}\, \Delta s_{ip} \qquad (13)$$
where $\Delta s_{ip} = s_{ip}^{new} - s_{ip}^{old}$. The complexity of computing Eq. (13) [...]
Here, $Adj(i)$ denotes the set of tasks connected to task $i$ in the given TIG. Note that the sparsity of the TIG can only be exploited in the mean field computations, since the spin update operations given in Eq. (10) are dense operations which are not affected by the sparsity of the TIG.
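As a hedged illustration of how $Adj(i)$ may enter the mean field computation, assume the mean field has the form suggested by its interpretation in the text, namely the negated sum of a communication term and a load balance term. Since $e_{ij} = 0$ whenever $j \notin Adj(i)$, the communication sum can then be restricted to the neighbors of task $i$. The expression below is a sketch under that assumption and is not a transcription of the paper's own sparse equation.

```latex
\phi_{ip} \;=\; -\Bigl(\,
    \sum_{j \in Adj(i)} \; \sum_{q \neq p} e_{ij}\, s_{jq}\, d_{pq}
    \;+\; r \sum_{j \neq i} s_{jp}\, w_i\, w_j
\Bigr)
```

Under this assumption, evaluating the communication term directly for all $K$ spins of row $i$ costs $O(|Adj(i)|\,K^2)$ instead of $O(N K^2)$, whereas the load balance term still ranges over all tasks and the row normalization remains a dense $O(K)$ operation.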
An Efficient Implementation Scheme
As mentioned earlier, the MFA algorithm proposed for the mapping problem is an iterative process. The complexity of a single MFA iteration is mainly due to the mean field computations. In this section, we propose an efficient implementation scheme which reduces the complexity of the mean field computations, and hence the complexity of the MFA iteration, by asymptotical factors.
Assume that the $i$-th spin-row is selected at random for update in a particular iteration. The expression given for $\phi_{ip}$ (Eq. (9)) [...]
Here, $\psi_{iq}$ represents the increase in the interprocessor communication by mapping task $i$ to a processor other than $q$ (for the current mapping on processor $q$), assuming uniform unit communication cost between all pairs of processors in the PCG. Similarly, $\lambda_{ip}$ represents the increase in the computational load balance cost associated with processor $p$ by mapping task $i$ to processor $p$ (for the current mapping on processor $p$).
For an efficient implementation, the overall mean field computations involved in a single iteration can be computed using the following matrix equation [...]
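One way such a factoring can reduce the work is sketched below in Python. It is an illustration under stated assumptions rather than the paper's scheme, whose matrix equation is not legible in this copy. The sketch assumes the mean field has the form $\phi_{ip} = -(\sum_{j \neq i} \sum_{q \neq p} e_{ij} s_{jq} d_{pq} + r \sum_{j \neq i} s_{jp} w_i w_j)$ and precomputes two per-row vectors, here called `psi` and `lam` (placeholder names), so that the nested sum over tasks and processors is replaced by a product with the distance matrix.

```python
import random


def mean_fields_direct(i, s, e, d, w, r):
    """Direct evaluation: O(N K^2) work for the K mean fields of row i."""
    n, k = len(s), len(s[0])
    phi = []
    for p in range(k):
        comm = sum(e[i][j] * s[j][q] * d[p][q]
                   for j in range(n) if j != i
                   for q in range(k) if q != p)
        load = sum(s[j][p] * w[i] * w[j] for j in range(n) if j != i)
        phi.append(-(comm + r * load))
    return phi


def mean_fields_factored(i, s, e, d, w, r):
    """Factored evaluation: O(N K) for psi and lam, then O(K^2) to combine."""
    n, k = len(s), len(s[0])
    others = [j for j in range(n) if j != i]
    # psi[q]: interaction volume between task i and the tasks currently on q.
    psi = [sum(e[i][j] * s[j][q] for j in others) for q in range(k)]
    # lam[p]: load balance cost of adding task i to processor p.
    lam = [w[i] * sum(w[j] * s[j][p] for j in others) for p in range(k)]
    return [-(sum(d[p][q] * psi[q] for q in range(k) if q != p) + r * lam[p])
            for p in range(k)]


def random_instance(n, k, seed=0):
    """Hypothetical test data: symmetric TIG weights, linear-array distances."""
    rng = random.Random(seed)
    e = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            e[a][b] = e[b][a] = rng.choice([0.0, 1.0, 2.0])
    d = [[abs(p - q) for q in range(k)] for p in range(k)]
    w = [rng.randint(1, 5) for _ in range(n)]
    s = []
    for _ in range(n):
        row = [rng.random() for _ in range(k)]
        total = sum(row)
        s.append([x / total for x in row])
    return s, e, d, w


s, e, d, w = random_instance(6, 4)
print(mean_fields_direct(0, s, e, d, w, 0.5))
print(mean_fields_factored(0, s, e, d, w, 0.5))
```

Both functions compute the same values up to rounding; the factored version simply reorders the summations, which is valid because $d_{pq}$ does not depend on $j$.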
The complexity analysis of the proposed implementation scheme for dense TIGs is as follows. [...]
This section presents the performance evaluation of the Mean Field Annealing (MFA) algorithm for the mapping problem, in comparison with two well-known mapping heuristics: Simulated Annealing (SA) and Kernighan-Lin (KL). Each algorithm is tested using randomly generated mapping problem instances. The following sections briefly present the implementation details of these algorithms.
MFA Implementation
The MFA algorithm (Figure 3) described in Section 4 is implemented in order to evaluate its performance. The cooling process is started from an initial temperature which is found experimentally. It is not feasible to search for an initial temperature for each problem instance, as this process may take more time than solving the original problem. In order to avoid this, we performed experiments for only a small number of instances and chose an initial temperature which works for each one. For the mapping problem instances used in these experiments, the initial temperature was found to be $T_0 = 5.0$. This value for $T_0$ is used for all 26 mapping problem instances involved in the experiments.
Coefficient $r$, which determines the balance between the two optimization criteria of the mapping problem, is computed at the beginning of the MFA algorithm. After the spins are initialized randomly, $r$ is computed using these initial spin values as
$$r = \frac{\sum_{i=1}^{N} \sum_{j \neq i} \sum_{p=1}^{K} \sum_{q \neq p} e_{ij}\, s_{ip}\, s_{jq}\, d_{pq}}{K \sum_{i=1}^{N} \sum_{j \neq i} \sum_{p=1}^{K} s_{ip}\, s_{jp}\, w_i\, w_j} \qquad (23)$$
As is seen from the equation, $r$ is used for balancing the two summation terms in the cost function. Note that $r$ is inversely proportional to the number of processors.
At each temperature, iterations continue until $\Delta H < \epsilon$ for $L$ consecutive iterations, where $L = N$ initially. Parameter $\epsilon$ is chosen to be 0.5. The cooling process is realized in two phases, slow cooling followed by fast cooling, similar to the cooling schedules used for SA [18]. In the slow cooling phase, the temperature is decreased using $\alpha = 0.9$ until $T$ is less than $T_0/1.5$. Then, in the fast cooling phase, $L$ is set to $L/4$ and $\alpha$ is set to 0.5, and cooling is continued until $T$ is less than $T_0/5.0$. At the end of this cooling process, the maximum spin value at each row is set to 1 and all other spin values are set to 0. Then the result is decoded as described in Section 4, and the resulting mapping is found. Note that all parameters used in this implementation are either constants or found automatically. Hence, there is no parameter setting problem.
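The two-phase cooling schedule and the stabilization test described above can be sketched in Python as follows. The numerical values (initial temperature 5.0, cooling factors 0.9 and 0.5, thresholds $T_0/1.5$ and $T_0/5.0$, threshold 0.5, $L = N$ reduced to $L/4$) come from the text; the function names, the use of integer division for $L/4$, and the choice to keep iterating at the first temperature below $T_0/1.5$ (in the fast phase) are assumptions of this sketch.

```python
def cooling_schedule(t0=5.0, n_tasks=200):
    """Yield (temperature, L) pairs: slow cooling first, then fast cooling."""
    t, length = t0, n_tasks
    while t >= t0 / 1.5:          # slow phase, cooling factor 0.9
        yield t, length
        t *= 0.9
    length = max(1, length // 4)  # fast phase uses L/4
    while t >= t0 / 5.0:          # fast phase, cooling factor 0.5
        yield t, length
        t *= 0.5


def iterations_until_stable(delta_h_values, eps=0.5, length=200):
    """Count iterations until the energy change stays below eps for `length`
    consecutive iterations; returns None if that never happens."""
    quiet = 0
    for count, delta_h in enumerate(delta_h_values, start=1):
        quiet = quiet + 1 if delta_h < eps else 0
        if quiet >= length:
            return count
    return None


for temperature, length in cooling_schedule():
    print(round(temperature, 5), length)
```

With these settings the schedule visits only a handful of temperatures, which is consistent with the short execution times reported for MFA relative to SA.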
Kernighan-Lin Implementation
The Kernighan-Lin heuristic is not directly applicable to the mapping problem, since it was originally proposed for graph bipartitioning. The two-phase approach is used to apply the KL heuristic to the mapping problem. In the first phase, the TIG is partitioned into $K$ clusters, where $K$ is equal to the number of processors. These $K$ clusters are then mapped to the PCG using a one-to-one mapping heuristic in the second phase. The one-to-one mapping heuristic used in this work is a variant of the KL heuristic.
For the clustering phase, the Kernighan-Lin heuristic is implemented efficiently as described by Fiduccia and Mattheyses [8]. Two different schemes are utilized to apply KL to $K$-way graph partitioning. The first scheme, partitioning by recursive bisection (KL-RB), recursively partitions the initial graph into two partitions until $K$ partitions are obtained. The other scheme, partitioning by pairwise min-cut (KL-PM), starts with an initial $K$-way partitioning and then iteratively minimizes the cutsizes between each pair of partitions until no improvement can be achieved.
In the KL heuristic, computational load balance is maintained implicitly by the algorithm. Vertex (task) moves causing intolerable load imbalances are not considered.
At the beginning of the second phase, the $K$ clusters formed in the first phase are mapped to the $K$ processors of the multicomputer randomly. After this initial mapping, the communication cost is minimized by performing a sequence of cluster swaps between processor pairs.
Simulated Annealing Implementation
The SA algorithm, implemented for solving the mapping problem, uses the one-phase approach to map the TIG onto the PCG. In simulated annealing, starting from a randomly chosen initial configuration, the configuration space is searched for the best solution using a probabilistic hill climbing algorithm. A configuration of the mapping problem is a mapping between the TIG and the PCG, which assigns each task in the TIG to a processor in the PCG. In order to search the configuration space, the neighborhood of a configuration must be defined. For the implementation in this work, the neighborhood of a configuration consists of all configurations which result from moving one vertex (task) of the TIG from the maximum loaded node (processor) of the PCG to any other node of the PCG. At each iteration of the simulated annealing algorithm, one of the possible moves is chosen randomly as a candidate move. Then, the resulting decrease in the total communication cost caused by the candidate move is calculated without changing the configuration. If the candidate move decreases the cutsize, it is realized. If it increases the cutsize, then it is realized with a probability which decreases with the amount of increase in the total cutsize. Acceptance probabilities of the moves that increase the cost are controlled with a temperature parameter $T$, which is decreased using an annealing schedule. Hence, as the annealing proceeds, acceptance probabilities of uphill moves decrease. An automatic cooling schedule is used in the implementation of the SA algorithm [18].
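The neighborhood and acceptance rule described above can be sketched in Python. The text only states that the acceptance probability of an uphill move decreases with the cost increase and with falling temperature; the Metropolis rule $e^{-\Delta/T}$ used below is a common choice and is an assumption here, as are the function names, the tie-breaking among equally loaded processors, and the acceptance of zero-cost moves.

```python
import math
import random


def candidate_move(mapping, weights, num_procs, rng):
    """Pick a random task on the most heavily loaded processor and a random
    destination among the other processors."""
    loads = [0] * num_procs
    for task, proc in enumerate(mapping):
        loads[proc] += weights[task]
    heaviest = max(range(num_procs), key=loads.__getitem__)
    task = rng.choice([t for t, p in enumerate(mapping) if p == heaviest])
    target = rng.choice([p for p in range(num_procs) if p != heaviest])
    return task, target


def accept_move(delta_cost, temperature, rng=random):
    """Assumed Metropolis rule: always accept non-increasing cost, otherwise
    accept with probability exp(-delta_cost / temperature)."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)


rng = random.Random(1)
task, target = candidate_move([0, 0, 0, 1], [1, 1, 1, 1], 3, rng)
print(task, target, accept_move(2.0, 1.0, rng))
```

Restricting moves to tasks on the maximum loaded processor is what keeps the load balanced implicitly, so the cost being annealed only needs to measure communication.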
Experimental Results
In this section, the performance of the MFA algorithm is discussed in comparison with the SA and KL algorithms. These heuristics are tested by mapping randomly generated TIGs onto mesh-connected and hypercube-connected multicomputers.
Six test TIGs are generated with $N = 200$ and $400$ vertices. Vertices of these TIGs are weighted by assigning a randomly chosen integer weight between 1 and [...] heuristics for the generated mapping problem instances. In these tables, $N$ and $|E|$ denote the number of vertices and edges in the test TIGs, respectively, and $K$ denotes the number [...] Tables 1, 2 and 3.
Tables 1 and 2 illustrate the quality of the solutions obtained by the KL-RB, KL-PM, SA and MFA heuristics. Total communication cost averages (and standard deviations) of the solutions are displayed in Table 1, and percent computational load imbalance averages (and standard deviations) are displayed in Table 2. The percent load imbalance for each solution is computed proportional to the computational load difference between the maximum and minimum loaded processors. Table 3 displays the execution time averages of the KL-RB, KL-PM, SA and MFA heuristics. Table 4 is constructed for a better illustration of the overall performance of the MFA algorithm in comparison with the KL and SA heuristics. For each problem instance, the results given in Tables 1, 2 and 3 are normalized with respect to the results of the MFA algorithm. The averages of the normalized results of Table 1, Table 2 and Table 3 constitute the first, second and fourth rows of Table 4, respectively. The average solution quality for each algorithm is computed using [...]
The third row of Table 4 illustrates the solution quality value of each algorithm normalized with respect [...]
As is seen in Tables 1, 2 and 4, the quality of the solutions obtained by the MFA and SA heuristics is superior to that of the KL-RB and KL-PM heuristics. Solutions produced by SA are slightly better than those produced by MFA, whereas the MFA algorithm is significantly faster (23 times on the average). As is seen in Tables 3 and 4, the average execution time of the MFA algorithm is comparable with that of the efficient KL heuristic. The MFA algorithm is 2.8 times faster than the KL-PM heuristic and 2.5 times slower than the KL-RB heuristic on the average. These results indicate that the proposed MFA algorithm is a promising alternative heuristic for solving the mapping problem.
Parallelization of Mean Field Annealing Algorithm
As mentioned earlier, the heuristic used for solving the mapping problem is a preprocessing overhead introduced for the efficient implementation of a given parallel program on the target multicomputer. If the mapping heuristic is implemented sequentially, this preprocessing can be considered part of the serial portion of the parallel program, which limits the maximum efficiency of the parallel program on the target machine. For a fixed parallel program instance, the execution time of the parallel program is expected to decrease with an increasing number of processors in the target multicomputer. However, as is seen in Table 3, for a fixed TIG, the execution time of all mapping heuristics increases with an increasing number of processors in the target multicomputer. Hence, the serial fraction of the parallel program will increase with an increasing number of processors. Thus, this preprocessing will begin to constitute a drastic limit on the maximum efficiency of the overall parallelization due to Amdahl's Law. Hence, parallelization of these mapping heuristics on the target multicomputer is a crucial issue for efficient parallel implementations.
Unfortunately, parallelization of the mapping heuristics introduces another mapping problem. The computations of the mapping heuristics should be mapped to the processors of the same target architecture. However, in this case, the parallel algorithm for the mapping heuristic should be such that its mapping can be achieved intuitively. Furthermore, the intuitive mapping should lead to an efficient parallel implementation of the mapping heuristic. For these reasons, the target mapping heuristic to be parallelized should involve regular and inherently parallel computations. The MFA algorithm proposed in Section 4 for the general mapping problem has such properties for an efficient parallelization. The following paragraphs discuss the efficient parallelization of the proposed mapping heuristic for multicomputers.
Assume that the MFA algorithm is used to map a given parallel program, represented with a TIG having $N$ vertices, onto a target multicomputer with $K$ processors. The MFA algorithm will use an $N \times K$ spin matrix for the mapping operation. The question is how to map the computations of the MFA algorithm to the same target multicomputer (with the same number of $K$ processors). As mentioned earlier, the MFA algorithm is an iterative algorithm. Hence, the mapping scheme can be devised by analyzing the computations involved in a particular iteration of the algorithm. An atomic task can be considered as the computations required for updating an individual spin. Note that $K$ spin averages at a particular row of the spin matrix are updated at each iteration. Hence, these $K$ spin updates can be computed in parallel by mapping each spin in a row of the spin matrix to a distinct processor of the target architecture.
Thus, the $N \times K$ spin matrix is partitioned column-wise such that each processor is assigned an individual column of the spin matrix. That is, column $p$ of the spin matrix is mapped to processor $p$ of the target architecture. Each processor is responsible for maintaining and updating the spin values in its local column. Assume that task $i$ is selected at random in a particular iteration. Then, each processor is responsible for updating the probability of task $i$ being mapped to itself.
A single iteration of the MFA algorithm can be considered as a three-phase process, namely, the mean field computation phase, the spin update phase, and the energy difference computation phase.
Each processor $p$ should compute its local mean field value $\phi_{ip}$ (Eq. (9)) [...] Hence, the proposed parallel MFA algorithm necessitates three global communication operations, due to the GCOL operation involved during the first phase and the two GSUM operations involved in the second and third phases. In fine grain multicomputers, the volume of interprocessor communication is the important factor in predicting the complexity of the interprocessor communication overhead. However, in medium grain multicomputers, the number of communications is also important, since a high set-up time overhead is associated with each communication step. The set-up time is the dominating factor for short messages in such architectures. Note that only a single floating-point variable, representing the running sum, is communicated during the GSUM operations involved in the last two phases of the parallel MFA algorithm.
Reducing the number of GSUM operations required in the MFA algorithm will be a valuable asset in achieving efficient implementations on medium grain multicomputers. As seen in Eq. (13) [...] scheme reduces the number of GSUM operations from two to one. Three floating-point variables, representing the running sums $A_i$, $B_i$, and $C_i$, are communicated during the communications involved in the GSUM operation.
The node program (of processor $p$, for $1 \le p \le K$) for a single iteration of the parallel MFA algorithm proposed for solving the mapping problem is given in Figure 4. Note that variables with "ip" and "p" subscripts denote the local variables. Variables with "i" subscripts denote the global variables which are constructed and duplicated at the local memory of each processor after performing the indicated global operations. As is seen in Figure 4, the proposed parallel MFA algorithm exhibits a very regular computational structure even for mapping arbitrarily irregular TIGs. The communication structure is also very regular, since it necessitates only GSUM and GCOL operations. Hence, the proposed parallel MFA algorithm can easily be implemented on both MIMD and SIMD types of multicomputers. [...] is proportional to the diameter and the number of processors ($K$) of the POG for GSUM and GCOL operations, respectively.
As is seen in Figure 4, the proposed parallel MFA algorithm achieves perfect load balance. The parallel computational complexity of a single MFA iteration can be obtained as follows. [...], whereas the computational granularity per processor increases with the number of processors ($K$) of the POG. Hence, the percent communication overhead will reduce with an increasing number of processors. Thus, the proposed parallel algorithm is expected to scale even on medium-to-coarse grain multicomputers.
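The three-phase iteration with column-wise partitioning can be illustrated by a sequential Python simulation in which each list in `columns` plays the role of one processor's local memory. This is a sketch under several assumptions: the mean field and row normalization take the forms implied by the text, GCOL is modeled as an all-gather of one value per processor, GSUM as a global sum, and the energy difference follows Eq. (13). The merged single-GSUM variant with the sums $A_i$, $B_i$, $C_i$ is not reproduced, and no overflow protection is applied to the exponentials.

```python
import math


def parallel_iteration(i, columns, e, d, w, r, temperature):
    """Simulate one spin-row update; columns[p][j] holds s_jp on processor p.

    Returns the energy difference sum_p phi_ip * (s_ip_new - s_ip_old).
    """
    k, n = len(columns), len(columns[0])
    others = [j for j in range(n) if j != i]

    # Phase 1: each processor forms its local partial sum, then GCOL makes
    # the K partial sums available on every processor.
    psi_local = [sum(e[i][j] * columns[p][j] for j in others) for p in range(k)]
    psi = list(psi_local)  # simulated GCOL (all-gather)
    phi = []
    for p in range(k):
        load = w[i] * sum(w[j] * columns[p][j] for j in others)
        comm = sum(d[p][q] * psi[q] for q in range(k) if q != p)
        phi.append(-(comm + r * load))

    # Phase 2: local exponentials, one GSUM for the normalization constant.
    local_exp = [math.exp(phi[p] / temperature) for p in range(k)]
    total = sum(local_exp)  # simulated GSUM
    new_spins = [x / total for x in local_exp]

    # Phase 3: local energy contributions, one GSUM for the total change.
    delta_h = sum(phi[p] * (new_spins[p] - columns[p][i]) for p in range(k))
    for p in range(k):
        columns[p][i] = new_spins[p]  # each processor updates its own column
    return delta_h


# Hypothetical 2-task, 2-processor instance; task 1 is fixed on processor 0.
columns = [[0.5, 1.0], [0.5, 0.0]]
print(parallel_iteration(0, columns, [[0, 3], [3, 0]], [[0, 1], [1, 0]],
                         [1, 1], 1.0, 1.0))
print(columns)
```

Every simulated processor performs the same amount of local work per iteration regardless of the TIG's structure, which reflects the regularity and load balance claimed for the parallel algorithm.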
Conclusion
In this paper, the recently proposed Mean Field Annealing (MFA) algorithm is formulated for the mapping problem. An efficient implementation scheme is also developed for the proposed algorithm. The performance of the proposed algorithm is evaluated in comparison with two well-known heuristics, Simulated Annealing (SA) and Kernighan-Lin (KL), for a number of randomly generated mapping problem instances. The quality of the solutions obtained by the MFA and SA heuristics is found to be superior to that of the solutions obtained by the KL heuristic. The execution time of the MFA algorithm is comparable to that of the efficient KL heuristic. The SA heuristic produces slightly better solutions than the MFA algorithm, whereas MFA is significantly faster. An efficient parallel algorithm is also developed for the proposed MFA heuristic.
