The most important issue in sequential program parallelisation is the efficient assignment of computations into different processing elements. In the past, too many approaches were devoted in efficient program parallelization considering various models for the parallel programs and the target architectures. The most widely used parallelism description model is the task graph model with precedence constraints. Nevertheless, as far as physical mapping of tasks onto parallel architectures is concerned, little research has given practical results. It is well known that the physical mapping problem is NP-hard in the strong sense, thus allowing only for heuristic approaches. Most researchers or tool programmers use exhaustive algorithms, or the classical method of simulated annealing. This paper presents an alternative approach onto the mapping problem. Given the graph of clustered tasks, and the graph of the target distributed architecture, our heuristic finds a mapping by first placing the highly communicative tasks on adjacent nodes of the processor network. Once these «backbone» tasks are mapped, there is no backtracking, thus achieving low complexity. Therefore, the remaining tasks are placed beginning from those close to the «backbone» tasks. The paper concludes with performance and comparison results which reveal the method's efficiency.
Introduction
In the last years the evolution in the fields of VLSI technology and computer networking has given raise to the utilization of distributed computing systems. Distributed computers are attractive to many demanding applications since they provide the user with modularity, scalability and low cost decentralized processing power. Nevertheless distributed computing has a sound drawback which discourages application developers to use it. The existence of many parallel processors is not fully exploited because of the interprocessor communication overhead. The degradation of optimal speedup when the number of processors increases is caused by the excessive sometimes amount of messages between non-neighboring cells.
Researchers have already focused on such lowperformance problems and proposed various methodologies to the optimal parallelization of the given programs. Experience has shown that an effective solution to the general task graph scheduling problem onto a given architecture is a multistep approach which includes the task graph description of the problem, efficient scheduling of the tasks onto a virtual fully connected parallel machine, merging tasks into larger clusters and finally assigning clusters into the physical processor topology. As far as the task graph scheduling with precedence constraints is concerned, the most widely used model is the directed acyclic graph with edges and nodes having various weights. Node weights represent computation time for the corresponding tasks, while edge weights are communication requirements between tasks. Each edge direction gives a set of precedence constraints which should be preserved. The general task graph scheduling problem with communication delays is NP-COMPLETE as presented in [9] . Sarkar [10] , Gerasoulis and Yang [3] presented good heuristics for solving the task graph scheduling problem with arbitrary communication costs.
All previous researchers were supposing unbounded or bounded number of processors with CLIQUE topology. This hypothetical topology is far away from the realistic case. Modern distributed memory parallel systems are organized in different topologies, including hypercube, mesh or ring architectures. There is a strong need for efficient placement of clustered tasks into the real processor network. The mapping of virtual into real topologies is called the physical mapping problem or the task allocation problem.
The task allocation problem can be optimally solved in special cases such as two-processor distributed systems, or linear array of any number of processors. If the target architecture contains two processors, then the task allocation problem is stated as a maximum flow minimum cut problem [11] which can be polynomially solved using, for example, the Ford-Fulkerson algorithm. There also exists a heuristic presented [7] , which addresses the general m-processor problem using the 2-way min cut algorithm m times.
Most of the theoretical work on mappings considers structured graphs like grids, hypercubes, trees, etc [8] . An increasing number of applications demand methods dealing with irregular graphs. The general mapping problem is unfortunately NP-complete, thus allowing only for efficient heuristics.
Our paper deals with the intractable problem of cluster allocation onto parallel architectures, by proposing an alternative approach to the above mentioned methodologies. We model any target parallel architecture by a non-directed graph with unitary edge weights only. This representation retains the meaningful only details of the target. Next, the graph of tasks is depicted as a nondirected graph with various edge costs and node weights resulting from clustering heuristics as the one proposed by Gerasoulis in [3] . The task assignment procedure starts by tracing the clusters which are most likely to be placed in neighboring places, where the maximum number of links is available. This step ends by placing these backbone clusters into these places. For each of the backbone clusters, the algorithm creates sets of neighboring clusters which are candidate for placement into the adjacent cells of every preassigned backbone cluster. The algorithm ends when all clusters are assigned.
In the remainder of this paper the following are presented and further elaborated: Section II reviews the multistep approach of task graph scheduling with precedence constraints using a specific topology distributed architecture. Section III presents the proposed graph model for the parallel architecture and the utilized cost function for evaluating different physical mappings. In section IV we outline the steps of the proposed algorithm. Finally section V the proposed algorithm is compared to other approaches in terms of efficiency and the outperforming results are shown.
The Multistep Approach
The general scheduling problem of an arbitrary task graph with communication delays onto a fixed size and connection pattern distributed architecture is NP-COMPLETE. El-Rewini et H. Ali in [1] , [2] proved this NP completeness by representing the problem of task allocation onto distributed system with a split graph. The task allocation is therefore equivalent to a weighted clique graph partitioning which is NP-COMPLETE, thus proving inherent intractability. In order to find efficient methods, researchers have followed a multistep approach, where each step addresses a limited instance of the general problem. The successive steps are outlined as follows: task clustering, cluster merging and physical mapping.
Task Clustering -Scheduling a task graph with
communication delays onto a bounded/unbounded CLIQUE of processors.
First the task graph with computation/communication costs and precedence constraints is scheduled onto a fully connected network of processors. The classical CLIQUE architecture is therefore used as a target having limited or unlimited number of processors [4] . Researchers, in this first step, propose algorithms which minimize the maximum makespan, disregarding the actual processor's topology. Even when the makespan metric is to be minimized, the scheduling problem remains NP-COMPLETE in the majority of general cases. Papadimitriou et Yannakakis in [9] have proved that the classical scheduling problem of a task graph with arbitrary communication and computation times is NP-COMPLETE. They proposed a 2-optimal approximation algorithm. In addition to this, Sarkar in [10], Gerasoulis in [3] proposed faster heuristics with acceptable performance. All these algorithms perform the same initial step: Cluster the tasks into large nodes, so that the grain of the parallelism is increased and the use of distributed processors is minimizing the task graph makespan. The resulting topology, by applying this initial step is a graph of clusters. Inside every cluster there exist several tasks which are time scheduled in the same processor and have zero intracommunication overhead. As far as clusters intercommunication is concerned, this is the summation over all tasks intercommunication overhead for all clusters. This organises the set of tasks into clusters with the following property: Each cluster contains all tasks which are to be executed on the same processor. This step is called task clustering.
Cluster Merging into p physical clusters
In this step, the set of clustered tasks is mapped onto a clique of bounded number of processors. Since the set of clustered tasks is larger than the number of available processors, this step assigns two or more clusters to the same processor. Sarkar in [10] has proposed a scheduling heuristic with O(|V|(|V|+|E|)) complexity, where |V| stands for the number of nodes and |E| for the number of edges. A lower complexity heuristic which is used in PYRROS [12] is the work profiling method. It merges clusters which have approximately the same arithmetic load.
Physical Mapping of p physical clusters onto p network connected processors.
While so much effort has been done concerning the first steps, the final step of cluster allocation onto the physical processor topology has not given considerable attention. Few researchers such as Bokhari have presented some heuristics. Many scheduling tools which such as OREGAMI or PYRROS or PARALLAX use heuristics or approximation algorithms which are sometimes efficient. For example Gerasoulis et Yang in Pyrros [12] use Bokhari's heuristic which is based onto simulated annealing. This algorithm starts from an initial assignment, then performs a series of pairwise interchanges by reducing the cost function and stops after O(p 3 ) steps.
In Oregami's [6] MAPPER, a tool for task allocation onto distributed architectures, a greedy heuristic is used, called the NN-Embed. It currently supports only mesh and hypercube processor networks thus limiting its potential use. Even in the case of those two architectures, it uses a rather simple method, by listing all edges in ascending order of their weights and assigning them to the processor's network edges.
The NN-Embed Algorithm
Given a task graph, it first constructs a list of all the edges in the graph, sorted by weight. The heuristic then traverses this list in linear time and for each edge, assign endpoints as follows:
• If both nodes have already been assigned, do nothing • If only one node has been assigned, then assign the other node to the closest free processor • If neither node has been assigned, randomly choose a free processor and assign one node to it and the other to its closest free neighbor. The sorting step needs O(|E|log|E|) time, and the rest is O(|E|). It is a fast heuristic, but it is limited to hypercube and mesh topologies.
Obviously, it has poor performance in many cases because it is based on single-edge adjacency, and it does not take into account any set of adjacent nodes. In Example 5, we apply both our method and the NN-Embed, and the outperforming results of the proposed heuristic are shown.
In PARALLAX [5] , all task allocation heuristics assume fully connected networks of processors. Only the mapping heuristic considers arbitrary processor interconnection topology. It uses a modified list scheduling technique with priorities, by taking into account the communication delays. The node's priority in the list is its level.
In [8] , Monien and Sudborough reviewed results on mapping specific task graphs into the most popular parallel architectures.
Graph Models
Our problem considers two graphs: The graph of virtual processors, or equivalently the graph of clustered tasks, defined as G c (V c , E c ) and the graph of physical processors, defined as G p (V p , E p ).
Definition 3.1: Consider two graphs and G w (V w , E w ) where V is the set of nodes and E is the set of edges with |V w | = |V p | , we define the following function F m :V w →V p as the physical mapping function :
∀
)∈E p Let us define some topological parameters for the processor graph/network:
• Hop is the unit distance between any two directly connected processors in the network • Distance between two processor nodes u and v is called the number of hops in the shortest path connecting u and v. Distance is represented by dist(u,v).
The following formula defines the cost function used to evaluate the different mappings:
Cost Function for a mapping F m :
where:
• dist( F m (v w ), F m (u w ) ) is the shortest path in the processor graph between the 2 processors-nodes where the tasks v w and u w are mapped.
• comm(v w , u w ) is the total communication cost between the tasks v w ,u w The following properties characterize the efficiency of a mapping from G w to G p :
Edge-dilation is the maximum distance in G p an edge of G w has to be routed. Formally:
Edge-congestion is the maximum number of edges from G w routed via an arbitrary edge (or node) of G p
In most theoretical works dealing with the physical mapping problem, minimizing edge dilation and edge congestion are the primary goals. Our objective is therefore to find the mapping F m which minimizes CF m , or formally: Obviously, F m1 is better than F m2 in terms of total completion time, since it imposes less communication overhead than F m2 . The least communication is achieved by mapping F m3 shown below: 
The Physical Mapping Algorithm (PMAP)
This heuristic tries to find the most communication intensive task-nodes and map them and their neighbors into neighboring processing nodes on the processor graph. At the beginning, the processor graph is analyzed and, for each node, the number of adjacent nodes is calculated. Subsequently, the nodes of the task graph are sorted by ascending order of their total communication weight and number of neighbors. The heuristic then places the most demanding task-nodes in terms of total communication links and cost to the respective nodes of the processor graph. Once these core task nodes are placed, there is no backtracking. Next, the heuristic places the adjacent of the core nodes to adjacent cells of the processor graph, by making locally best-fit comparisons. The algorithm performs the following steps:
First Phase:
1 st step: Adjust the maximum number of neighbors in the task-graph to the maximum number in the processor graph, by removing the less communication cost edges for each task-node.
2 nd step: Sort the nodes of the processor graph by ascending order of their neighboring links and the nodes in the cluster graph G c by the number of communication links they need.
3 rd step: Assign the most demanding task-node to the best suitable node of the processor graph and his neighboring nodes to the best suitable neighboring nodes of the processor graph.
Second Phase:
4 th step: Place back the edges removed at the first step. 5 th step: From the processors not allocated yet find the one who neighbors with the most processors already assigned. 6 th step: Find a task that is distanced by at most i (at the beginning i=1) from all the tasks that are mapped to the neighboring processors of the processor found on the previous step. If found assign this task to the processor of the 5 th step. 7 th step: Repeat steps 5-6 until no more assignments can be made. If there are more processors to be allocated, increase i by one and go back to the 5 th step. 
Details of the algorithm

Example
Consider the G c (V c , E c ) with c ij costs shown in figure  7 , and the processor graph G p (V p ,E p ) in figure 8: 
NN-Embed algorithm
Sorting of Edges in descending order of their weight: 
Performance Results
We report our experiments on the PMAP algorithm. For the sake of comparisons we have created a program which, given the task and the processor graphs, generates all possible mappings and calculates the total communication time based on the metric of section III. We have also created a random task graph generator with various communication-edge weights and number of tasksnodes and a graph generator based on fixed topologies. We have also applied the PMAP heuristic and the NN_Embed heuristic on the same graph pairs. The results obtained are shown in the table below. We can see that the PMAP algorithm gives better results than the NN_Embed algorithm, especially when the task and processor graphs have fixed topologies. But even when the graphs have been randomly created PMAP outperforms NN-Embed by more than 10%. 
Related Work
There are not many systems which tackle with the problem of task allocation on specific target architectures, thus solving the physical mapping problem. PYRROS, developed by Yang and Gerasoulis, uses the heuristic of Bokhari which is based on simulation annealing. In OREGAMI [6] task allocation is performed in three steps: contraction of the task graph to a smaller graph, assignment of the contracted clusters of tasks to processors and finally routing of messages through the interconnection network to minimize contention. In OREGAMI, there exist canned algorithms for typical 1f 2d 3e 4b 5h 6g
7a 8c
interconnection patterns concerning the target parallel architectures, which have been developed a priori. If the interconnection pattern is not in the above library, the NN-Embed greedy heuristic is used. PARALLAX [5] , is another scheduling tool which contains a task allocation heuristic assuming a specific interconnection topology for the target architecture.
Conclusion
In this paper a new approach was presented for the physical mapping of task graphs into parallel architectures having arbitrary interconnection topologies. It is obvious the general problem is intractable thus allowing only for efficient heuristics. The proposed algorithm has a low complexity and gives very good results for graphs with equilibrated communication loads between adjacent tasks. This new algorithm was implemented and tested, and the outperforming results were reported. Future work includes the introduction of additional criteria which would increase the heuristic's performance as far as noncounterbalanced graphs are concerned.
Acknowledgements
