Abstract. The two-way stripes partition mapping and the greedy assignment mapping are proposed to map finite element graphs composed of a number of rectilinear four-node elements on hypercubes. The two-way stripes partition mapping is a two-phase mapping approach. In the first phase a two-way stripes partition heuristic is used to lower the communication cost. In the second phase the load transfer heuristic is used to balance the computational load among processors. The greedy assignment mapping tries to minimize the communication cost and balance the computational load of processors simultaneously. Our simulation results show that the speedups for the two-way stripes partition mapping are better than those for the greedy assignment mapping when the load balancing criterion is achieved in both approaches (that is, the number of nodes in each processor is at most one more than the number of nodes in any other processor). However, the greedy approach performs well at a much lower cost.
Introduction
In parallel computing it is important to map a parallel program on a parallel computer such that the total execution time of the parallel program is minimized. In general, a parallel program and a parallel computer can be represented by a task graph and a processor graph, respectively. For a task graph, nodes represent tasks of a parallel program and edges denote the data communication needed between tasks. The weights associated with the nodes and edges represent the computational load and communication cost, respectively. For a processor graph, nodes and edges denote processors and communication channels, respectively. By using the graph model, we change the mapping problem into a task allocation problem.
In the task allocation problem we try to distribute the computational load of a parallel program to the processors of a parallel computer as evenly as possible (the load balance criterion) and minimize the communication cost of processors (the minimum communication cost criterion). The optimal assignment of tasks to processors in order to minimize the total execution time is known to be NP-complete [Garey and Johnson 1979]. No polynomial-time algorithm for finding the optimal solution is known, so satisfactory suboptimal solutions are generally sought.
In this paper we discuss how to map finite element graphs (FEGs) on hypercubes. Our schemes are general and are applicable to a wide variety of processor graphs. The finite element method is widely used for the structural modeling of physical systems [Lapidus and Pinder 1983]. Due to the properties of compute-intensiveness and compute-locality, it is very attractive to implement this method on parallel computers [Aykanat et al. 1987; Berger and Bokhari 1987; Bokhari 1981; Jordan 1978; Sadayappan and Ercal 1987]. The number of nodes in an FEG is usually greater than the number of processors in a parallel computer. It is important to partition an FEG into M modules, where M is the number of processors of a parallel computer, such that the computational load of modules is equal and the communication cost among modules is minimized.

*The work of this author was supported in part by NSF under contract CCR-9110812.
In [Berger and Bokhari 1987] binary decomposition was used to partition a nonuniform mesh graph (a kind of FEG) into modules such that each module has the same computational load. These modules were then mapped on meshes, trees, and hypercubes. This method does not try to minimize the communication cost. Sadayappan and Ercal [1987] proposed nearest-neighbor mapping to map planar FEGs on processor meshes. This approach used the stripes partition (stripes mapping) strategy to minimize the communication cost among processors and then used the boundary refinement heuristic to balance the computational load among processors. All of the FEGs used by these mapping approaches are planar graphs that consist of a number of rectilinear four-node elements.
We propose two mapping approaches, the two-way stripes partition mapping and the greedy assignment mapping, which can be applied to FEGs that are composed of a number of rectilinear four-node elements (not restricted to planar graphs). The two-way stripes partition mapping tries to minimize the communication cost by assigning a node and its neighbor nodes of an FEG to the same processor or neighbor processors of a hypercube (the definitions of neighbor node and neighbor processor will be given later). Since the computational load may not be assigned equally to each processor by this approach, the load transfer heuristic is used to balance the computational load among processors. The greedy assignment mapping tries to minimize the communication cost and balance the computational load simultaneously. It assigns one node of an FEG to a particular processor of a hypercube at a time according to the current status of the neighbor nodes of that node.
In this paper we assume that an FEG consists of a number of rectilinear four-node finite elements (for an example, see Figure 1). Many applications use FEGs with each finite element having more than four nodes. The techniques developed in this paper can be easily extended to such FEGs. However, we do not present any results for the performance of our mapping heuristics on such FEGs.
Let E, F, and N be the number of edges, the number of finite elements, and the number of nodes in a finite element graph, respectively. For example, the FEG shown in Figure 1 has N = 40, F = 25, and E = 64. For our analysis we assume that each node is part of at most k finite elements, where k is a constant. Thus, the number of nodes with which each node has to communicate is bounded above by a constant. With the above assumption, E is O(N). F is also O(N): each finite element has four nodes and each node belongs to at most k finite elements, so 4F ≤ kN. It should be noted that the above assumption is only necessary for determining the computational complexity of our mapping heuristics.
The computational complexities of the two-way stripes partition mapping and the greedy assignment mapping under the above assumption are $O(N^2 \log^2 M)$ and $O(N \log^2 M + N \log N)$, respectively, where M is the number of processors of a hypercube and N is the number of nodes of an FEG. Our simulation results show that the speedups for the two-way stripes partition mapping are better than those for the greedy assignment mapping when the load balancing criterion is achieved in both approaches (that is, the number of nodes in each processor is at most one more than the number of nodes in any other processor). However, the greedy approach performs well at a much lower cost.

This paper is organized as follows. Section 2 introduces the definitions and notations used in this paper. The computational model of mapping an FEG on a hypercube is described in Section 3. The two-way stripes partition mapping and the greedy assignment mapping are addressed in Sections 4 and 5, respectively. In Section 6 we compare the mapping results of these two approaches.
Preliminaries

Hypercubes
Hypercubes or n-cubes are highly concurrent, loosely coupled multiprocessors based on the binary n-cube network and are referred to by different names, such as the Cosmic Cube [Seitz 1985], n-cube [Hayes and Mudge 1986], and the binary n-cube [Bhuyan and Agrawal 1984].

DEFINITION 1. An n-dimensional hypercube $Q_n$, for n > 1, can be recursively defined in terms of the graph product × as follows [Harary 1969]:

$$Q_n = K_2 \times Q_{n-1} \tag{1}$$
where K2 = Q1 is the complete two-node graph.
From Definition 1 we know that an n-dimensional hypercube consists of $2^n$ processors. The address of each processor can be represented by an n-bit binary number ranging from 0 to $2^n - 1$.

DEFINITION 2. In an n-cube two processors, $p_x$ and $p_y$, are adjacent if the address of $p_x$ differs from that of $p_y$ by one bit.
In Figure 2 n-dimensional hypercubes are shown, for n = 1, 2, and 3. We use M to denote the total number of processors of a hypercube throughout this paper.
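Definition 2 can be illustrated with a few lines of Python. This is only a sketch: processor addresses are assumed to be stored as plain integers, and the helper names `adjacent_processors` and `are_adjacent` are hypothetical, not part of the paper.

```python
def adjacent_processors(p, n):
    """Addresses adjacent to processor p in an n-cube: flip each of the n bits once."""
    return [p ^ (1 << bit) for bit in range(n)]


def are_adjacent(px, py):
    """True when the two addresses differ in exactly one bit (Definition 2)."""
    return bin(px ^ py).count("1") == 1


# In a 3-cube, processor 000 is adjacent to 001, 010, and 100.
print(adjacent_processors(0, 3))
```

Every processor of an n-cube has exactly n adjacent processors, which is why the flip loop runs over n bit positions.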
Finite Element Graphs
The finite element method is widely used to solve partial differential equations by using either a direct approach or an iterative approach. In the finite element model an object can be viewed as an FEG, which is a connected and undirected graph that consists of a number of finite elements. In this paper we assume that an FEG consists of a number of rectilinear four-node finite elements. Many applications use FEGs with each finite element having more than four nodes. However, the techniques developed in this paper can be easily extended to such FEGs.
DEFINITION 3. In an FEG two nodes, node(x) and node(y), are adjacent if (node(x), node(y)) is an edge of the FEG.

DEFINITION 4. In an FEG two nodes, node(x) and node(y), are neighbors if node(x) and node(y) are in the same finite element.
In Figure 1a, for example, a 40-node FEG consisting of 25 finite elements is shown.
Let FE(x) denote the set of nodes that form finite element x, ADJ(node(y)) denote the set of adjacent nodes of node(y), NB(node(y)) denote the set of neighbor nodes of node(y), and #(NB(node(y))) denote the cardinality of NB(node(y)), that is, the number of nodes in NB(node(y)). We have FE(6) = {node(7), node(8), node(14), node(15)}, ADJ(node(14)) = {node(7), node(13), node(15), node(19)}, NB(node(14)) = {node(6), node(7), node(8), node(13), node(15), node(18), node(19), node(20)}, and #(NB(node(14))) = 8. Note that ADJ(node(y)) ⊆ NB(node(y)).

Figure 2. An example of n-cubes, for n = 1, 2, and 3.

For our analysis we assume that each node is part of at most k finite elements, where k is a constant. This assumption is true for almost all FEGs. Thus, the number of nodes with which each node has to communicate is bounded above by a constant. With the above assumption, E is O(N). F is also O(N): each finite element has four nodes and each node belongs to at most k finite elements, so 4F ≤ kN. This assumption also implies that, for each node, node(y), of an FEG, #(ADJ(node(y))) and #(NB(node(y))) are bounded above by a constant. The above properties for E, F, #(ADJ(node(y))), and #(NB(node(y))) will also be valid for FEGs with each finite element having more than four nodes. However, the constants in the big-O notation will be larger.
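The sets ADJ and NB of Definitions 3 and 4 can be computed directly from a list of finite elements. The following is a minimal sketch under stated assumptions: each element is given as a 4-tuple of node numbers listed in cyclic order around the element, so that consecutive corners share an edge. The function name `build_adj_nb` and this input format are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def build_adj_nb(elements):
    """Return (ADJ, NB) as dicts mapping a node to a set of nodes.

    elements: iterable of 4-tuples of node ids, corners in cyclic order (assumed).
    """
    adj = defaultdict(set)
    nb = defaultdict(set)
    for corners in elements:
        for idx, node in enumerate(corners):
            nxt = corners[(idx + 1) % 4]          # next corner along the element boundary
            adj[node].add(nxt)                    # consecutive corners share an edge
            adj[nxt].add(node)
            nb[node].update(c for c in corners if c != node)  # same element => neighbors
    return adj, nb


# A 3 x 3 grid of nodes (numbered 1..9 row by row) forming four elements.
elements = [(1, 2, 5, 4), (2, 3, 6, 5), (4, 5, 8, 7), (5, 6, 9, 8)]
adj, nb = build_adj_nb(elements)
print(sorted(adj[5]), sorted(nb[5]))
```

Because every node belongs to at most k elements, each set holds at most a constant number of nodes, which is the property the complexity analysis relies on.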
The Computational Model of Mapping FEGs on Hypercubes
In the context of parallelizing a finite element modeling program that uses iterative techniques to solve system equations [Aykanat et al. 1987], the parallel program may be viewed as a collection of processes or tasks represented by the nodes of an FEG. Each node represents a particular amount of computation and can be executed independently. In this paper we assume that each processor goes through a computation phase followed by a communication phase. In general, it is possible to overlap communication with computation; however, we restrict ourselves to the previous assumption. The communication needed between the nodes in the FEG of Figure 1a is shown in Figure 1b. We use N to denote the number of nodes of an FEG throughout this paper.
To map an N-node FEG on an M-processor hypercube, we need to assign the nodes of an FEG to the processors of a hypercube. There are $M^N$ possible mappings. The total execution time of an FEG on a hypercube under a particular mapping $MAP_i$ is defined as follows:

$$T_{par}(MAP_i) = \max_j\{load_i(p_j)\} \times T_{task} + C_i(P) \tag{2}$$
where $T_{par}(MAP_i)$ is the total execution time, $load_i(p_j)$ is the computational load assigned to processor $p_j$, $T_{task}$ is the time needed to execute a task on a processor, and $C_i(P)$ is the communication cost of processors under mapping $MAP_i$, where $i = 1, \ldots, M^N$ and $j = 0, \ldots, M - 1$. The computational load assigned to each processor of a hypercube is equal to the number of the nodes of an FEG assigned to it. The processor with the maximal computational load determines the computational cost of the mapping. The first part (of the right-hand side) of Equation 2 reflects this cost. The second part of the right-hand side reflects the communication cost. This cost is estimated by assuming a synchronous mode of communication in which each processor goes through a computation phase followed by a communication phase. The communication cost $C_i(P)$ is defined as follows:

$$C_i(P) = \sum_{j=1}^{S} \left( T_{setup} + \max_j\{c_{kl}\} \times T_c \right) \tag{3}$$
where S is the number of steps to finish the data communication among processors, $T_{setup}$ is the setup time of the I/O channel, $\max_j\{c_{kl}\}$ is the maximal amount of data sent from $p_k$ to $p_l$ in step j, and $T_c$ is the data transmission time of the I/O channel per byte.
Let $T_{seq}$ denote the total execution time of an FEG on a 0-cube that contains only one processor. The speedup of a mapping $MAP_i$ is defined as

$$SpeedUp(MAP_i) = \frac{T_{seq}}{T_{par}(MAP_i)} \tag{4}$$
The objective of mapping an FEG on a hypercube is to minimize the total execution time, that is, $\min\{T_{par}(MAP_i)\}$, or maximize the speedup, that is, $\max\{SpeedUp(MAP_i)\}$, where $i = 1, 2, \ldots, M^N$. From Equation 2 we know that the processor with the maximal computational load and the communication cost of processors determine the total execution time of an FEG on a hypercube under a particular mapping. Since our main objective is to minimize these quantities, we can do this in three ways: (1) first minimize the communication cost, then balance the computational load; (2) first balance the computational load, then minimize the communication cost; and (3) minimize the communication cost and balance the computational load simultaneously.
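The cost model can be made concrete with a short calculation. The sketch below assumes that the execution time is the maximal processor load times the task time plus the communication cost, that the communication cost adds one channel setup and one transmission term per communication step, and that the sequential time is N times the task time. The function names and the sample numbers are illustrative assumptions, not results from the paper.

```python
def comm_cost(step_max_data, t_setup, t_c):
    """Communication cost: one setup plus the largest transfer of each step."""
    return sum(t_setup + amount * t_c for amount in step_max_data)


def exec_time(loads, step_max_data, t_task, t_setup, t_c):
    """Parallel execution time: slowest processor's computation plus communication."""
    return max(loads) * t_task + comm_cost(step_max_data, t_setup, t_c)


def speedup(n_nodes, loads, step_max_data, t_task, t_setup, t_c):
    """Sequential time on a single processor divided by the parallel time."""
    t_seq = n_nodes * t_task
    return t_seq / exec_time(loads, step_max_data, t_task, t_setup, t_c)


# Hypothetical case: 40 nodes spread evenly over 8 processors, two communication
# steps whose largest transfers are 2 and 1 units; times in microseconds.
print(speedup(40, [5] * 8, [2, 1], t_task=119, t_setup=115, t_c=1))
```

The example makes the trade-off visible: with a perfectly balanced load the speedup still falls short of 8 because the two setup times are comparable to the time of one task.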
According to [Aykanat et al. 1987], a desirable mapping would produce a minimal number of communications per processor and balance the load among processors. If we assign the four nodes of a finite element to different processors of a hypercube, there exists at least one pair of nodes in a finite element such that the communication distance of this pair of nodes in a hypercube is greater than or equal to two. To achieve the minimum communication cost criterion, the communication distance should be less than or equal to two. Therefore, in this paper we consider only those mappings in which the communication distance between neighbor nodes of an FEG in a hypercube is less than or equal to two.

DEFINITION 5. In an n-cube any two processors whose addresses differ by at most two bits are neighbor processors.

DEFINITION 6. A mapping is a neighbor mapping if any two neighbor nodes (nodes corresponding to a finite element) of an FEG are assigned to the same processor or two neighbor processors of a hypercube.
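Definitions 5 and 6 translate into a simple check on a given assignment. This is a sketch only: the neighbor sets are assumed to be supplied as a dictionary from node to set of nodes, processor addresses are assumed to be integers, and the names `hamming` and `is_neighbor_mapping` are hypothetical.

```python
def hamming(px, py):
    """Number of bit positions in which two processor addresses differ."""
    return bin(px ^ py).count("1")


def is_neighbor_mapping(nb, assignment):
    """True when every pair of neighbor nodes sits on processors at most 2 bits apart.

    nb: dict node -> set of neighbor nodes; assignment: dict node -> processor address.
    """
    return all(
        hamming(assignment[u], assignment[v]) <= 2
        for u, neighbors in nb.items()
        for v in neighbors
    )


# One finite element whose four nodes go to the four processors of a 2-cube.
nb = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(is_neighbor_mapping(nb, {0: 0b00, 1: 0b01, 2: 0b11, 3: 0b10}))
```

A distance of zero (same processor) passes the check as well, matching the "same processor or two neighbor processors" wording of Definition 6.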
An I/O channel between two adjacent processors, $p_i$ and $p_j$, of a hypercube is a bidirectional channel if $p_i$ and $p_j$ can send data to each other simultaneously; otherwise, it is a unidirectional channel. If the I/O channel used in a hypercube is bidirectional (the bidirectional communication model), the following algorithm, bidirectional_comm_cost, is used to compute the value of $C_i(P)$:
/* X is the intermediate processor matrix. For all $x_{ij} \in X$, if $p_i = a_{n-1} \ldots a_{k+1} a_k a_{k-1} \ldots a_0$ and $p_j = b_{n-1} \ldots b_{k+1} \bar{a}_k a_{k-1} \ldots a_0$, then $x_{ij} = a_{n-1} \ldots a_{k+1} \bar{a}_k a_{k-1} \ldots a_0$. */
1. Compute the communication cost matrix C according to a particular mapping;
$C_i(P) = 0$;
/* For the neighbor mapping, this loop is executed at most twice */
while (∃ [...]

For the example in Figure 3, S is equal to 2, $\max_1\{c_{kl}\} = c_{01} + c_{03} = c_{10} + c_{12} = c_{21} + c_{23} = c_{30} + c_{32} = 2$, and $\max_2\{c_{kl}\} = c_{02} = c_{13} = c_{20} = c_{31} = 1$. We can derive that $C_i(P) = (T_{setup} + 2T_c) + (T_{setup} + T_c) = 2T_{setup} + 3T_c$.
If the I/O channel used in a hypercube is unidirectional (the unidirectional communication model), algorithm unidirectional_comm_cost is used to compute the value of $C_i(P)$. The communication between processors in algorithm unidirectional_comm_cost has two phases, phase 0 and phase 1. Let the addresses of processors $p_i$, $p_l$, $p_m$, and $p_j$ be $a_{n-1} \ldots a_{k+1} a_k a_{k-1} \ldots a_0$, $a_{n-1} \ldots a_{k+1} \bar{a}_k a_{k-1} \ldots a_0$, $b_{n-1} \ldots b_{k+1} a_k a_{k-1} \ldots a_0$, and $b_{n-1} \ldots b_{k+1} \bar{a}_k a_{k-1} \ldots a_0$, respectively. This algorithm first computes the communication cost matrix C according to a particular mapping. Initially toggle = 0. In phase 0, if there exists a $c_{ij} > 0$ and $a_k$ = toggle, then set the direction of $channel_{il}$ from $p_i$ to $p_l$ and send data $c_{ij}$ from $p_i$ to $p_l$. In phase 1, if a channel between processors $p_m$ and $p_j$ is still available after phase 0 is performed, $c_{ji} > 0$, and $a_k$ = toggle, then set the direction of $channel_{jm}$ from $p_j$ to $p_m$ and send data $c_{ji}$ from $p_j$ to $p_m$. After phases 0 and 1 are performed, update C, set toggle = (toggle + 1) mod 2, and release all channels. Continue phases 0 and 1 until every $c_{ij}$ = 0. Algorithm unidirectional_comm_cost follows.
algorithm unidirectional_comm_cost(X)
/* X is the intermediate processor matrix. For all $x_{ij} \in X$, if $p_i = a_{n-1} \ldots a_{k+1} a_k a_{k-1} \ldots a_0$ and $p_j = b_{n-1} \ldots b_{k+1} \bar{a}_k a_{k-1} \ldots a_0$, then $x_{ij} = a_{n-1} \ldots a_{k+1} \bar{a}_k a_{k-1} \ldots a_0$. */
{ ∃ $c_{ij} > 0$, $0 \le i, j \le M - 1$, $p_i = a_{n-1} \ldots a_{k+1} a_k a_{k-1} \ldots a_0$ and $p_j = b_{n-1} \ldots b_{k+1} \bar{a}_k a_{k-1} \ldots a_0$, $x_{ij} = p_l = a_{n-1} \ldots a_{k+1} \bar{a}_k a_{k-1} \ldots a_0$, and $x_{ji} = p_m = b_{n-1} \ldots b_{k+1} a_k a_{k-1} \ldots a_0$, and $a_k$ = [...]
The Two-Way Stripes Partition Mapping
The two-way stripes partition mapping is a two-phase mapping approach. In the first phase (partition and allocation phase), the two-way stripes partition heuristic and the stripes merge are used to partition an N-node FEG into M modules, each module containing m tasks, where 0 < m ≤ N. These modules are assigned to processors by using the binary reflected Gray code. Since the computational load may not be equally assigned to each processor in this phase, we will try to balance the computational load among processors by using the load transfer heuristic in the second phase (the load balancing phase).
Phase 1: The Two-Way Stripes Partition and Stripes Allocation
The basic approach used in the two-way stripes partition mapping to partition an FEG into modules is stripes partitioning. It starts at an arbitrary node, node(x), of an FEG and labels it as 0. Next, the neighbor nodes of node(x), NB(node(x)), are labeled as 1. This process continues until each node in the FEG is assigned a label. This approach is more general than the stripes partition approach used in [Sadayappan and Ercal 1987], which can only be used to partition planar FEGs and has some restrictions. Our approach removes these restrictions and can also be used to partition nonplanar FEGs.

The two-way stripes partition uses the stripes partition method twice. For an N-node FEG there are $N^2$ ways to choose the two starting nodes. Trying all $N^2$ ways would make the complexity of the mapping algorithm high. Therefore, an efficient method is needed to select two starting nodes that lead to a good speedup. The purpose of the two-way stripes partition is to partition an FEG into blocks so that each block has the same number of nodes. If the two starting nodes are the same or adjacent, the result is a poor partition. Thus the two starting nodes must be far apart from each other. If a stripes partition is applied to an FEG and node(i) is the starting node, the FEG may ideally look like a diamond-shaped graph with node(i) as the top node. Then, the bottom node, node(j), is furthest from node(i). If node(i) and node(j) are the two starting nodes, it will still lead to a poor partition. A better selection would be node(i) and a node, node(k), with label = ⌊$L_1$/2⌋ or ⌈$L_1$/2⌉, that is, node(k) is halfway away from node(i), where $L_1$ is the label assigned to node(j). The simulation shows that, in general, using node(i) and node(k) as the starting nodes leads to a better speedup than using node(i) and node(j).
Since the number of nodes with label = ⌊$L_1$/2⌋ or ⌈$L_1$/2⌉ is proportional to N, if we try all the candidate nodes, the complexity of the mapping algorithms is still high. To reduce the complexity, we can randomly select a candidate node.
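The stripes partition described above amounts to a breadth-first labeling over the neighbor relation. The following sketch assumes the neighbor sets NB are available as a dictionary from node to set of nodes and that the FEG is connected; the names `stripes_partition` and `two_way_labels` are hypothetical.

```python
from collections import deque


def stripes_partition(nb, start):
    """Label the start node 0, its neighbors 1, their unlabeled neighbors 2, and so on."""
    label = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for other in nb[node]:
            if other not in label:
                label[other] = label[node] + 1
                queue.append(other)
    return label


def two_way_labels(nb, start1, start2):
    """Pair the labels produced by two stripes partitions with different start nodes."""
    first = stripes_partition(nb, start1)
    second = stripes_partition(nb, start2)
    return {node: (first[node], second[node]) for node in nb}


# Two elements side by side on a 2 x 3 grid of nodes numbered 1..6 row by row.
nb = {1: {2, 4, 5}, 2: {1, 3, 4, 5, 6}, 3: {2, 5, 6},
      4: {1, 2, 5}, 5: {1, 2, 3, 4, 6}, 6: {2, 3, 5}}
print(two_way_labels(nb, 1, 3))
```

Because labels grow by one level at a time over the neighbor relation, the labels of two neighbor nodes never differ by more than one within the same partition.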
In our simulation the nodes in an FEG are numbered from top to bottom and left to right (see Figure 1a). We choose node(1) and node(⌊N/2⌋ + 1) as the starting nodes; node(⌊N/2⌋ + 1) is approximately halfway away from node(1). It will be seen in Section 6 that our selection method can meet the load balancing criterion for many cases. By using the two-way stripes partition, we can denote the labels assigned to each node by a 2-tuple $(l_1, l_2)$, where $l_1$ and $l_2$ denote the labels assigned to a node by the first and second stripes partitions, respectively. The next step is to assign these nodes to processors according to their labels. By using the two-way stripes partition, the 2-tuple labels assigned to nodes imply the following lemma.

To assign nodes to processors according to their labels, we need to "flatten" an n-cube into two dimensions. For any two neighbor processors, processor($i_1$, $j_1$) and processor($i_2$, $j_2$), in a mesh, we have $|i_1 - i_2| \le 1$ and $|j_1 - j_2| \le 1$. To map an FEG on a mesh, the neighbor mapping can be easily achieved by assigning node(i) with labels $(l_{i1}, l_{i2})$ to processor($l_{i1}$, $l_{i2}$). Since an n-cube can emulate $1 \times 2^n$, $2 \times 2^{n-1}$, ..., $2^n \times 1$ meshes, we try all these cases. A binary reflected Gray code [Chan and Saad 1986] is defined as follows:

$$N_1 = (0, 1), \qquad N_{k+1} = (0N_k(0), \ldots, 0N_k(2^k - 1), 1N_k(2^k - 1), \ldots, 1N_k(0)) \tag{5}$$

where $N_k(i)$ is the i-th codeword of the k-bit code. A $2^x \times 2^y$ mesh is embedded in an (x + y)-cube by giving processor(i, j) of the mesh the address

Nx(i)^Ny(j)   (6)

where $0 \le i \le 2^x - 1$, $0 \le j \le 2^y - 1$, and ^ is the binary string concatenation operation.
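The Gray code embedding of a mesh in a hypercube can be sketched in a few lines. The code below assumes the standard closed form of the binary reflected Gray code, in which the codeword of i is i XOR (i shifted right by one bit), and represents addresses as bit strings; the function names `gray` and `mesh_to_cube` are hypothetical.

```python
def gray(i, bits):
    """The bits-wide binary reflected Gray codeword of i, as a string of 0s and 1s."""
    if bits == 0:
        return ""                      # a mesh side of length 1 contributes no bits
    return format(i ^ (i >> 1), "0{}b".format(bits))


def mesh_to_cube(i, j, x, y):
    """Address of processor(i, j) of a 2^x by 2^y mesh embedded in an (x + y)-cube."""
    return gray(i, x) + gray(j, y)     # string concatenation plays the role of ^


# The 2 x 4 mesh (x = 1, y = 2) embedded in a 3-cube.
for i in range(2):
    print([mesh_to_cube(i, j, 1, 2) for j in range(4)])
```

Consecutive Gray codewords differ in one bit, so horizontally or vertically adjacent mesh processors land on adjacent hypercube processors, and diagonal mesh neighbors end up two bits apart.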
An example of embedding a 2 × 4 mesh in a 3-cube by using Equation 6 is shown in Figure 5e. Here the addresses of processor(0, 2) and processor(1, 0) are N1(0)^N2(2) = 011 and N1(1)^N2(0) = 100, respectively. With the use of binary reflected Gray codes, the addresses of any two adjacent processors of a mesh differ by 1 bit and the addresses of any two neighbor processors of a mesh differ by at most 2 bits when the mesh is embedded in a hypercube.

Let $L_a^b$ represent the number of nodes whose labels are equal to b in the a-th stripes partition, where a = 1 or 2. Let $L_1$ and $L_2$ represent the largest label numbers assigned by the first and the second stripes partition, respectively. Assume that a $2^x \times 2^y$ mesh is embedded in an (x + y)-cube by using Equation 6. If $2^x - 1 < L_1$ ($2^y - 1 < L_2$), we merge the two adjacent stripes m and m + 1 (n and n + 1) that minimize $L_1^m + L_1^{m+1}$ ($L_2^n + L_2^{n+1}$) over all $m = 0, \ldots, L_1 - 1$ (all $n = 0, \ldots, L_2 - 1$). This merging process continues until $L_1 = 2^x - 1$ ($L_2 = 2^y - 1$). Minimizing $L_1^m + L_1^{m+1}$ ($L_2^n + L_2^{n+1}$) makes every stripe have an approximately equal number of nodes after merging, so that a better partition, and hence better load balancing, may be obtained. The computational complexity of this merge process is $O(N^2)$. After this merge processing, every node in an FEG is assigned new 2-tuple labels $(l_1', l_2')$, where $0 \le l_1' \le 2^x - 1$ and $0 \le l_2' \le 2^y - 1$. Then we assign nodes with new labels to processors of an (x + y)-cube according to the following equation:

node(i) is assigned to processor Nx($l_1'$)^Ny($l_2'$)   (7)

where the 2-tuple labels $(l_1', l_2')$ are the new labels assigned to node(i). An example of partitioning an FEG into stripes and assigning stripes to a 3-cube is shown in Figure 5a-g.

Figure 5. (a) The labels assigned to nodes by the first stripes partition. (b) The labels assigned to nodes by the second stripes partition.
(c) The 2-tuple labels assigned to nodes.

The algorithm of the two-way stripes partition and allocation is given below:
algorithm two_way_stripes_partition_allocation(row, col)
/* row and col denote the length and width of a mesh, respectively. */
1. Calculate the adjacent and neighbor nodes of each node in an FEG.
2. Apply the first stripes partition.
3. Apply the second stripes partition.
4. Merge stripes produced by the first and second stripes partition if necessary.
5. Assign nodes to processors according to their new labels by using Equation 7.
end_of_two_way_stripes_partition_allocation
Phase 2: Load Balancing
The objective of the load balance phase is to balance the computational load assigned to processors in the first phase while preserving the neighbor mapping property. 
Nx(0)^Ny(1).

This process continues until the balanced load for processor Nx($2^x - 1$)^Ny($2^y - 1$) has been computed.

Let load(Nx(i)^Ny(j)) denote the number of nodes assigned to Nx(i)^Ny(j), right(Nx(i)^Ny(j)) denote the right adjacent processor of Nx(i)^Ny(j), that is, Nx(i)^Ny(j + 1), and down(Nx(i)^Ny(j)) denote the down adjacent processor of Nx(i)^Ny(j), that is, Nx(i + 1)^Ny(j). Note that processors Nx(i)^Ny($2^y - 1$) and Nx($2^x - 1$)^Ny(j)
do
Rule 2: load(Nx(i)^Ny(j)) < N/M. If load(down(Nx(i)^Ny(j))) > load(right(Nx(i)^Ny(j))), then Nx(i)^Ny(j) must receive one node from down(Nx(i)^Ny(j)); otherwise, Nx(i)^Ny(j) must receive one node from right(Nx(i)^Ny(j)).

We update the load of processors and continue to apply rule 2 until load(Nx(i)^Ny(j)) = N/M. For those processors that do not have right or down processors, the load of their right or down processors is equal to −∞.
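Rule 2 can be sketched as a small loop. This is an illustration under stated assumptions only: the processor loads are held in a two-dimensional list indexed by the mesh position (i, j) rather than by Gray code address, the target load N/M is passed in as `target`, and the function name `apply_rule2` and the returned list of donors are hypothetical bookkeeping, not the paper's load transfer matrix.

```python
import math


def apply_rule2(load, i, j, target):
    """Pull one node at a time into processor (i, j) until its load reaches target.

    load is modified in place; the list of donor positions is returned.
    """
    rows, cols = len(load), len(load[0])
    donors = []
    while load[i][j] < target:
        # A missing down or right processor counts as load -infinity.
        down = load[i + 1][j] if i + 1 < rows else -math.inf
        right = load[i][j + 1] if j + 1 < cols else -math.inf
        if down == -math.inf and right == -math.inf:
            break                      # bottom-right corner: no processor to take from
        di, dj = (i + 1, j) if down > right else (i, j + 1)
        load[di][dj] -= 1              # update the loads after each single transfer
        load[i][j] += 1
        donors.append((di, dj))
    return donors


loads = [[1, 3], [4, 2]]
print(apply_rule2(loads, 0, 0, target=2), loads)
```

Treating a missing processor as having load −∞ means the comparison never selects it as long as the other candidate exists, which is the effect the text describes.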
(MN).
In the second step we transfer the load from one processor to another using load transfer matrix A. The algorithm proceeds iteratively, in an incremental manner, and is similar to that of [Sadayappan and Ercal 1987]. Initially, all the processors that must transfer nodes to other processors are put into a queue Q, and all the processors in Q are marked as active. There are two iterative loops. Let ND($p_i$) denote the set of nodes assigned to processor $p_i$. In the first iterative loop, for every active processor $p_i$ in Q, if $p_j$ must receive a node from $p_i$, we try to transfer a boundary node, node(x), of ND($p_i$) to $p_j$ while preserving the neighbor mapping property; that is, node(x) ∈ ND($p_i$), node(y) ∈ ND($p_j$), and node(x) ∈ NB(node(y)). In the second iterative loop, for every active processor $p_i$ in Q, if $p_j$ must receive a node from $p_i$, we try to transfer a node, node(x), of ND($p_i$) (not restricted to boundary nodes) to $p_j$ while preserving the neighbor mapping property.
These two iterative loops alternate until the load is balanced or further balancing is impossible.
Algorithm load_transfer is not always guaranteed to balance the computational load of processors. If the computational load of processors can be balanced by this algorithm, all the elements in A are equal to zero. In Figure 5h we show the mapping result of Figure 5g after the load transfer heuristic has been applied.
The two-way stripes partition mapping algorithm follows:

LEMMA 2. The two-way stripes partition mapping is a neighbor mapping.
Greedy Assignment Mapping
Greedy assignment mapping, which is a heuristic approach, assigns a node to a particular processor according to the current status of its neighbor nodes. Initially, it assigns node(a), which has the largest number of adjacent nodes in an FEG, to processor 0, and the adjacent nodes of node(a) are put into a queue Q. The node node(i) in Q that has the largest number of adjacent nodes is selected as the next node to be assigned. Let P(NB(node(i))) denote the set of processors to which the neighbor nodes of node(i) are assigned and P(POS(node(i))) denote the set of processors whose addresses differ from the address of each processor in P(NB(node(i))) by at most 2 bits. If P(POS(node(i))) is empty, the neighbor mapping is impossible for this approach; otherwise, node(i) is assigned to the processor $p_x$ ∈ P(POS(node(i))) with the smallest load, that is, load($p_x$) ≤ load($p_y$) for all $p_y$ ∈ P(POS(node(i))). The adjacent nodes of node(i) are then inserted in Q. This process continues until all the nodes are assigned or the neighbor mapping is impossible. The algorithm is given as follows.
while (Q ≠ ∅) do
{ node(i) = root(H(Q)); /* the node with the largest number of adjacent nodes in Q */
8. Compute P(POS(node(i))).
   if (P(POS(node(i))) = ∅)
   then stop ("The neighbor mapping is impossible");

Figure 6. Mapping an FEG on a hypercube by using the greedy assignment mapping.
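The greedy procedure described in this section can be sketched as follows. The sketch rests on several assumptions: node identifiers are integers, ADJ and NB are given as dictionaries from node to set of nodes, the FEG is connected, the queue is kept as a heap ordered by the number of adjacent nodes, and ties among equally loaded processors are broken by the smaller address. The names `within_two_bits` and `greedy_assignment` are hypothetical, and the code returns None where the algorithm would stop with "The neighbor mapping is impossible".

```python
import heapq
from itertools import combinations


def within_two_bits(p, n):
    """All n-bit addresses that differ from address p in at most two bits."""
    near = {p}
    near.update(p ^ (1 << b) for b in range(n))
    near.update(p ^ (1 << b1) ^ (1 << b2) for b1, b2 in combinations(range(n), 2))
    return near


def greedy_assignment(adj, nb, n):
    """Assign nodes to the processors of an n-cube; return None if no candidate exists."""
    load = [0] * (1 << n)
    start = max(adj, key=lambda v: len(adj[v]))   # node with the most adjacent nodes
    assigned = {start: 0}
    load[0] = 1
    heap, queued = [], {start}

    def enqueue(nodes):
        for v in nodes:
            if v not in queued:
                queued.add(v)
                heapq.heappush(heap, (-len(adj[v]), v))   # largest degree pops first

    enqueue(adj[start])
    while heap:
        _, node = heapq.heappop(heap)
        # P(POS): processors within 2 bits of every processor holding a neighbor.
        candidates = set(range(1 << n))
        for u in nb[node]:
            if u in assigned:
                candidates &= within_two_bits(assigned[u], n)
        if not candidates:
            return None
        target = min(candidates, key=lambda q: (load[q], q))  # least loaded candidate
        assigned[node] = target
        load[target] += 1
        enqueue(adj[node])
    return assigned


# One finite element mapped on a 2-cube.
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
nb = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(greedy_assignment(adj, nb, 2))
```

Enumerating the processors within two bits of an address touches on the order of n² addresses per assigned neighbor, and each heap operation costs a logarithmic factor in the number of nodes.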
Performance Evaluation and Simulation Results
The samples of FEGs tested in this paper consist of four planar graphs and three nonplanar graphs, which are shown in Figure 7a-d and Figure 7e-g, respectively. The number of nodes of these FEGs ranges from a few tens to a few hundreds. According to the communication models described in Section 3, we derive the estimated lower bound speedup (ELBS) and the estimated upper bound speedup (EUBS) for both the bidirectional and unidirectional communication models to measure our mapping results. They are given as follows:

The estimated upper and lower bound speedups are obtained by assuming that both the load balancing criterion and the neighbor mapping are achieved. If the load balancing criterion is achieved by a mapping, the item $\max\{load_i(p_j)\}$ in Equation 2 is equal to ⌈N/M⌉. If a mapping is a neighbor mapping, the best case of the communication cost is that any two neighbor nodes of an FEG are assigned to the same processor or two adjacent processors of a hypercube and every processor need only send two nodes' data to each of its adjacent processors (see Figure 8). According to the communication models described in Section 3, we can derive Equations 8.1 and 9.1.
If a mapping is a neighbor mapping, the worst case of the communication cost is that any two neighbor nodes of a finite element are assigned to two processors whose addresses differ by 2 bits in a hypercube. For the bidirectional communication model the maximal number of steps to finish the data communication among processors is equal to two. In step 1 a processor receives data from its adjacent processors and sends data to its neighbor processors simultaneously. The maximal amount of data sent by the processors is equal to log M × ⌈N/M⌉. In step 2 the maximal amount of data sent from a processor to its adjacent processors is equal to (log M − 1) × ⌈N/M⌉ (see Figure 3). Therefore, we can derive Equation 8.2.

A processor may receive (or send) data from (or to) its adjacent processors in step 3 and then send (or receive) data to (or from) its adjacent processors in step 4. The maximal amounts of data sent by the processors in steps 3 and 4 are both equal to log M × ⌈N/M⌉ − 1 (see Figure 4). Therefore, we can derive Equation 9.2.
In our simulation we make the following assumptions about the capabilities of the processors of a hypercube [Sadayappan and Ercal 1987]: $T_{task}$ is equal to 119 μs, $T_{setup}$ is equal to 115 μs, and $T_c$ is equal to 1 μs per byte. The simulation program was written in C and was run on a 386SX PC. For the two-way stripes partition mapping and the greedy assignment mapping the speedups for each test sample are simulated on a 3-cube, 4-cube, and 5-cube. In the simulation the running time of the mapping programs for each test sample on an n-cube is calculated by using the clock functions provided in the C library, where n = 3, 4, and 5 (see Table 1). The total execution time for each test sample on a 0-cube is equal to N × $T_{task}$. In the simulation the total execution time for every test sample on an n-cube is calculated by using Equations 2 and 3, where n = 3, 4, and 5. The speedup for every test sample on an n-cube is calculated by using Equation 4, where n = 3, 4, and 5.

As mentioned in Section 4.1, the selection of the two starting nodes for the two-way stripes partition is very important. The effectiveness of our selection method (node(1) and node(⌊N/2⌋ + 1)) is compared to that of, first, node(1) and the nodes with label = ⌊$L_1$/2⌋ or ⌈$L_1$/2⌉ (Method 1) and, second, node(1) and the nodes with label = $L_1$ (Method 2), where node(1) is the starting node of the first stripes partition and $L_1$ is the largest label number assigned by the first stripes partition. We did try all the candidate nodes in Method 1 and Method 2. For each test sample on an n-cube, where n = 3, 4, and 5, the speedups of our selection method, Method 1, and Method 2 are shown in Tables 2, 3, and 4, respectively. The following conclusions can be drawn from Tables 1-4.

1. The greedy assignment mapping, in general, can produce a good mapping at a low computation cost. This method is not restricted to hypercubes and can be applied to a wide variety of parallel architectures. It fails to preserve the neighbor mapping for sample 5 on the 4-cube and 5-cube. Since every node in sample 5 has the same number of adjacent nodes, it is difficult for this algorithm to determine which node is the best node to be assigned next. By including a failure recovery mechanism (such as allowing two neighbor nodes to be assigned to two processors whose addresses differ by more than 2 bits) in algorithm greedy_assignment_mapping, we can obtain the speedups shown in Table 2. The neighbor mapping property, however, will no longer be preserved.

2. For the cases when the load balancing criterion is achieved, the speedups for the two-way stripes partition mapping are better than those for the greedy assignment mapping. For example, for the 3-cube the speedups of the two-way stripes partition mapping and the greedy assignment mapping are close to the estimated upper bound speedups and the estimated lower bound speedups, respectively. The reason is that, by using algorithm [...] nodes is not so important because the load balancing criterion can be achieved by our selection method, Method 1, and Method 2 (see Tables 2-4). However, for the 5-cube the number of the test samples that meet the load balancing criterion using our selection method is greater than that using Method 2 and less than that using Method 1. Note that the complexity of the two-way stripes partition for Method 1 is O(N) times that of our selection method.
Conclusion
We have discussed two mapping approaches, the two-way stripes partition mapping and the greedy assignment mapping, to map FEGs on hypercubes. The two-way stripes partition mapping uses the stripes partition and binary reflected Gray codes allocation to achieve the minimum communication cost criterion and uses the load transfer heuristic to achieve the load balancing criterion. The greedy assignment mapping uses a greedy heuristic to achieve both the minimum communication cost criterion and load balancing criterion. The cost models of mapping an FEG on a hypercube are developed for the bidirectional and the unidirectional communication models. Four planar and three nonplanar FEGs are used as the test samples. To measure the mapping results, we derived the estimated upper and lower bound speedups for both communication models. The simulation results show that the speedups for the two-way stripes partition mapping are better than those for the greedy assignment mapping when the load balancing criterion is achieved in both approaches. However, the greedy approach performs well at a much lower cost.
