Mapping Finite Element Graphs on Hypercubes by Chung, Yeh-Ching & Ranka, Sanjay
Syracuse University 
SURFACE 
Electrical Engineering and Computer Science - 
Technical Reports College of Engineering and Computer Science 
1990 




Recommended Citation 
Chung, Yeh-Ching and Ranka, Sanjay, "Mapping Finite Element Graphs on Hypercubes" (1990). Electrical 
Engineering and Computer Science - Technical Reports. 95. 
https://surface.syr.edu/eecs_techreports/95 
School of Computer and Information Science 
4-116, CST Center for Science and Technology 
Syracuse University 
Syracuse, NY 13244-4100 
(315) 443-4457 
Mapping Finite Element Graphs on Hypercubes 
Yeh-Ching Chung and Sanjay Ranka
Abstract - The 2-way stripes partition mapping and the greedy assignment mapping are
proposed to map finite element graphs (FEGs) onto hypercubes. They can be used to map both
2-D and 3-D FEGs on hypercubes. The 2-way stripes partition mapping is a two-phase
mapping approach. In the first phase, a 2-way stripes partition approach is used to achieve
low communication cost. In the second phase, the load transfer heuristic is used to balance
the computational load among processors. The greedy assignment mapping tries to minimize
the communication cost and balance the computational load of processors simultaneously.
1. INTRODUCTION 
In parallel computing, it is important to map a parallel program onto a parallel computer 
such that the total execution time of a parallel program is minimized. In general, a parallel
program and a parallel computer can be represented by a task graph (TG) and a processor graph 
(PG), respectively. For a TG, nodes represent tasks of a parallel program and edges denote 
the data communication needed between tasks. The weights associated with nodes and edges 
represent the computational load and communication cost, respectively. For a PG, nodes and 
edges denote processors and communication channels, respectively. By using the graph 
model, the mapping problem becomes a task allocation problem. 
In the task allocation problem, we try to distribute the computational load of a parallel 
program to the processors of a parallel computer as evenly as possible (the load balance 
criterion (LBC)) and minimize the communication cost of processors (the minimum 
communication cost criterion (MCCC)). The optimal assignment of tasks to processors in 
order to minimize the total execution time is known to be NP-complete [GaJo79]. This means 
that the optimal solution is intractable. Therefore, satisfactory suboptimal solutions are 
generally sought. 
In this paper, we will discuss how to map finite element graphs (FEGs) onto hypercubes. 
Our schemes are general and are applicable to a wide variety of PGs. The finite element 
method (FEM) is a widely used method for the structural modeling of physical systems
[LaPi83]. Because the method is compute-intensive and exhibits computational locality, it is very
attractive to implement it on parallel computers [BeBo87] [Bokh81] [Jord78]
[SaEr87]. The number of nodes in a FEG is usually greater than the number of processors
in a parallel computer. It is therefore important to partition a FEG into M modules such that the
computational loads of the modules are equal and the communication cost among modules is
minimized, where M is the number of processors of a parallel computer.
In [BeBo87], a binary decomposition approach was used to partition a nonuniform mesh 
graph (a kind of FEG) into modules such that each module has the same computational load.
These modules were then mapped onto meshes, trees, and hypercubes. This method does not 
try to minimize the communication cost. [SaEr87] proposed the nearest-neighbor mapping 
approach to map planar FEGs onto meshes. It used the stripes partition (stripes mapping) 
strategy to minimize the communication cost among processors and then used the boundary 
refinement heuristic to balance the computational load among processors. All of the FEGs 
used by those mapping approaches are two dimensional graphs. They cannot be trivially 
extended to three dimensional FEGs. In a structural modeling system, most of the cases 
encountered are three dimensional FEGs. Therefore, it is important to show that a mapping 
approach can be applied to all kinds of FEGs. 
We propose two mapping approaches, the 2-way stripes partition mapping and the greedy 
assignment mapping, which can be applied to all kinds of FEGs. The 2-way stripes partition 
mapping tries to minimize the communication cost by assigning a node and its neighbor nodes 
of a FEG to the same processor or neighbor processors of a hypercube (the definitions of 
neighbor node and neighbor processor are given later). Since the computational load
may not be equally assigned to each processor by using this approach, the load transfer 
heuristic is used to balance the computational load among processors. The greedy assignment 
mapping tries to minimize the communication cost and balance the computational load 
simultaneously. It assigns one node of a FEG to a particular processor of a hypercube at a 
time according to the current status of the neighbor nodes of that node. 
In our analysis, we assume that the number of edges (E), the number of finite elements
(F), and the number of nodes (N) differ from each other by a multiplicative constant, i.e.,
$E = c_1 F = c_2 N$, for some constants $c_1$ and $c_2$. These assumptions are true for most FEGs.
The computational complexities of the 2-way stripes partition mapping and the greedy
assignment mapping are $O(MN^2 \log M)$ and $O(N \log^2 M + N \log N)$, respectively, where M is
the number of processors of a hypercube and N is the number of nodes of a FEG. Our
simulation results show that the speedups for the 2-way stripes partition mapping are better 
than those for the greedy assignment mapping when the LBC is achieved in both approaches. 
However, the greedy approach gives good performance at a much lower cost. 
This paper is organized as follows. Section 2 introduces the definitions and notations 
used in this paper. The cost models of mapping a FEG onto a hypercube are also described 
in this section. The 2-way stripes partition mapping and the greedy assignment mapping are 
addressed in Sections 3 and 4, respectively. In Section 5, we compare the mapping results of 
these two approaches. 
2. PRELIMINARIES 
2.1. Hypercubes 
Hypercubes or n-cubes are highly concurrent loosely coupled multiprocessors based on 
the binary n-cube network and are referred to by different names (such as cosmic cube [Seit85], 
n-cube [HaMu86], binary n-cube [BhAg84], etc.). 
Definition 1 : An n-dimensional hypercube $Q_n$, for n > 1, can be recursively defined in
terms of the graph product $\times$ as follows [Hara69]:
$$Q_n = K_2 \times Q_{n-1}, \tag{1}$$
where $K_2 = Q_1$ is the complete 2-node graph. •
From Definition 1, we know that an n-dimensional hypercube consists of $2^n$ processors.
The address of each processor can be represented by an n-bit binary number ranging from
0 to $2^n - 1$.
Definition 2 : In an n-cube, two processors $P_x$ and $P_y$ are adjacent processors if the address
of $P_x$ differs from that of $P_y$ by one bit. •
In Figure 1, n-dimensional hypercubes are shown, for n = 1, 2, and 3. We use symbol 
M to denote the total number of processors of a hypercube throughout this paper. 
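Definitions 1 and 2 translate directly into bit operations on processor addresses. The following is a minimal sketch, assuming addresses are stored as Python integers; the function names are ours and do not appear in the original report.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which addresses a and b differ."""
    return bin(a ^ b).count("1")


def are_adjacent(a: int, b: int) -> bool:
    """Definition 2: adjacent processors differ in exactly one address bit."""
    return hamming_distance(a, b) == 1


def adjacent_processors(p: int, n: int) -> list[int]:
    """Addresses of the n processors adjacent to p in an n-cube."""
    return [p ^ (1 << k) for k in range(n)]
```

For example, in a 3-cube the processors adjacent to address 000 are 001, 010, and 100.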
2.2. Finite Element Graphs (FEGs) 
The finite element method (FEM) is a widely used technique for solving partial differential
equations (PDEs) by using an iterative approach. In the finite element model, an object can be
viewed as a FEG. A FEG is a connected and undirected graph which consists of a number 
of rectilinear 4-node finite elements (FEs). 
Figure 1 : An example of n-cubes, for n = 1, 2, and 3 
Definition 3 : A FEG is a 2-D FEG if it is a planar graph. • 
Definition 4 : In a FEG, two nodes node(x) and node(y) are adjacent nodes if < node(x), 
node(y) > is an edge of the FEG. • 
Definition 5 : In a FEG, two nodes node(x) and node(y) are neighbor nodes if node(x) and
node(y) are in the same FE. •
In Figure 2(a), for example, a 40-node FEG which consists of 25 FEs is shown (the circled
and uncircled numbers denote the FE numbers and node numbers, respectively). Let FE(x)
denote the set of nodes which form FE x, ADJ(node(y)) denote the set of adjacent nodes of
node(y), NB(node(y)) denote the set of neighbor nodes of node(y), and #(NB(node(y))) denote
the cardinality of NB(node(y)), i.e., the number of nodes in NB(node(y)). We have FE(6) =
{node(7), node(8), node(14), node(15)}, ADJ(node(14)) = {node(7), node(13), node(15),
node(19)}, NB(node(14)) = {node(6), node(7), node(8), node(13), node(15), node(18), node(19),
node(20)}, and #(NB(node(14))) = 8. It is clear that ADJ(node(y)) is a subset of NB(node(y)),
i.e., ADJ(node(y)) ⊆ NB(node(y)). In this paper, we assume that the number of edges (E), the
number of finite elements (F), and the number of nodes (N) differ from each other by a
multiplicative constant, i.e., $E = c_1 F = c_2 N$, for some constants $c_1$ and $c_2$. These assumptions
are true for most FEGs. We also assume that the degree of every node in a FEG is upper
bounded by a constant, i.e., #(ADJ(node(y))) is a constant. This assumption implies that
#(NB(node(y))) is also a constant.
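The sets ADJ and NB can be derived from the list of finite elements alone. The sketch below assumes each 4-node FE is given as a tuple of node numbers in cyclic order around the element, so that consecutive entries are joined by an edge; the small 9-node example graph is hypothetical and is not the FEG of Figure 2.

```python
from collections import defaultdict


def adjacency_and_neighbors(elements):
    """Build ADJ and NB from FEs listed as node tuples in cyclic order."""
    adj = defaultdict(set)
    nb = defaultdict(set)
    for fe in elements:
        for idx, u in enumerate(fe):
            v = fe[(idx + 1) % len(fe)]  # next corner along the element boundary
            adj[u].add(v)
            adj[v].add(u)
            # Definition 5: all other nodes of the same FE are neighbor nodes.
            nb[u].update(w for w in fe if w != u)
    return adj, nb


# Hypothetical example: a 3 x 3 grid of nodes numbered 1..9 row by row.
ELEMENTS = [(1, 2, 5, 4), (2, 3, 6, 5), (4, 5, 8, 7), (5, 6, 9, 8)]
ADJ, NB = adjacency_and_neighbors(ELEMENTS)
```

In this example the center node 5 has 4 adjacent nodes and 8 neighbor nodes, the same counts as node(14) in the text, and ADJ(node(y)) ⊆ NB(node(y)) holds for every node.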
In a FEG, a node represents a particular amount of computation. Each node has the 
same computational load and can be executed independently. Each node has to send data 
to its neighbor nodes after completing its computation and all the nodes have to finish their 
communication before they can commence the next iteration. The communication needed
between nodes in the FEG of Figure 2(a) is shown in Figure 2(b). We use symbol N to denote
the number of nodes of a FEG throughout this paper. 
Figure 2 : An example of a 40-node FEG and the communication needed between nodes.
(a) A 40-node FEG with 25 FEs. (b) The communication needed between nodes.
2.3. The Cost Models of Mapping FEGs onto Hypercubes 
From the parallel processing point of view, a FEG can be characterized as a task 
interaction graph (TIG) [SaEr87]. In a TIG, nodes represent tasks and edges denote the 
communication needed between tasks. All the tasks can be executed independently and 
simultaneously, i.e., the temporal dependencies of tasks are not represented explicitly. 
To map an N-node FEG onto an M-processor hypercube, we need to assign the nodes
of a FEG to the processors of a hypercube. There are $M^N$ possible mappings. The total execution
time of a FEG on a hypercube under a particular mapping $MAP_i$ is defined as follows:
$$T_{par}(MAP_i) = \max_j\{load_i(P_j)\} \times T_{task} + C_i(P), \tag{2}$$
where $T_{par}(MAP_i)$, $load_i(P_j)$, $T_{task}$, and $C_i(P)$ represent the total execution time, the
computational load assigned to processor $P_j$, the time to execute a task on a processor, and
the communication cost of processors under mapping $MAP_i$, respectively, where $i = 1, \ldots, M^N$
and $j = 0, \ldots, M-1$.
The computational load assigned to each processor of a hypercube is equal to the number of nodes
of a FEG assigned to it. Since the processor with the maximal computational load determines
the computational cost of a mapping, Equation 2 employs the synchronous communication 
mode implicitly, i.e., the communication between processors cannot be started until all the 
processors have completed their computations. 
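Equation 2 can be evaluated directly once a mapping and its communication cost are known. The sketch below assumes a mapping is a dictionary from node number to processor address and that the communication cost has already been computed; the function and parameter names are ours.

```python
from collections import Counter


def parallel_time(mapping, t_task, comm_cost):
    """Equation 2: maximal processor load times T_task, plus C_i(P)."""
    loads = Counter(mapping.values())  # number of nodes per processor
    return max(loads.values()) * t_task + comm_cost


# Hypothetical example: 8 nodes on a 2-cube, with loads 3, 2, 2, and 1.
EXAMPLE_MAPPING = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3}
```

With T_task = 10 and a communication cost of 7, the example mapping gives 3 x 10 + 7 = 37 time units.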
If we assign the four nodes of a FE to different processors, there exists at least one pair 
of nodes in a FE such that the communication distance of this pair of nodes in a hypercube 
is greater than or equal to 2. In this paper, we consider only mappings such that the 
communication distance between neighbor nodes of a FEG in a hypercube is less than or equal 
to 2. 
Definition 6 : In an n-cube, any two processors whose addresses differ by at most two 
bits are neighbor processors. • 
Definition 7 : A mapping is a neighbor mapping if any two neighbor nodes (nodes 
corresponding to a FE) of a FEG are assigned to the same processor or two neighbor 
processors of a hypercube. • 
From Definitions 6 and 7, we have the following lemma. 
Lemma 1 : To map a FEG onto a 2-cube, any mapping approach is a neighbor mapping. •
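Definitions 6 and 7 can be checked mechanically. The following sketch assumes FEs are given as tuples of node numbers and a mapping is a dictionary from node number to processor address; these representations and names are ours.

```python
def are_neighbor_processors(a: int, b: int) -> bool:
    """Definition 6: the two addresses differ by at most two bits."""
    return bin(a ^ b).count("1") <= 2


def is_neighbor_mapping(elements, mapping) -> bool:
    """Definition 7: all nodes of every FE lie on the same or neighbor processors."""
    return all(
        are_neighbor_processors(mapping[u], mapping[v])
        for fe in elements
        for u in fe
        for v in fe
    )
```

Lemma 1 follows because any two addresses of a 2-cube have only two bits and therefore differ in at most two of them.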
In our communication models, we assume that every processor can communicate with 
all its adjacent processors in one step. Since we use the synchronous communication mode, 
$C_i(P)$ is defined as follows:
$$C_i(P) = \sum_{j=1}^{S} \left( T_{setup} + \max_j\{c_{kl}\} \times T_c \right), \tag{3}$$
where S is the number of steps to finish the data communication among processors, $T_{setup}$ is
the setup time of the I/O channel, $\max_j\{c_{kl}\}$ is the maximal amount of data sent from $P_k$ to
$P_l$ in step j, and $T_c$ is the data transmission time of the I/O channel per word. An I/O channel
between two adjacent processors, $P_i$ and $P_j$, of a hypercube is a bidirectional channel if $P_i$ and
$P_j$ can send data to each other simultaneously; otherwise, it is a unidirectional channel.
If the I/O channel used in a hypercube is bidirectional (the bidirectional communication
model), algorithm bidirectional_comm_cost is used to compute the value of $C_i(P)$.
algorithm bidirectional_comm_cost(X)
/* X is the intermediate processor matrix. ∀ x_ij ∈ X, if P_i = a_{n-1}...a_{k+1} a_k a_{k-1}...a_0 and
   P_j = b_{n-1}...b_{k+1} ā_k a_{k-1}...a_0, then x_ij = a_{n-1}...a_{k+1} ā_k a_{k-1}...a_0,
   where ā_k denotes the complement of a_k */
1. Compute the communication cost matrix C according to a particular mapping;
2. C_i(P) = 0;
/* For the neighbor mapping, this loop is executed at most twice */
3. while ((∃ c_ab > 0) and (P_a and P_b are neighbor processors)) do
4. { ∀ c_ij > 0, 0 ≤ i, j ≤ M-1, send data c_ij from P_i to x_ij; Update C and C_i(P); }
5. return(C_i(P));
end_of_bidirectional_comm_cost
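The intermediate processor used in step 4 can be computed with two bit operations. This sketch rests on our reading of the algorithm's opening comment: k is taken to be the lowest-order bit in which the two addresses differ, and the intermediate processor is the sender's address with bit k complemented. The function name is ours.

```python
def intermediate_processor(p_i: int, p_j: int) -> int:
    """Next hop from p_i toward p_j: flip the lowest-order differing bit.

    Assumption: k is the lowest bit position where p_i and p_j differ.
    """
    diff = p_i ^ p_j
    if diff == 0:
        return p_i  # same processor, nothing to route
    lowest = diff & -diff  # isolates the lowest set bit of diff
    return p_i ^ lowest
```

Under this reading, data from processor 00 to processor 11 in a 2-cube is first sent to processor 01, and for adjacent processors the intermediate processor is the destination itself.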
The initialization of the communication cost matrix requires $O(M \log^2 M)$ time*. Line 1
requires $O(\sum_{i=1}^{N} \#(NB(node(i)))) = O(N)$ time; line 2 requires $c_1$ time; line 3 requires $c_2$ time;
line 4 requires $O(M \log^2 M)$ time; and line 5 requires $c_3$ time, where $c_1$, $c_2$, and $c_3$ are constants.
Lines 3 and 4 form a loop and this loop is executed at most twice. The computational
complexity of this algorithm is equal to $O(N + c_1 + 2 \times (c_2 + M \log^2 M) + c_3) = O(N + M \log^2 M)$.
The communication behavior of algorithm bidirectional_comm_cost is shown in
Figure 3(a). In Figure 3(a), S is equal to 2, $\max_1\{c_{kl}\} = c_{01} + c_{03} = c_{10} + c_{12} = c_{21} + c_{23} =
c_{30} + c_{32} = 2$, and $\max_2\{c_{kl}\} = c_{02} = c_{13} = c_{20} = c_{31} = 1$. We can derive that $C_i(P) = 2 \times
T_{setup} + (2 + 1) \times T_c = 2 \times T_{setup} + 3 \times T_c$.
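Once the per-step maxima are known, Equation 3 reduces to a single sum. The sketch below assumes the maxima are supplied as a list with one entry per communication step; the names are ours.

```python
def communication_cost(step_maxima, t_setup, t_c):
    """Equation 3: each step costs T_setup plus its largest transfer times T_c."""
    return sum(t_setup + m * t_c for m in step_maxima)
```

For the bidirectional example above, the step maxima are [2, 1], so the cost is 2 x T_setup + 3 x T_c; with T_setup = 5 and T_c = 1 this evaluates to 13.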
If the I/O channel used in a hypercube is unidirectional (the unidirectional
communication model), algorithm unidirectional_comm_cost is used to compute the value of
$C_i(P)$.
algorithm unidirectional_comm_cost(X)
/* X is the intermediate processor matrix. ∀ x_ij ∈ X, if P_i = a_{n-1}...a_{k+1} a_k a_{k-1}...a_0 and
   P_j = b_{n-1}...b_{k+1} ā_k a_{k-1}...a_0, then x_ij = P_l = a_{n-1}...a_{k+1} ā_k a_{k-1}...a_0,
   where ā_k denotes the complement of a_k */
1. Compute the communication cost matrix C according to a particular mapping;
2. toggle = 0; C_i(P) = 0;
/* For the neighbor mapping, this loop is executed at most four times */
3. while ((∃ c_ab > 0) and (P_a and P_b are adjacent processors)) do
/* Set the communicating direction of channel_il from P_i to P_l if P_i = a_{n-1}...a_{k+1} a_k a_{k-1}...a_0,
   P_l = a_{n-1}...a_{k+1} ā_k a_{k-1}...a_0, P_j = b_{n-1}...b_{k+1} ā_k a_{k-1}...a_0, a_k = toggle, and c_ij > 0 */
4. { ∀ c_ij > 0, 0 ≤ i, j ≤ M-1, P_i = a_{n-1}...a_{k+1} a_k a_{k-1}...a_0, P_j = b_{n-1}...b_{k+1} ā_k a_{k-1}...a_0,
     x_ij = P_l = a_{n-1}...a_{k+1} ā_k a_{k-1}...a_0, and a_k = toggle,
5.     if (channel_il is available or channel_il = P_i → P_l) then
       { channel_il = P_i → P_l; /* The communicating direction of channel_il is set from P_i to P_l */
         Send data c_ij from P_i to P_l; Update C and C_i(P); }
/* If some channels channel_jl are still available after steps 4-5 are executed, set the
   communicating direction of channel_jl from P_j to P_l if P_j = b_{n-1}...b_{k+1} ā_k a_{k-1}...a_0,
   P_l = b_{n-1}...b_{k+1} a_k a_{k-1}...a_0, P_i = a_{n-1}...a_{k+1} a_k a_{k-1}...a_0, a_k = toggle, and c_ji > 0. */
6.   ∀ c_ji > 0, 0 ≤ i, j ≤ M-1, P_i = a_{n-1}...a_{k+1} a_k a_{k-1}...a_0, P_j = b_{n-1}...b_{k+1} ā_k a_{k-1}...a_0,
     x_ji = P_l = b_{n-1}...b_{k+1} a_k a_{k-1}...a_0, and a_k = toggle,
7.     if (channel_jl is available or channel_jl = P_j → P_l) then
       { channel_jl = P_j → P_l; Send data c_ji from P_j to P_l; Update C and C_i(P); }
8.   toggle = (toggle + 1) mod 2;
9. }
10. return(C_i(P));
end_of_unidirectional_comm_cost
* Note that each processor can only have $O(\log^2 M)$ neighbor processors. The other values of the matrix are useless. For
ease of presentation, ∀ c_ij > 0 in the algorithm refers only to those c_ij for which P_i and P_j are neighbor processors. Thus the
complexity of this operation is $O(M \log^2 M)$ as compared to the obvious $O(M^2)$. This assumption holds for the rest of the
presentation.
In algorithm unidirectional_comm_cost, line 1 requires $O(\sum_{i=1}^{N} \#(NB(node(i)))) = O(N)$
time; line 2 requires $c_1$ time; line 3 requires $c_2$ time; lines 5 and 7 require $c_3$ time; line 8 requires
$c_4$ time; and line 10 requires $c_5$ time, where $c_1$, $c_2$, $c_3$, $c_4$, and $c_5$ are constants. Lines 3 to 9, 4
to 5, and 6 to 7 form loops and these loops have $O(c)$, $O(M \log^2 M)$, and $O(M \log^2 M)$ iterations,
respectively, where c is a constant. The computational complexity of this algorithm is equal
to $O(N + c_1 + c \times (c_2 + M \log^2 M \times c_3 + M \log^2 M \times c_3 + c_4) + c_5) = O(N + M \log^2 M)$.
An example of the communication behavior of algorithm unidirectional_comm_cost is shown in
Figure 3(b). In Figure 3(b), S is equal to 4, $\max_1\{c_{kl}\} = c_{01} + c_{03} = c_{21} + c_{23} = 2$, $\max_2\{c_{kl}\}
= c_{10} + c_{12} = c_{30} + c_{32} = c_{31} = 2$, $\max_3\{c_{kl}\} = c_{02} = c_{13} = 1$, and $\max_4\{c_{kl}\} = c_{20} = 1$.
We can derive that $C_i(P) = 4 \times T_{setup} + (2 + 2 + 1 + 1) \times T_c = 4 \times T_{setup} + 6 \times T_c$.
Let Tseq denote the total execution time of a FEG on a 0-cube which contains only one 
processor. The speedup of a mapping MAP; is defined as follows: 
$$SpeedUp(MAP_i) = \frac{T_{seq}}{T_{par}(MAP_i)}. \tag{4}$$
Figure 3(a): The communication behavior of algorithm bidirectional_comm_cost (initial communication cost matrix C with rows 0111, 1011, 1101, 1110; two steps).
Figure 3(b): The communication behavior of algorithm unidirectional_comm_cost (same initial matrix C; four steps, with toggle alternating between 0 and 1).
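Equation 4 can be evaluated together with Equation 2. In the sketch below we assume that $T_{seq} = N \times T_{task}$, i.e., that a single processor executes all N nodes with no communication; the text defines $T_{seq}$ only as the execution time on a 0-cube, so this expression and the function name are our assumptions.

```python
def speedup(n_nodes, max_load, t_task, comm_cost):
    """Equation 4, with T_seq assumed to be N * T_task (no communication)."""
    t_seq = n_nodes * t_task  # assumption: sequential time on one processor
    t_par = max_load * t_task + comm_cost  # Equation 2
    return t_seq / t_par
```

For a 40-node FEG with a maximal load of 10 nodes, unit task time, and zero communication cost, the speedup is 4; a communication cost of 10 time units halves it.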
The objective of mapping a FEG onto a hypercube is to minimize the total execution time,
i.e., $\min\{T_{par}(MAP_i)\}$, or maximize the speedup, i.e., $\max\{SpeedUp(MAP_i)\}$, where $i = 1, 2,
\ldots, M^N$. From Equation 2, we know that the processor with the maximal computational load
and the communication cost of processors determine the total execution time of a FEG on
a hypercube under a particular mapping. Since our main objective is to minimize these
quantities, there are three ways to achieve the objective of a mapping. (1) First minimize the
communication cost, then balance the computational load. (2) First balance the
computational load, then minimize the communication cost. (3) Minimize the communication
cost and balance the computational load simultaneously. The 2-way stripes partition mapping
and the greedy assignment mapping adopt approaches (1) and (3), respectively.
3. THE 2-WAY STRIPES PARTITION MAPPING 
The 2-way stripes partition mapping is a two-phase mapping approach. In the first phase
(the partition and allocation phase), it uses the 2-way stripes partition heuristic and stripes merge
to partition an N-node FEG into M modules, where each module contains m tasks and 0 ≤
m ≤ N. These modules are assigned to processors by using the binary reflected Gray code
(BRGC). Since the computational load may not be equally assigned to each processor in this
phase, we will try to balance the computational load among processors by using the load
transfer heuristic in the second phase (the load balancing phase).
3.1. Phase I: The 2-way Stripes Partition and Stripes Allocation 
The basic approach used in the 2-way stripes partition to partition a FEG into modules 
is the stripes partition approach. The stripes partition approach starts at an arbitrary node 
node(x) of a FEG and labels it as 0. Next, the neighbor nodes of node(x), NB(node(x)), are 
labeled as 1. This process continues till each node in a FEG is assigned a label. Our approach 
is more general than the stripes partition approach of [SaEr87]. The approach proposed in 
[SaEr87] can only be used to partition 2-D FEGs and has some restrictions. Our approach 
removes the restrictions in [SaEr87] and can be used to partition any kind of FEGs. The 2-way 
stripes partition uses the stripes partition method twice. The partitioning starts at node(1)
and node($\lfloor N/2 \rfloor + 1$), respectively. By using this method, the labels assigned to each node can
be denoted by a 2-tuple ($l_1$, $l_2$), where $l_1$ and $l_2$ denote the labels assigned to a node by the first
and second stripes partition, respectively.
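The labeling described above behaves like a breadth-first traversal over neighbor nodes. The sketch below is one possible reading, assuming NB is available as a dictionary of sets; the grid helper, the function names, and the choice of start nodes in the example are hypothetical.

```python
from collections import deque


def stripes_partition(nb, start):
    """Label start as 0, its unlabeled neighbor nodes as 1, and so on."""
    labels = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in nb[u]:
            if v not in labels:
                labels[v] = labels[u] + 1
                queue.append(v)
    return labels


def two_way_labels(nb, start1, start2):
    """2-tuple labels (l1, l2) from two stripes partitions."""
    l1 = stripes_partition(nb, start1)
    l2 = stripes_partition(nb, start2)
    return {v: (l1[v], l2[v]) for v in nb}


def grid_neighbors(rows, cols):
    """Hypothetical test FEG: NB sets of a rows x cols grid of nodes."""
    return {
        (r, c): {
            (r + dr, c + dc)
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols
        }
        for r in range(rows)
        for c in range(cols)
    }
```

Because a node receives its label from an already labeled neighbor node, the labels of any two neighbor nodes differ by at most one in each component.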
The next step is to assign these nodes to processors according to their labels. By using
the 2-way stripes partition, the 2-tuple labels assigned to nodes imply the following lemma.
Lemma 2 : For any two neighbor nodes node(i) and node(j) with labels ($l_{i1}$, $l_{i2}$) and ($l_{j1}$, $l_{j2}$),
respectively, we have $|l_{i1} - l_{j1}| \le 1$ and $|l_{i2} - l_{j2}| \le 1$. •
To assign nodes to processors according to their labels, we need to flatten an n-cube into
a two dimensional form. For any two neighbor processors processor($i_1$, $j_1$) and processor($i_2$,
$j_2$) in a mesh, we have $|i_1 - i_2| \le 1$ and $|j_1 - j_2| \le 1$. To map a FEG onto a mesh, the neighbor
mapping can be easily achieved by assigning node(i) with labels ($l_{i1}$, $l_{i2}$) to processor($l_{i1}$, $l_{i2}$).
Since an n-cube can emulate $1 \times 2^n$, $2 \times 2^{n-1}$, ..., $2^n \times 1$ meshes, we will try all cases. A binary
reflected Gray code (BRGC) [ChSa86] is defined as follows:
$$N_k = \begin{cases} (0, 1) & \text{if } k = 1 \\ 0N_{k-1} + 1N_{k-1}^{*} & \text{if } k > 1 \end{cases} \tag{5}$$
where + and * denote sequence concatenation and sequence reversal operations, respectively.
From Equation 5, we know that $N_1 = (0, 1)$, $N_1^{*} = (0, 1)^{*} = (1, 0)$, $N_2 = 0N_1 + 1N_1^{*} = 0(0,
1) + 1(1, 0) = (00, 01) + (11, 10) = (00, 01, 11, 10)$, $N_3 = (000, 001, 011, 010, 110, 111, 101,
100)$, $N_3(0) = 000$, and $N_3(3) = 010$. Note that $N_k(r)$ denotes the (r+1)th element of $N_k$, where
$r = 0, \ldots, 2^k - 1$. To embed a $2^x \times 2^y$ mesh in a (x+y)-cube, we assign processor(i, j) in a mesh
to the processor in the (x+y)-cube according to the following equation:
$$addr : processor(i, j) \rightarrow N_x(i) \wedge N_y(j), \tag{6}$$
where $0 \le i \le 2^x - 1$, $0 \le j \le 2^y - 1$, and $\wedge$ is the binary string concatenation operation.
An example of embedding a 2 x 4 mesh in a 3-cube by using Equation 6 is shown in Figure
4(e). In Figure 4(e), the addresses of processor(0, 2) and processor(1, 0) are $N_1(0) \wedge N_2(2) =
011$ and $N_1(1) \wedge N_2(0) = 100$, respectively. By using the BRGCs, the addresses of any two
adjacent processors and any two neighbor processors of a mesh differ by one and two bits,
respectively, when the mesh is embedded in a hypercube.
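The BRGC recursion and the mesh embedding can be sketched in a few lines. The code below represents code words as bit strings; treating $N_0$ as a single empty string, so that $1 \times 2^n$ meshes can be handled, is our assumption, as are the function names.

```python
def brgc(k: int) -> list[str]:
    """Binary reflected Gray code N_k as a list of k-bit strings."""
    if k == 0:
        return [""]  # assumption: lets a 1 x 2^n mesh use x = 0
    if k == 1:
        return ["0", "1"]
    prev = brgc(k - 1)
    # 0 N_{k-1} followed by 1 N_{k-1} reversed
    return ["0" + code for code in prev] + ["1" + code for code in reversed(prev)]


def mesh_address(i: int, j: int, x: int, y: int) -> str:
    """Address of processor(i, j) of a 2^x by 2^y mesh in an (x+y)-cube."""
    return brgc(x)[i] + brgc(y)[j]
```

For the 2 x 4 mesh of the example (x = 1, y = 2), `mesh_address(0, 2, 1, 2)` yields 011 and `mesh_address(1, 0, 1, 2)` yields 100, matching the addresses given in the text.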
Let $L^a_b$ represent the number of nodes whose labels are equal to b in the a-th stripes
partition, where a = 1 or 2. Let $L_1$ and $L_2$ represent the largest label numbers of the first
and the second stripes partition, respectively. Assume that a $2^x \times 2^y$ mesh is embedded in a
(x+y)-cube by using Equation 6. If $2^x - 1 < L_1$ ($2^y - 1 < L_2$), we will merge the two adjacent
stripes m and m+1 (n and n+1) which minimize $L^1_m + L^1_{m+1}$ ($L^2_n + L^2_{n+1}$), for all m = 0,
..., $L_1 - 1$ (for all n = 0, ..., $L_2 - 1$). This merge processing continues till $L_1 = 2^x - 1$ ($L_2 = 2^y - 1$).
The computational complexity of this merge process is equal to $O(N^2)$. After this merge
processing, every node in a FEG is assigned new 2-tuple labels ($l_1'$, $l_2'$), where $0 \le l_1' \le
2^x - 1$ and $0 \le l_2' \le 2^y - 1$. Then, we assign nodes with new labels to processors of a (x+y)-cube
according to the following equation:
$$alloc : node(i) \rightarrow N_x(l_1') \wedge N_y(l_2'), \tag{7}$$
where the 2-tuple labels ($l_1'$, $l_2'$) are the new labels assigned to node(i), $0 \le l_1' \le 2^x - 1$, and
$0 \le l_2' \le 2^y - 1$. An example of partitioning a FEG into stripes and assigning stripes to a 3-cube
is shown in Figure 4.
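The merge step for one stripes partition can be sketched as follows. The code assumes that the stripe sizes are given as a list indexed by label and that merging stops once the number of stripes equals the mesh dimension ($2^x$ or $2^y$); the function name and the tie-breaking rule (the first minimal pair is merged) are our assumptions.

```python
def merge_stripes(counts, target):
    """Merge adjacent stripes with minimal combined size until target remain.

    counts[b] is the number of nodes with label b. Returns the new stripe
    sizes and a dictionary mapping each old label to its new label.
    """
    counts = list(counts)
    groups = [[b] for b in range(len(counts))]  # old labels in each stripe
    while len(counts) > target:
        m = min(range(len(counts) - 1), key=lambda t: counts[t] + counts[t + 1])
        counts[m:m + 2] = [counts[m] + counts[m + 1]]
        groups[m:m + 2] = [groups[m] + groups[m + 1]]
    new_label = {old: new for new, grp in enumerate(groups) for old in grp}
    return counts, new_label
```

For example, stripe sizes [1, 3, 5, 7, 5, 3, 1] merged down to 4 stripes give sizes [9, 7, 5, 4], with old labels 0, 1, 2 mapped to new label 0 and old labels 5, 6 mapped to new label 3.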
The algorithm of the 2-way stripes partition and allocation is given as follows.
algorithm 2_way_stripes_partition_allocation(row, col)
/* row and col denote the length and width of a mesh, respectively. */
1. Calculate the adjacent and neighbor nodes of each node in a FEG.
2. The first stripes partition.
3. The second stripes partition.
4. Merge stripes produced by the first and second stripes partition if necessary.
5. Assign nodes to processors according to their new labels by using Equation 7.
end_of_2_way_stripes_partition_allocation
Figure 4(a) : The labels assigned to nodes by the first stripes partition.
Figure 4(b) : The labels assigned to nodes by the second stripes partition.
Figure 4(c) : The 2-tuple labels assigned to nodes.
Figure 4(d) : A 3-cube.
Figure 4(e) : A 2 x 4 mesh embedded in a 3-cube by using Equation 6.
Figure 4(f) : The new labels of nodes after merging stripes.
Figure 4(g) : Allocate nodes to processors by using Equation 7.
In algorithm 2_way_stripes_partition_allocation, line 1 requires O(the number of FEs of
a FEG) = $O(N)$ time; both lines 2 and 3 require $O(\sum_{i=1}^{N} \#(NB(node(i)))) = O(N)$ time; line 4
requires $O(N^2)$ time; and line 5 requires $O(N)$ time. The computational complexity of this
algorithm is equal to $O(N + N + N + N^2 + N) = O(N^2)$.
3.2. Phase II : The Load Balance Phase 
The objective of this phase is to balance the computational load assigned to processors 
in the first phase while preserving the neighbor mapping property. It consists of two steps. 
In the first step, an M x M load transfer matrix A is computed. Element $a_{ij}$ in A denotes the
number of nodes $P_i$ needs to transfer to $P_j$. If $a_{ij}$ is negative, $|a_{ij}|$ denotes the number of nodes
$P_i$ needs to receive from $P_j$. Since a $2^x \times 2^y$ mesh is embedded in a (x+y)-cube, we can start
with computing the balanced load for processor $N_x(0) \wedge N_y(0)$, i.e., the number of nodes
$N_x(0) \wedge N_y(0)$ needs to transfer to or receive from its neighbor processors. Next, we compute
the balanced load for processor $N_x(0) \wedge N_y(1)$. This process continues till the balanced load
for processor $N_x(2^x - 1) \wedge N_y(2^y - 1)$ has been computed.
Let $load(N_x(i) \wedge N_y(j))$ denote the number of nodes assigned to $N_x(i) \wedge N_y(j)$,
$right(N_x(i) \wedge N_y(j))$ denote the right adjacent processor of $N_x(i) \wedge N_y(j)$, i.e., $N_x(i) \wedge N_y(j+1)$, and
$down(N_x(i) \wedge N_y(j))$ denote the down adjacent processor of $N_x(i) \wedge N_y(j)$, i.e., $N_x(i+1) \wedge N_y(j)$.
Note that processors $N_x(i) \wedge N_y(2^y - 1)$ and $N_x(2^x - 1) \wedge N_y(j)$ do not have right and down adjacent
processors, respectively. To balance the computational load among processors, every
processor should be assigned $N/M$ nodes. For simplicity, we assume N is divisible by M. The
number of nodes needed to be transferred to or received from the right or down adjacent
processor of $N_x(i) \wedge N_y(j)$ is determined by the following rules. This scheme is similar to that
of [SaEr87].
Rule 1: $load(N_x(i) \wedge N_y(j)) > N/M$. If $load(down(N_x(i) \wedge N_y(j))) < load(right(N_x(i) \wedge N_y(j)))$,
then $N_x(i) \wedge N_y(j)$ needs to transfer one node to $down(N_x(i) \wedge N_y(j))$; otherwise, $N_x(i) \wedge N_y(j)$
needs to transfer one node to $right(N_x(i) \wedge N_y(j))$. We update the load of processors and
continue to apply Rule 1 till $load(N_x(i) \wedge N_y(j)) = N/M$. For those processors that do not have right
or down processors, the load of their right or down processors is taken to be $\infty$.
Rule 2: $load(N_x(i) \wedge N_y(j)) < N/M$. If $load(down(N_x(i) \wedge N_y(j))) > load(right(N_x(i) \wedge N_y(j)))$,
then $N_x(i) \wedge N_y(j)$ needs to receive one node from $down(N_x(i) \wedge N_y(j))$; otherwise, $N_x(i) \wedge N_y(j)$
needs to receive one node from $right(N_x(i) \wedge N_y(j))$. We update the load of processors and
continue to apply Rule 2 till $load(N_x(i) \wedge N_y(j)) = N/M$. For those processors that do not have right
or down processors, the load of their right or down processors is taken to be $-\infty$.
Rule 3: If $load(N_x(i) \wedge N_y(j)) = N/M$, then the load of this processor is balanced.
The time required to compute the load transfer matrix is equal to $O(MN)$.
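Rules 1 to 3 can be sketched on the mesh view of the hypercube. The code below assumes the loads are given as a 2-D list indexed by mesh coordinates (i, j), processes the processors row by row, and records transfers in a dictionary keyed by (source, destination) mesh coordinates rather than in an M x M matrix; these representation choices and the function name are ours. A positive entry means nodes are sent, a negative entry means nodes are received.

```python
import math


def load_transfer_amounts(load, target):
    """Apply Rules 1-3 row by row; return transfer amounts and final loads."""
    rows, cols = len(load), len(load[0])
    load = [row[:] for row in load]  # work on a copy
    transfers = {}
    for i in range(rows):
        for j in range(cols):
            right = (i, j + 1) if j + 1 < cols else None
            down = (i + 1, j) if i + 1 < rows else None
            if right is None and down is None:
                continue  # last processor: nothing left to exchange with
            while load[i][j] != target:  # Rule 3 ends the loop
                surplus = load[i][j] > target
                # A missing neighbor counts as +inf (Rule 1) or -inf (Rule 2).
                missing = math.inf if surplus else -math.inf
                l_down = load[down[0]][down[1]] if down else missing
                l_right = load[right[0]][right[1]] if right else missing
                if surplus:  # Rule 1: send to the lighter neighbor
                    dest, step = (down if l_down < l_right else right), 1
                else:  # Rule 2: receive from the heavier neighbor
                    dest, step = (down if l_down > l_right else right), -1
                load[i][j] -= step
                load[dest[0]][dest[1]] += step
                key = ((i, j), dest)
                transfers[key] = transfers.get(key, 0) + step
    return transfers, load
```

For a 2 x 2 mesh with loads [[3, 1], [0, 4]] and N/M = 2, processor (0, 0) sends one node down, while processors (0, 1) and (1, 0) each receive one node from processor (1, 1), after which every processor holds two nodes.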
In the second step, we perform the load transfer from one processor to another according
to the load transfer matrix A. The algorithm proceeds iteratively, in an incremental manner,
and is similar to that of [SaEr87].
algorithm load_transfer(A)
/* ND(P_i) denotes the set of nodes assigned to processor P_i and A is the load transfer matrix. */
1. Q = the set of processors that need to transfer nodes to other processors; mark them
   as active;
2. Make a heap H(P) for all the processors in a hypercube according to their load;
3. repeat
4. { repeat /* Consider transferring node(x) ∈ ND(P_i) to P_j such that node(x) ∈ NB(node(y))
             and node(y) ∈ ND(P_j). */
5.   { P_i = the active processor with the largest computational load in Q;
6.     max_load = load(root(H(P))); /* The maximal load assigned to processors */
7.     if (∃ j, node(x) such that a_ij > 0, load(P_j) < max_load, node(x) ∈ ND(P_i),
           ND(P_j) ∩ NB(node(x)) is not empty, and transfer of node(x) from
           P_i to P_j preserves the neighbor mapping) then
8.     { Assign node(x) to P_j; load(P_i) = load(P_i) - 1; load(P_j) = load(P_j) + 1;
         a_ij = a_ij - 1; Update H(P);
9.       if (∀ k = 0, ..., M-1, a_ik = 0) then Q = Q - {P_i};
10.      if (∀ P_l ∈ Q, P_l is inactive, and NB(node(x)) ∩ ND(P_l) ≠ ∅) then mark P_l as active; }
11.    else mark P_i as inactive;
12.  } until (all processors in Q are inactive);
13.  Mark all the processors in Q as active;
14.  repeat /* Consider transferring any node in ND(P_i). */
15.  { P_i = the active processor with the largest computational load in Q;
16.    max_load = load(root(H(P)));
17.    if (∃ j, node(x) such that a_ij > 0, load(P_j) < max_load, node(x) ∈ ND(P_i),
           and transfer of node(x) from P_i to P_j preserves the neighbor mapping) then
18.    { Assign node(x) to P_j; load(P_i) = load(P_i) - 1; load(P_j) = load(P_j) + 1;
         a_ij = a_ij - 1; Update H(P);
19.      if (∀ k = 0, ..., M-1, a_ik = 0) then Q = Q - {P_i};
20.      if (∀ P_l ∈ Q, P_l is inactive, and NB(node(x)) ∩ ND(P_l) ≠ ∅) then mark P_l as active; }
21.    else mark P_i as inactive;
22.  } until (all processors in Q are inactive);
23. } until (load is balanced or further balancing is impossible);
end_of_load_transfer
In algorithm load_transfer, lines 1 and 2 require $O(M)$ time; lines 5 and 15 require $O(M)$
time; lines 6 and 16 require $O(c_1)$ time; line 7 requires $O(2 \times \#(ND(P_i)) \times (\#(NB(node(x)))
+ \#(NB(node(x))))) = O(N)$ time; lines 8 and 18 require $O(\log M)$ time; lines 9, 13, and 19 require
$O(M)$ time; lines 10 and 20 require $O(M \times \#(NB(node(x)))) = O(M)$ time; lines 11 and 21 require
$O(c_2)$ time; lines 12 and 22 require $O(c_3)$ time; line 17 requires $O(2 \times \#(ND(P_i)) \times \#(NB(node(x))))
= O(N)$ time; and line 23 requires $O(c_4)$ time, where $c_1$, $c_2$, $c_3$, and $c_4$ are constants. Lines 3 to 23,
lines 4 to 12, and lines 14 to 22 form loops and these loops have $O(c)$, $O(MN)$, and $O(MN)$
iterations, respectively, where c is a constant. The computational complexity of this algorithm
under this assumption is equal to $O(M + M + c \times (MN \times (M + c_1 + N + \log M + M +
M + c_2 + c_3) + M + MN \times (M + c_1 + N + \log M + M + M + c_2 + c_3) + c_4)) = O(M^2 N
+ MN^2)$. We assume that N is usually greater than M. The worst case of the computational
complexity of this algorithm is $O(M^2 N + MN^2) \approx O(MN^2)$.
Algorithm load_transfer is not guaranteed to balance the computational load of
processors. If the computational load of processors can be balanced by this algorithm, the
values of all the elements in A are equal to zero.
The 2-way stripes partition mapping algorithm is given as follows. 
algorithm 2_way_stripes_partition_mapping(M, N, X)
/* X is the intermediate processor matrix. ∀ x_ij ∈ X, if P_i = a_{n-1}...a_{k+1} a_k a_{k-1}...a_0 and
P_j = b_{n-1}...b_{k+1} ā_k a_{k-1}...a_0, then x_ij = P_t = a_{n-1}...a_{k+1} ā_k a_{k-1}...a_0 */
1. row = 1; col = M; best_bi = 0; best_uni = 0;
2. repeat
3. { 2_way_stripes_partition_allocation(row, col);
4. Compute the load transfer matrix A;
5. load_transfer(A);
6. if (best_bi < bidirectional_comm_cost(X)) then best_bi = bidirectional_comm_cost(X);
7. if (best_uni < unidirectional_comm_cost(X)) then best_uni = unidirectional_comm_cost(X);
8. row = row * 2; col = col / 2;
9. } until (row > M);
end_of_2_way_stripes_partition_mapping
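The control flow of the repeat-until loop can be illustrated with a short Python sketch. It only enumerates the (row, col) pairs that lines 1, 8, and 9 of the algorithm generate; the function name stripe_shapes is hypothetical, and the allocation, load transfer, and cost evaluation steps are deliberately left out.

```python
def stripe_shapes(M):
    """Yield the (row, col) pairs visited by the repeat-until loop.

    M is the number of processors and is assumed to be a power of two,
    as it is for a hypercube.
    """
    if M < 1 or M & (M - 1):
        raise ValueError("M must be a power of two")
    row, col = 1, M                        # line 1 of the pseudocode
    while True:
        yield row, col                     # lines 3 to 7 would run here
        row, col = row * 2, col // 2       # line 8
        if row > M:                        # line 9: until (row > M)
            break


if __name__ == "__main__":
    for shape in stripe_shapes(8):
        print(shape)
```

Because row doubles on every pass and the body runs once more when row equals M, the loop body executes log M + 1 times, which is consistent with an O(log M) iteration count.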
In algorithm 2_way_stripes_partition_mapping, line 1 requires O(c_1) time; line 3 requires
O(N^2) time; line 4 requires O(MN) time; line 5 requires O(MN^2) time; lines 6 and 7 require O(N
+ M log^2 M) time; line 8 requires O(c_2) time; and line 9 requires O(c_3) time, where c_1, c_2, and
c_3 are constants. Lines 2 to 9 form a loop, and this loop has O(log M) iterations. The computational
complexity of this algorithm is O(c_1 + log M × (N^2 + MN + MN^2 + (N + M log^2 M)
+ (N + M log^2 M) + c_2 + c_3)) = O(MN^2 log M).
Lemma 3 : The 2-way stripes partition mapping is a neighbor mapping. ∎
4. THE GREEDY ASSIGNMENT MAPPING 
The greedy assignment mapping is a heuristic approach. It assigns a node to a particular
processor according to the current status of its neighbor nodes. Initially, it assigns node(a),
the node with the largest number of adjacent nodes in the FEG, to processor 0 and puts the
adjacent nodes of node(a) into a queue Q. The node node(i) in Q with the largest number
of adjacent nodes is selected as the next node to be assigned. Let P(NB(node(i))) denote the
set of processors to which the neighbor nodes of node(i) have been assigned, and let
P(POS(node(i))) denote the set of processors whose addresses differ from the address of each
processor in P(NB(node(i))) by at most two bits. If P(POS(node(i))) is empty, the neighbor
mapping is impossible for this approach; otherwise, node(i) is assigned to a processor
P_x ∈ P(POS(node(i))) such that load(P_x) ≤ load(P_y) for all P_y ∈ P(POS(node(i))). Then the
unassigned adjacent nodes of node(i) are inserted into Q. This process continues until all the
nodes are assigned or the neighbor mapping is found to be impossible. The algorithm is given
as follows.
algorithm greedy_assignment_mapping(X)
/* X is the intermediate processor matrix. ∀ x_ij ∈ X, if P_i = a_{n-1}...a_{k+1} a_k a_{k-1}...a_0 and
P_j = b_{n-1}...b_{k+1} ā_k a_{k-1}...a_0, then x_ij = P_t = a_{n-1}...a_{k+1} ā_k a_{k-1}...a_0 */
1. Calculate the adjacent and neighbor nodes of each node in a FEG;
2. Q = ∅;
3. node(a) = the node with the largest number of adjacent nodes in a FEG;
4. Assign node(a) to processor 0 and Q = Q ∪ ADJ(node(a));
5. Make a heap H(Q) for the nodes in Q according to the number of their adjacent nodes;
6. while (Q is not empty) do
7. { node(i) = root(H(Q)); /* the node with the largest number of adjacent nodes in Q */
8. Compute P(POS(node(i)));
9. if (P(POS(node(i))) is empty) then stop ("The neighbor mapping is impossible");
10. P_x = the processor with the smallest load in P(POS(node(i))); Assign node(i) to P_x;
11. load(P_x) = load(P_x) + 1; Q = Q - {node(i)};
Q = Q ∪ {those nodes in ADJ(node(i)) which have not been assigned}; Update H(Q);
12. }
13. best_bi = bidirectional_comm_cost(X);
14. best_uni = unidirectional_comm_cost(X);
end_of_greedy_assignment_mapping
In algorithm greedy_assignment_mapping, line 1 requires O(the number of FEs of the FEG)
= O(N) time; line 2 requires O(c_1) time; line 3 requires O(N) time; line 4 requires O(c_2) time;
line 5 requires O(c_3) time; line 7 requires O(c_4) time; line 8 requires O(#(NB(node(i))) ×
log^2 M) = O(log^2 M) time; line 9 requires O(c_5) time; line 10 requires O(log^2 M) time; line
11 requires O(log N) time; and lines 13 and 14 require O(N + M log^2 M) time, where c_1, c_2, c_3, c_4,
and c_5 are constants. Lines 6 to 12 form a loop. This loop has N iterations. The computational
complexity of this algorithm is O(N + c_1 + N + c_2 + c_3 + N × (c_4 + log^2 M +
c_5 + log^2 M + log N) + (N + M log^2 M) + (N + M log^2 M)) = O(N log^2 M + N log N). An
example of mapping a FEG onto a hypercube by using algorithm greedy_assignment_mapping
is shown in Figure 5.
Figure 5 : Mapping a FEG onto a hypercube by using the greedy assignment mapping.
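The greedy assignment procedure of this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based graph inputs adj (adjacent nodes) and nb (neighbor nodes), the tie-breaking by smallest node label and smallest processor address, and the counting of node(a) in the load of processor 0 are all choices made here. The cost evaluation of lines 13 and 14 is omitted, and the graph is assumed to be connected.

```python
import heapq
from itertools import combinations


def within_two_bits(p, n):
    """Addresses of an n-cube whose Hamming distance from p is at most 2."""
    out = {p}
    for b in range(n):
        out.add(p ^ (1 << b))
    for b1, b2 in combinations(range(n), 2):
        out.add(p ^ (1 << b1) ^ (1 << b2))
    return out


def greedy_assignment(adj, nb, n):
    """Map nodes to the processors of an n-cube.

    adj[v] lists the adjacent nodes of v and nb[v] its neighbor nodes.
    Returns a dict node -> processor, or None when the candidate set
    P(POS(node)) becomes empty (neighbor mapping impossible).
    """
    M = 1 << n
    load = [0] * M
    proc = {}
    heap, queued = [], set()

    def push(v):
        # Max-heap on the number of adjacent nodes (negated for heapq).
        if v not in proc and v not in queued:
            queued.add(v)
            heapq.heappush(heap, (-len(adj[v]), v))

    start = max(adj, key=lambda v: len(adj[v]))   # node(a)
    proc[start] = 0
    load[0] = 1            # assumption: node(a) counts toward load(P_0)
    for v in adj[start]:
        push(v)

    while heap:
        _, v = heapq.heappop(heap)
        queued.discard(v)
        # P(POS(v)): within two bits of every processor that already
        # holds a neighbor node of v.
        candidates = set(range(M))
        for u in nb[v]:
            if u in proc:
                candidates &= within_two_bits(proc[u], n)
        if not candidates:
            return None
        px = min(candidates, key=lambda p: (load[p], p))
        proc[v] = px
        load[px] += 1
        for u in adj[v]:
            push(u)
    return proc


if __name__ == "__main__":
    # A 4-node path where adjacent and neighbor nodes coincide (toy input).
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(greedy_assignment(path, path, 1))
```

Intersecting the two-bit neighborhoods of the already assigned neighbor nodes is one direct way to compute P(POS(node(i))); each neighborhood holds 1 + n + n(n-1)/2 addresses, in line with the O(log^2 M) cost quoted for line 8.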
5. PERFORMANCE EVALUATION AND SIMULATION RESULTS 
The FEG samples tested in this paper consist of four 2-D graphs and three 3-D
graphs, which are shown in Figures 7(a)-(d) and 7(e)-(g), respectively. The number of nodes
in these FEGs ranges from a few tens to a few hundreds. According to the communication
models described in Section 2.3, we derive the estimated lower bound speedup (ELBS) and the
estimated upper bound speedup (EUBS) for both the bidirectional and the unidirectional
communication models to measure our mapping results. They are given as follows:
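$$
\mathrm{EUBS}_{bi} = \frac{N \times T_{task}}{\lceil N/M \rceil \times T_{task} + T_{setup} + 2 \times T_c} \tag{8.1}
$$

$$
\mathrm{ELBS}_{bi} = \frac{N \times T_{task}}{\lceil N/M \rceil \times T_{task} + 2 \times T_{setup} + (2 \times \log M - 1) \times \lceil N/M \rceil \times T_c} \tag{8.2}
$$

$$
\mathrm{EUBS}_{uni} = \frac{N \times T_{task}}{\lceil N/M \rceil \times T_{task} + 2 \times (T_{setup} + 2 \times T_c)} \tag{9.1}
$$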
where T_task, T_setup, and T_c denote the time required by a processor to execute the computation
of a node, the setup time of the I/O channel, and the data transmission time of the I/O channel
per word, respectively; EUBS_bi and ELBS_bi denote the EUBS and ELBS of the bidirectional
communication model, respectively; EUBS_uni and ELBS_uni denote the EUBS and ELBS of
the unidirectional communication model, respectively.
The EUBS and ELBS are obtained by assuming that both the LBC and the neighbor
mapping are achieved. If the LBC is achieved by a mapping, the term max{load(P_i)} in
Equation 2 is equal to ⌈N/M⌉. If a mapping is a neighbor mapping, the best case of the
communication cost is that any two neighbor nodes of a FEG are assigned to the same
processor or to two adjacent processors of a hypercube, and every processor needs to send
only two nodes' data to each of its adjacent processors (see Figure 6). According to the
communication models described in Section 2.3, we can derive Equations 8.1 and 9.1.
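The two upper bounds can be evaluated numerically with a few lines of Python. This is a minimal sketch of Equations 8.1 and 9.1 as read from this section; the function names are hypothetical, the timing values are the ones this section assumes for the processors, and the node count N = 160 is a made-up input chosen only for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b, used for the balanced load ceil(N / M)."""
    return -(-a // b)


def eubs_bi(n_nodes, m_procs, t_task, t_setup, t_c):
    """Equation 8.1: best case, bidirectional model.

    One channel setup plus two words of data on top of the computation.
    """
    max_load = ceil_div(n_nodes, m_procs)
    return n_nodes * t_task / (max_load * t_task + t_setup + 2 * t_c)


def eubs_uni(n_nodes, m_procs, t_task, t_setup, t_c):
    """Equation 9.1: best case, unidirectional model.

    Two transfers, each costing one setup plus two words of data.
    """
    max_load = ceil_div(n_nodes, m_procs)
    return n_nodes * t_task / (max_load * t_task + 2 * (t_setup + 2 * t_c))


if __name__ == "__main__":
    # Times in microseconds; N = 160 is hypothetical.
    T_TASK, T_SETUP, T_C = 1190, 1150, 10
    for m in (8, 16, 32):
        print(m,
              round(eubs_uni(160, m, T_TASK, T_SETUP, T_C), 2),
              round(eubs_bi(160, m, T_TASK, T_SETUP, T_C), 2))
```

Both bounds stay below M because the communication terms are added to a computation time that is already at least N × T_task / M, and the unidirectional bound is the smaller of the two because it pays the setup and transmission cost twice.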
Figure 6 : The best case of the communication cost of a mapping. 
If a mapping is a neighbor mapping, the worst case of the communication cost is that
any two neighbor nodes of a FE are assigned to two processors whose addresses differ by two
bits in a hypercube. For the bidirectional communication model, the maximal number of steps
to finish the data communication among processors is 2. In step 1, a processor
receives data from its adjacent processors and sends data to its neighbor processors
simultaneously. The maximal amount of data sent by a processor is log M × ⌈N/M⌉. In
step 2, the maximal amount of data sent from a processor to its adjacent processors is
(log M - 1) × ⌈N/M⌉ (see Figure 3(a)). Therefore, we can derive Equation 8.2. For the
unidirectional communication model, the maximal number of steps to finish the data
communication among processors is 4. A processor may receive (send) data from
(to) its adjacent (neighbor) processors in step 1 and then send (receive) data to (from) its
neighbor (adjacent) processors in step 2. The maximal amount of data sent by a processor in
each of steps 1 and 2 is log M × ⌈N/M⌉. A processor may receive (send) data from (to)
its adjacent processors in step 3 and then send (receive) data to (from) its adjacent
processors in step 4. The maximal amount of data sent by a processor in each of steps 3 and
4 is (log M - 1) × ⌈N/M⌉ (see Figure 3(b)). Therefore, we can derive Equation
9.2.
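A candidate form of Equation 9.2 can be reassembled from the step counts in this paragraph. The following is a sketch under three assumptions that are not stated explicitly here: each of the four steps costs one channel setup T_setup, the per-step data volumes add up as they do in Equation 8.2, and the volume in each of steps 3 and 4 is read as (log M - 1) × ⌈N/M⌉, by analogy with step 2 of the bidirectional case.

$$
\underbrace{2 \log M \,\lceil N/M \rceil}_{\text{steps 1 and 2}} + \underbrace{2 (\log M - 1) \,\lceil N/M \rceil}_{\text{steps 3 and 4}} = (4 \log M - 2) \,\lceil N/M \rceil
$$

$$
\mathrm{ELBS}_{uni} = \frac{N \times T_{task}}{\lceil N/M \rceil \times T_{task} + 4 \times T_{setup} + (4 \times \log M - 2) \times \lceil N/M \rceil \times T_c}
$$

Under the same reading, the bidirectional case has two steps and a total volume of (2 log M - 1) × ⌈N/M⌉, which matches the denominator of Equation 8.2.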
We make the following assumptions about the capabilities of the processors of a
hypercube [SaEr87]: T_task is equal to 1190 μs, T_setup is equal to 1150 μs, and T_c is equal to
10 μs per word. By using the 2-way stripes partition mapping and the greedy assignment mapping,
the speedups of all the test samples on n-cubes are shown in Tables 1, 2, and 3 for n = 3,
4, and 5, respectively. The following conclusions can be drawn from Tables 1, 2, and 3.
1) : Both mappings give excellent performance. The estimated speedups of these
mappings are near optimal (as given by the EUBS) for most cases.
2) : The greedy assignment mapping, in general, can produce a good mapping at a low
computation cost. This method is not restricted to hypercubes and can be applied to a wide
variety of parallel architectures. It fails to preserve the neighbor mapping for sample 5 on
the 4-cube and the 5-cube. Every node in sample 5 has the same number of adjacent nodes,
so it is difficult for this algorithm to determine which node is the best one to assign next.
3) : For the cases where the LBC is achieved, the speedups for the 2-way stripes partition
mapping are better than those for the greedy assignment mapping. The reason is that, with
algorithm 2_way_stripes_partition_mapping, most of the nodes assigned to the same processor
are connected, which produces a smaller communication cost than that of the greedy
assignment mapping.
4) : In Table 3, although the LBC is not achieved for sample 6 by the 2-way stripes
partition mapping, the speedup for the 2-way stripes partition mapping is greater than the
value of the ELBS when the unidirectional communication model is used. This is because
the number of steps, denoted by S, needed to finish the data communication among processors
in the unidirectional communication model is greater than 1 and less than 5, i.e., 2 ≤ S ≤ 4. If the
maximal computational load assigned to processors by the 2-way stripes partition mapping
is equal to ⌈N/M⌉ + 1 and S is equal to 3, then, according to Equation 9.2, it is possible that the
speedup for the 2-way stripes partition mapping is greater than the value of the ELBS.
6. CONCLUSIONS 
We proposed two mapping approaches, the 2-way stripes partition mapping and the
greedy assignment mapping, to map FEGs onto hypercubes. The 2-way stripes partition
mapping uses the stripes partition and BRGCs allocation to achieve the MCCC and uses the
load transfer heuristic to achieve the LBC. The greedy assignment mapping uses a greedy
heuristic to achieve both the MCCC and the LBC. The cost models of mapping a FEG onto a
hypercube are developed for the bidirectional communication model and the unidirectional
communication model. Four 2-D and three 3-D FEGs are used as the test samples. To
measure the mapping results, the EUBS and ELBS are derived for both communication
models. The simulation results show that the speedups for the 2-way stripes partition
mapping are better than those for the greedy assignment mapping when the LBC is achieved.



























        Unidirectional communication model                Bidirectional communication model
Sample  EUBS    ELBS    Greedy   2-way stripes partition  EUBS    ELBS    Greedy   2-way stripes partition
1       6.42    5.10    5.19*    6.36*                    7.12    6.13    6.32*    7.08*
2       5.74    4.31    3.92     4.44*                    6.69    5.47    4.94     5.n*
3       7.28    6.26    6.09     7.20*                    7.63    6.97    6.78     7.57*
4       7.66    6.89    7.00*    7.61*                    7.77    7.34    7.44*    7.75*
5       7.28    6.26    6.07     6.57*                    7.63    6.97    6.77     7.20*
6       7.56    6.74    6.80*    7.18*                    7.73    7.25    7.33*    7.52*
7       7.34    6.39    6.48*    7.23*                    7.62    6.30    7.12*    7.29*
Table 1 : The speedups of mapping FEGs onto 3-cubes.

        Unidirectional communication model                Bidirectional communication model
Sample  EUBS    ELBS    Greedy   2-way stripes partition  EUBS    ELBS    Greedy   2-way stripes partition
1       10.73   7.68    6.97     7.93*                    12.84   10.10   9.00     10.61*
2       8.05    5.54    5.68*    5.70*                    10.04   7.58    7.97*    7.96*
3       13.37   10.64   10.14    11.23*                   14.57   12.61   12.01    13.17*
4       14.87   12.74   13.31*   14.74*                   15.31   14.03   14.43*   15.24*
5       13.37   10.64   -        11.13*                   14.57   12.61   -        13.06*
6       14.19   11.95   12.46*   12.79*                   14.79   13.39   13.77*   13.99*
7       13.23   10.76   11.19*   11.47*                   14.16   12.48   12.90*   13.06*
Table 2 : The speedups of mapping FEGs onto 4-cubes.

        Unidirectional communication model                Bidirectional communication model
Sample  EUBS    ELBS    Greedy   2-way stripes partition  EUBS    ELBS    Greedy
1       16.14   10.38   9.02     9.06                     21.45   15.05   12.65
2       10.08   6.49    6.65*    6.66*                    13.41   9.41    9.94*
3       22.97   16.63   14.21    14.21                    26.74   21.39   17.44
4       28.11   22.66   24.15*   24.75*                   29.74   26.15   27.30*
5       22.97   16.63   -        11.97                    26.74   21.39   -
6       26.22   20.57   21.61*   20.65                    28.37   24.40   25.43*
7       22.08   16.60   17.41*   17.57*                   24.80   20.56   21.46*
Table 3 : The speedups of mapping FEGs onto 5-cubes.

* : The LBC is achieved in this case.





























Figure 7 : The test samples. 
References :
[BeBo87] M.J. Berger and S.H. Bokhari, "A Partitioning Strategy for Nonuniform Problems
on Multiprocessors," IEEE Trans. on Computers, Vol. C-36, No. 5, pp. 570-580,
1987.
[BhAg84] L.N. Bhuyan and D.P. Agrawal, "Generalized Hypercube and Hyperchannel
Structures for a Computer Network," IEEE Trans. on Computers, Vol. C-33, pp.
323-333, 1984.
[Bokh81] S.H. Bokhari, "On the Mapping Problem," IEEE Trans. on Computers, Vol. C-30,
pp. 207-214, 1981.
[ChSa86] T.F. Chan and Y. Saad, "Multigrid Algorithms on the Hypercube
Multiprocessors," IEEE Trans. on Computers, Vol. C-35, pp. 969-977, 1986.
[GaJo79] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory
of NP-Completeness. San Francisco, CA: Freeman, 1979.
[HaMu86] J. Hayes and T. Mudge, "Architecture of a Hypercube Supercomputer," Proc. of
Int'l Conference on Parallel Processing, pp. 653-660, 1986.
[Jord78] H. Jordan, "A Special Purpose Architecture for Finite Element Analysis," Int'l
Conference on Parallel Processing, pp. 263-266, 1978.
[LaPi83] L. Lapidus and G.F. Pinder, Numerical Solution of Partial Differential Equations
in Science and Engineering. New York: Wiley, 1983.
[Peas77] M.C. Pease, "The Indirect Binary n-cube Multiprocessor Array," IEEE Trans.
on Computers, Vol. C-26, pp. 458-473, 1977.
[SaEr87] P. Sadayappan and F. Ercal, "Nearest-Neighbor Mapping of Finite Element
Graphs onto Processor Meshes," IEEE Trans. on Computers, Vol. C-36, No. 12,
pp. 1408-1424, 1987.
[Seit85] C.L. Seitz, "The Cosmic Cube," Communications of the ACM, Vol. 28, pp. 22-33,
1985.
