Local uniform mesh refinement on loosely-coupled parallel processors  by Gropp, W.D.
Comput. Math. Applic. Vol. 15, No. 5, pp. 375-387, 1988 0097-4943/88 $3.00+0.00 
Printed in Great Britain. All fights reserved Copyright © 1988 Pergamon Press pie 
LOCAL UNIFORM MESH REF INEMENT ON 
LOOSELY-COUPLED PARALLEL  PROCESSORSt  
W. D. GRoPP 
Research Center for Scientific Computation, P.O. Box 2158 Yale Station, Yale University, New Haven, 
CT 06520, U.S.A. 
(Received 16 September 1987) 
Communicated by M. H. Schultz 
Abstract--Multiprocessor ystems offer large gains in performance tfaigorithms for real problems can be 
found. We show how one algorithm for solving time dependent partial differential equations, local 
uniform mesh refinement, can be implemented on a multiprocessor system. Care is taken to insure that 
communications costs are kept under control, and an estimate of the performance of this algorithm for 
a range of configurations is presented. Experiments on a multiprocessor system are compared with the 
theory. 
1. INTRODUCTION 
Local uniform mesh refinement (LUMR) is a powerful technique for the solution of partial 
differential equations. It is basically a strategy for the placement of uniform grids on a coarsest 
mesh which reduces the amount of work by attempting to equidistribute the error committed 
during the calculation. The use of uniform grids allows the computation on each of the refined 
grids to be done efficiently, particularly on modem vector and parallel computers. In this paper, 
we discuss the issues in implementing LUMR on a loosely coupled system of parallel processors. 
There has been much discussion of ways to implement numerical algorithms on parallel 
computers, but many of these algorithms fail to seriously consider communications costs. To see 
the impact of communication costs, consider a regular mesh of points, with each point separated 
by a distance of one unit. Assign to each processor a square with sides of length L. If we assume 
that advancing the solution at each point requires only information from neighboring points, then 
the communication is proportional to 4L while the computation is proportional to L 2. If L = 1 
(one processor per point), and if communication is as expensive as calculation, then more time is 
spent communicating with adjacent nodes than in computation. As L increases, the penalty in 
communication decreases. (See[l] for more discussion of the effects of communication on 
algorithms.) We will take communication times into account and show that LUMR addresses this 
problem by essentially keeping L from getting too small. Specifically, we will answer the questions: 
where are the bottlenecks, how should the processors be divided up, and what is the parallel 
efficiency of LUMR? We present experiments on one kind of parallel processor and compare them 
to our experimental results. 
2. ASSUMPTIONS 
Communication between processors i not free. Particularly on high performance systems, we 
should expect communication to be an important cost in any computation. We will consider a 
parallel processing system which potentially has a large number of processors, with "large" here 
being hundreds to thousands. Each processor is capable of solving the PDE on a rectangle and 
of building the necessary data structures (see Section 3). The processors are connected together 
either by a common bus or local area network (such as the Apollo ring network), or in a linear 
array (such as the ICAP [2]). 
?This work was supported in part by Office of Naval Research Contract No. N00014-82-K-0184, National Science 
Foundation Grant MCS-8106181 and Air Force Office of Scientific Research Contract AFOSR-84-0360. 
375 
376 W.D. GRopp 
Bus Linear A~sy 
Fig. 1. Architectures under consideration. 
In the bus architecture, we will assume that only one processor can access the bus at a time. 
The bus has a constant bandwidth independent of the number of processors, o as the number of 
processors accessing the bus goes up, so does the amount of time to complete a transaction over 
the bus. In the linear array, each processor can talk to each of its neighbors imultaneously. The 
time to complete a transaction over the array is independent of the number of processors (the 
bandwidth scales with the processors). We will see that there are significant differences between 
these two kinds of architectures. 
3. SER IAL  LUMR ALGORITHM 
We briefly describe the algorithm for LUMR in two space dimensions on a single processor; a 
more complete discussion of some variants of LUMR may be found in [3-6]. We will also discuss 
only a single level of refinement, as multiple levels do not affect the estimates we make here. 
The key points of the algorithm are the following: the basic object is a grid, rather than a single 
point. The refinement is in terms of oveflayed grids, rather than inserted points. And the approach 
is algebraically equivalent to a pointwise refinement scheme which refines the same points. In 
essence, LUMR is a way to trade off a little additional work in terms of additional refinement for 
a simpler data structure and significantly less data structure overhead. 
The coarse grid is made up of a union of rectangles; each rectangle has a discrete, com- 
putationally uniform grid. If two rectangles are connected, they must overlap (this is an 
implementation requirement) rather than simply abut. This overlap is usually only one space step 
in width. Every R~ (R~ for ReGridding) time steps on the coarse grid, a new refinement also 
consisting of a union of rectangles i chosen. These new grids are refined by a factor R, that is, 
the space and time step sizes on the refined grids are reduced by a factor of R from those on the 
coarse grid. Each grid in the refinement is overlayed on top of the coarse grid, rather than inserted 
into the coarse grid, in order to preserve the uniformity of the coarse grid. This refined grid is "lined 
up" with its coarse parent; that is, it is not rotated with respect to its parent as in Berger and Oliger 
[3]. While the overlaying of the grids simplifies the computation on both the fine and coarse grids, 
it does introduce a number of minor but important complications. First, some points in the domain 
are overlayed by both a coarse grid point and a fine grid point. This means that we must inject 
the values computed on the fine grid into the coarse grid wherever a coarse grid point lies inside 
the refinement. Second, since the fine grid is not patched into the coarse grid, we must do something 
special at coarse/fine and fine/fine grid boundaries. At coarse/fine grid boundaries, we use 
interpolation from values computed on the coarse grid to provide values for the fine grid. At 
fine/fine grid boundaries, we can take the values needed from the appropriate fine grid (the grids 
on a level overlap one another to maintain an accurate solution across a fine/fine boundary). In 
essence, this makes the fine/fine grid boundary a computational rtifact, since the exact same 
computations (inthe absence of round-off error) would be made if the fine/fine grid boundary were 
not present. This fact will be important in the parallel algorithm, since we will take advantage of
it in the partition of work among the processors. Both of these steps are necessary inorder to insure 
an accurate solution. The first is required to provide consistent data for the fine grid. The second 
is necessary to preserve the accuracy on the fine grid (if the values were taken from the coarse grid, 
the accuracy would be reduced to that of the original coarse grid computation). Thus, the fine and 
coarse grids must communicate with each other frequently. The algorithm isshown in Algorithm 1. 
The potential bottlenecks in a parallel version of Algorithm 1 lies in the steps which need to 
communicate b tween processors. These are the regridding step 2, the handling of the boundaries 
Local uniform mesh refinement on loosely-coupled parallel processors 
Algorithm 1. Algorithm for serial LUMR. The costs are described in Section 4
377 
timestep ,.-O. 
Repeat until done: 
1. Integrate the coarse grid. 
Total cost: C~n c(data structure cost is negligible). 
2. Regrid if timestep mod R~ = O. 
2.1. Decide where to place the refinements. 
2.2. Initialize the new fine grids with the data in either the old fine grids or in the coarse (parent) grid. (Must use 
the old fine grids if possible.) Total cost: negligible. We assume a fast, user-defined function determines where 
to refine. 
3. Integrate 
3.1. For R steps: Integrate each fine grid by integrating 
3.1.1. Interior of fine grid 
Total cost: R2n/Cv 
3.1.2. Boundary of fine grid. Fine grids may need to communicate with each other (at overlaps between fine-grids) 
and with the coarse grid (where there is a coarse-fine grid boundary). 
Total cost: negligible compared to interior (Clx/~fR). 
3.2. Update the coarse grid with the new solution on the fine grid. 
timestep ~ timestep + I. 
Total cost: negligible (nf). 
3.1.2, and the update step 3.2. We can reduce each of these bottlenecks by taking advantage of 
the local nature of solutions to hyperbolic PDEs and the local (and uniform) nature of the 
refinements to reduce the amount of interprocessor communication. 
Note that the feedback between the grids means that we can't do the fine and coarse grids 
simultaneously. This suggests that, for explicit methods at least, we partition the algorithm among 
many processors by keeping the fine grids as close to the coarse (parent) grids as possible, with 
closeness determined by the communication network. 
4. COSTS 
We will compare the serial and parallel algorithms, and motivate the choices in the parallel 
algorithm, by estimating the cost of various choices. In order to simplify the discussion, we will 
make certain assumptions about the costs in the algorithm: 
(1) The time to do regridding (step 2.1) and the time to do updates (as in step 2.2 and 3.2) are 
negligible compared to the time to do the integration. This includes the time to set up and 
move the data structures. 
(2) The time to transfer data from one processor to another is non-negligible. 
(3) The size of the data structure is negligible compared to the size of the grids. 
In justification of these assumptions, we note that the first point will be true as long as the process 
for finding the new grids is cheap compared to the amount of work in doing an integration step. 
This will be true if some simple measure of the solution quality, such as a combination of first and 
second derivatures in the error are used. The second point depends of course on the particular 
hardware and configuration; as Fig. 5 shows, this assumption is valid for our experiments. The 
third point depends on the size of the problem and the implementation; it is the goal of local 
uniform mesh refinement, and has been demonstrated for a number of problems. 
With these assumptions, we will estimate the costs with the following parameters: 
C~ = time per point in integration. 
CT = time per point to transfer data from one processor to another. 
Rg = number of time steps on the coarse grid between regriddings. 
R = ratio of refinement between coarse and fine grids. 
n0 = number of points in the coarse grid. We assume that the coarse grid is (roughly) square. 
n/= amount of grid which is refined, in terms of coarse grid points, n,-=fnc, where f  is the 
fraction of the grid which is refined. If the fine grid is (roughly) square, then the actual 
number of fine grid points is roughly R2n/.  
p = number of processors. 
378 W.D.  Ggopp 
We will assume that Cr = a(p)C~, where Ca- is roughly comparable with C~. This is an assumption 
that communication is (possibly much) slower than a single floating point operation. For the ring 
architecture, ~(p) is independent of p; for the bus, ~(p) ~ p. We will write this as ~(p)--~ for 
the ring and g(p)= gp for the bus. 
Rg is independent of the refinement ratio and depends only on the problem. It is roughly the 
minimum size of a refined grid (in problem coordinates) divided by the maximum velocity of 
solution. This minimum size is basically the amount of buffer put around the refined grids to keep 
the part of the solution which needs to be refined from escaping from the fine grid before the next 
regridding step. This minimum size is usually picked as a certain number of steps (Ax or Ay) in 
the fine grid. Since the time step At is refined with the space steps, the number of steps that can 
be taken before a feature could escape from the fine grid it is independent of the refinement ratio 
R. A typical value for Rg is between 4 and 10. The refinement ratio, R, is determined by accuracy 
and work constraints; typical values are 2 and 4. 
The actual number of points in the fine grids is R2n/; in other words, nf is (almost) the number 
of coarse grid points covered by the refinement. We choose to use this form rather than letting 
nf be the number of fine grid points because the ratio f = nflnc is the fraction of the area of the 
entire domain which needs to be refined. This is a constant (independent of the refinement 
parameters) determined by the behavior of the problem and the desired accuracy in the solution. 
In our estimates, we will assume that the amount of work done is roughly the same on each 
processor. We will discuss ways of achieving this in Section 8. 
In addition, we will discard terms which are of the same order as other terms, but have a much 
smaller constant. We do this primarily to simplify the formulas, and in recognition of the fact that 
the experiments do not have the resolution to show these minor terms. 
5. A PARALLEL  LUMR ALGORITHM 
One possible parallel LUMR algorithm is the following: for each rectangle in the coarse grid, 
divide the grid into p (nearly) disjoint regions. These regions correspond to separate regions in the 
physical domain; they overlap by only one space step size in the y-direction. (This is the overlap 
for "connected" grids mentioned in the description of serial LUMR.) Assign each region to a 
separate processor. The exact choice of division will depend on the architecture available. The rule 
to follow is that adjacent regions should be able to communicate quickly with each other. For the 
linear array and a 2-D problem, this suggests trips (see Fig. 2); for a bus a different partitioning 
may be better (see Section 7). In a 3-D problem, slabs would be used instead of strips. We will 
concentrate on the strip partitioning. 
Each processor isresponsible for its assigned region in the domain. That is, a processor integrates 
the solution on its domain, and creates and integrates the solution on any refined grids it finds that 
it needs in its domain. The refined grids are not sent to other processors; by keeping them local 
to the processor which their parent grid lives on, we can significantly reduce data motion. Instead, 
T 
y 
Processor 1
Processor 2 
Processor  p 
x 
Fig. 2. Partitioning of domain for a linear array of p processors. Not shown is an overlap of width Ay 
for each processor and its neighbor. 
Local uniform mesh refinement on loosely-coupled parallel processors 
Algorithm 2. Algorithm for parallel LUMR. Costs are per processor 
379 
timestep ~0. 
Repeat until done: 
1. Integrate the coarse grid. 
1 
Total cost: -C1nc + 2x/~C T. 
n 
2. Regrid if timestep rood R~ = O. 
2.1. Decide where to place the refinements. Each processor computes the data structure on its region; in doing so, 
it must receive from its neighbors the data structure for the fine grids at the region boundary. 
Total cost: negligible (data structure cost is oc l/p, considered negligible since constant small. We assume the use 
of a fast user-defined function to determine where to place the refinements.) 
2.2. Initialize the new fine grids with the data in either the old fine grids or in the coarse (parent) grid. (Must use 
the old fine grids is possible.) 
Total cost: oc l/p, considered negligible compared to the integration step 1. 
3. Integrate 
3.1. For R steps: integrate ach fine grid by integrating. 
3.1.1. Interior of fine grid. Each processor does its own region. 
Total cost: 1R2njCI. 
P 
3.1.2. Boundary of fine grid. Each processor does its own region. In addition, it must get the fine/fine boundades 
from the adjacentprocessors. 
Total cost: 2Rx/nfC x. Consider the computational effort at the boundaries negligible by comparison to 
computation on interior (oc l/p). 
3.2. Update the coarse grid with the new solution on the fine grid. 
timestep ,-- timestep + 1. 
Total cost: ~ l/p, considered negligible. 
each processor shares information along the overlap in the domain with its neighboring processor. 
Thus, the amount  of  data which must be moved is roughly the square root of  the total data, or 
equivalently, the data to be moved is "one-dimensional" while the data being computed is 
"two-dimensional".  This partitioning may change during the course of  the solution process in order 
to distribute the load among the p processors, but we will argue in Section 6 that it should not 
change too fast. The only other data which must be exchanged between processors relates to the 
description of  the grid data structure, and by the design of  LUMR,  this is small relative to the 
solution data. The algorithm is shown in Algorithm 2. 
It is important to note that this algorithm represents a "data structure decomposit ion" o f  
Algorithm 1. The exact same results are computed in the absence of  round-off  error. Thus all results 
which hold for the serial Algorithm 1 hold for its parallel version Algorithm 2. 
6. IN IT IAL IZ ING THE F INE  GRIDS 
In this section we motivate the choice of  partitioning by considering the other obvious 
alternative, partitioning by grid. Partitioning by grid puts a different grid on each processor. Thus, 
a fine grid may be placed in a different processor than the coarse grid it belongs to. This is a natural 
"task" division of  the algorithm. However, this approach does incur additional communicat ion 
costs, as we show below. 
Initializing the fine grids can be expensive if the fine grids are not restricted to lie in disjoint, 
fixed domains. If the fine grids don't lie in a single domain, then there is the additional cost of 
R2n/Cza, where a is the fraction of the fine grid which must be initialized with data from another 
processor. We assume here that any algorithm will attempt o cache data so that a will normally 
be less than 1. 
The ratio of the time to initialize the grid over the time spent integrating the grids is 
R2nfCra  pCra  
380 W.D. G~opv 
With CT/C1 = a (p), this is p~ (p)a /RR r • will typically greater than 1 and p will be larger than RR s 
(possibly much larger), so the time spent in communicating the data for the fine grids can be a 
significant fraction of the total time. As ~ gets larger, this may become a dominant term. This is 
likely for bus oriented parallel processors, as ~(p) oc p. 
This argues for an arrangement where a is small or zero, that is, an algorithm where the new 
fine grids can be initialized from local or mostly local data. Algorithm 2 uses only local data, so 
,t = 0. Another approach would be to choose a regridding algorithm which has the property that 
small changes in the solution will cause small changes in the choice of fine grids. Note that this 
is different from small changes in the area of refinement. Algorithms based on finding a near 
optimal general rectangle fit to the data do not have this property. If the algorithm does insure 
that the actual choice of grids is "continuous", a will be small, because the solution of the PDE 
is continuous in time. It should be noted that this is very hard to do. None of the approaches that 
have been suggested have this property. 
If the partitioning of the domain among the processors needs to change as a result of the 
evolution of the solution, then there will be some overhead associated with that. This can be 
handled by dynamic load balancing (see Section 8). The computation effort involved in imple- 
menting load balancing is small and can safely be ignored. Thg communication time involved is 
also small as will be seen in Section 8. 
7. ESTIMATE OF PERFORMANCE 
In this section we estimate the performance of parallel LUMR against serial LUMR. The cost 
per regridding cycle of serial LUMR is 
T s = Rg(1 .-I- R 3f)ncC I 
while the cost of parallel LUMR is 
1 r,=pr + + 
The speedup s is given by T,/Tp, and is 
1 1 2 (v)(1 + R2v/ f )  
s p ~fn-~(1 + R3f), 
where ~(p) = Cr/C~, and ~(p) = ~ for a linear array and a(p) = ap for a bus. We define 
ft. ) _ 2a(p)(1 + R2x/~) 
This gives 
1 1 
- = - +/~(p) .  (7 .1 )  
s p 
We will use (7.1) in our estimates. /~(p) is clearly roughly proportional to 0t(p), with a 
proportionality "constant" of about 2 / (R~~)  
Some values o fs  and p are of particular interest. One is the value o fp  =Pb where s = 1, i.e. the 
parallel algorithm is just as efficient as the non-parallel algorithm: 
#(p) = # 
(linear array) 
1 1 
- =-+,6  
s p 
1 
Pb= 1 - [3  
/~(p) = ~p 
(bus) 
1 1 -=-+#p 
s p 
~p~-p~+ 1 =0 
1 -- v / l - -  4,8 
pb ffi 2/~ 
Local uniform mesh refinement on loosely-coupled parallel processors 381 
Note that for the bus, when ~ > ~, the algorithm is always less efficient han the non-parallel 
version. For the linear array,//must be less than 1 for the algorithm to be more efficient han the 
non-parallel version. 
Also interesting is the speedup and the maximum speedup. 
d~ 
dp 
ds 
~ 0  dp 
1 
5ma x ~ 
~(p) ffi 
(linear array) 
1 •-+/3  
P 
P 
l - t ip  
1 
m 
(1 +//p)2 
fl ( p ) = ~p 
(bus) 
1 
at p=oo 
1 1 - f f i -+~p 
s p 
P 
1 -t-pp 2 
ds 1 __/~p2 
@ (1 + ~f)2 
ds 
~--pp=O at p=- -  
Sma x ~ ~ 
Note that ~ must be less than 1 for the parallel algorithm to break even, thus the square root makes 
the bus architecture far less effective than the linear array architecture. 
In these cases, the major bottleneck is in step 3.1.2, where the neighboring fine grid values must 
be sent to adjacent processors. By choosing a different partitioning, we can reduce this, though 
at a cost in added complexity for the communication links. For example, if we had a 2-D mesh 
connected array, we could cut up the domain into squares of side 1/x ~.  This would essentially 
divide the term in step 3.1.2 by x/P, which will improve these estimates. For example, for a 
mesh-connected (rather than linear) array, the speedup s =j~/(1 + Px/P), which is unbounded. 
However, the efficiency of processor usage is roughly l/~x/p, and thus goes to 0 as the number 
of processors g_oes to infinity. For the bus architecture~the speedup would still be bounded, with 
123 = 3 • --- / . " reducin the s p/(1 + ~x/P) and a maximum speed of s 3 x/2/// This speedup comes from g
absolute amount of data to be communicated by reducing the ratio of circumference to area for 
each of the domains to the minimum possible. Figures 3 and 4 show the speedup s as a function 
o f / /and p. Figure 3 shows that for a bus the maximum speedup is reached for a small number 
of processors unless // is very small, and that as // increases, the maximum speedup drops 
dramatically. Figure 4 shows that a ring is both faster for the same//(i.e, the same hardware), 
and less sensitive to p. 
It should be noted that this analysis ignores the fact that all the operations here have a minimum 
size. Thus, for a given problem size, as the number of processors increases, there is a limit beyond 
which there is no work for new processors. Further, we have assumed that the communication times 
are proportional to the size of the data being transmitted, which is only approximately true. As 
the amount of data decreases, the communication start-up time or latency becomes important. 
These effects guarantee a bounded speedup regardless of the interconnection structure (bus, linear 
array, or mesh). 
8. LOAD BALANCING 
One assumption we have made is that the load on each processor is the same as the load on 
any other processor. In practice, this will rarely happen. Our approach to dynamic load balancing 
is to adjust the partitioning of the physical domain among the processors. We consider the 
processors to be connected in a linear array from left to right, even if the actual hardware is a bus. 
"Logically right" means the processor to the right in this ordering. The algorithm used takes the 
following steps: 
382 W.D.  GROpe 
Fig. 3. Speedup s as a function of/7 and p for a bus architecture. The minimum value of  ~ is 0.005. 
Fig. 4. Speedup s as a function of/~ and p for a linear array architecture. The minimum value of/7 is 
0.005. 
Local uniform mesh refinement on loosely-coupled parallel processors 383 
w~ = the work on the ith processor. 
wpi = the partial sum of work on processors i = 0 , . . . ,  i - 1. 
w, = average amount of work for each processor. 
For each processor (i = 0 . . . .  ,p - 1) 
(1) Estimate the work being done on this processor (wl). 
(2) Get the partial sum wp~ from the logically left processor. Wpo = 0 for the leftmost processor. 
Form wp~+ ~ = wp~ + w~ and pass wp~+ ~ to the logically right processor. If at the rightmost 
processor (i = p - 1), form wa = wpp/p, where p is the number of processors. Wait for wa from 
the logical right and pass it on to the logical left. 
(3) Compare the estimated work to the logical right (for all the processors) to the desired work 
to the logical right. The actual work is just w, = Wap-  wdp~÷~ and the desired work is 
w d = w~(p - ( i  + 1)). If w, < Wd, then send min(w~, w d - w , )  to the logical right processor. If 
w, > Wd, then request Wr -- Wd work from the logical right. The actual work received may be 
less than requested (i.e. the logical right neighbor does not have that much work to give). 
Thus this algorithm proceeds in two passes up and down the line of processors. The pass from 
processor 0 to p - 1 gathers up the wpg values. The pass from processor p - 1 to 0 passes wa to 
processor i - 1, and redistributes the load among processors i and i + 1, in a way that insures that 
after the load balancing step, the sum y p£1 wy is closer to I:~'_s~ I w~ = (p - 1)Wa. In other words, the 
work to the right is close to the desired average. Since this is true for all i, the work on each 
processor wi after the load balancing step will be closer to the desired work w~. However, since 
there is a definite direction to the algorithm and only neighbor transactions are allowed, it is 
possible that up to p - I steps could be required to distribute the load, assuming that the work 
on each processor doesn't change between load balancing steps. Such an unbalanced load is very 
unlikely, however, and in practice only a few steps are needed to balance the load. As the solution 
develops, the load will change slightly as important solution features move from place to place. 
Since the solution changes lowly, the partitioning of processors among the domain will change 
slowly, and this algorithm can easily keep the load well distributed. 
The cost of this algorithm is roughly CI + 2CT. However, as the global data exchanges require 
the processes to synchronize, this algorithm also costs idle time while every processor waits for the 
slowest processor to catch up. 
In practice, this algorithm worked much better than a purely local algorithm which compared 
the work on a processor to its left and right neighbors. The purely local algorithm had a tendency 
to produce local peaks of work. To see why, consider the situation where processor i has work 
w and processor i + 1 has work ½w. Then processor i might send lw work to i + 1. But the sum 
of work to the right (processors i + 1 . . . . .  p - 1 may be greater than (i - 1)w~. The consequence 
is a peak in the work at some processor between i + 1 and p - 1. This peak can be quite high in 
practice (several times wa). 
However, the algorithm described above does have its disadvantages. Because it requires ome 
global data (the work estimate w~), the processors must wait until every processor is ready. On a 
linear array this does not impose much of a penalty, as the load balancing algorithm will insure 
that most of the processors finish at the same time. For the bus architecture, the situation is more 
complicated. If the loads are not  well balanced, then some processors will be waiting while others 
are computing. Those that are waiting can use the bus to send data to other processors. This 
effectively increases the bandwidth of the bus by overlapping computation (on some processors) 
with data transfers (on others). Load balancing may actually make things worse by reducing or 
eliminating the amount of data transfer which can be overlapped with computation on a bus. 
9. EXPERIMENTAL RESULTS 
To test the above theory, we ran a parallel mesh refinement code on a 14 node Apollo ring. The 
ring consisted of 10 DN300s and 4 DN420s. The Apollo DN300 is a 68010 based workstation with 
no hardware floating point. The Apollo DN420 is a 68000 based workstation with hardware 
floating point and a hard disk. The Apollos are connected by a token passing network; at the user 
384 W.D.  GROPP 
level, processes on different processors communicate hrough mailboxes which live on a disk. The 
DN300s were used for the actual computations, with the DN420s being used to hold the actual 
mailboxes, monitor the computation, and act as partner to the DN300s. There is a fixed I/O 
bandwidth which is independent of the number of processors, so this configuration is similar to 
a bus architecture and will be compared with our results for a bus. 
The actual program took about 2 man-weeks to code, including debugging, starting from a 
working serial algorithm. The short time is a result of two main factors: the refinement strategy, 
by working with multiple grids, gives a natural way to decompose the problem, and the message 
passing paradigm (as compared to shared memory) made debugging easy. 
The sample problem is a variant of the revolving cone problem used in [3] and [7]. The problem 
is: 
with the initial condition 
with 
and where 
ut - yUx + xuy = O, 
1 - 16r, if r <~6, 
u(x,  y, 0) = 0, otherwise, 
Ixl ~< 1, ly] ~< 1, 
I 2 3 2"~ r =((x - i )  + iy  ,. 
The boundary conditions at the inflow boundaries are zero. The solution to this variable coefficient 
problem is given by rigid counter-clockwise rotation of the initial data about he origin with angular 
frequency co = 1. The region of refinement is concentrated around the cone; hence no static 
partitioning will make effective use of the available processors. In our test, we use Lax-Wendroff 
as the difference approximation, with inflow boundaries specified as u = 0 and outflow boundaries 
specified with first order extrapolation. A graphic view of the form of the solution can be seen in 
[3] and [7]. 
The processors were divided up as equal sized strips in y; for the two processor case, the initial 
domains were - 1 ~< y ~< h and 0 ~< y ~< l, where h is the space step size. This extra h is the overlap 
which we required of the coarse grids; see the caption for Fig. 2. The values of the various 
parameters are: 
CT/ C l ,~' 0.16p, 
Rg = 10,  
R=2,  
nc = 51 x 51, 
n:~.. 0.2no (9.1) 
The value for CT/C~ = ~tp was determined by comparing the time for the unrefined grid calculations 
given in Table 1 to 
Tp=~ V/-~ + 2~p =a,p  
A least squares fit to the data for the unrefined calculations (for p = 2, 4, 6, 8, 10) determined 
a~ = 1814 and a2 = 11.5. As can be seen from Fig. 5, the fit is quite good. The value o f t  for these 
parameters i fl = 0.0067p. 
Three sets of runs were made. The first run used no refinement and is a simple test of the parallel 
integration algorithm. The second test used fixed partitions. The third test used dynamic load 
balancing to modify the partitioning in an attempt to more equally distribute the load. Both the 
second and third test computed the same result, which is shown in Fig. 6 for a time = 3.375, which 
is just over one half a revolution of the cone about the origin. In the load balancing computations 
Local uniform mesh refinement on loosely-coupled parallel processors 385 
Table 1. Times for parallel computation: (a) is for no refinement, (b) is for refinement 
without load balancing, and (c) is for refinement with load balancing 
Elapsed time Max CPU t ime Average CPU time 
Processors a b c a b c a b c 
I 1714 3482 NA 1699 3450 NA 1699 3450 NA 
2 912 3025 2468 887 2940 2199 872 2089 1820 
4 527 2149 1625 495 2026 1379 462 1152 1092 
6 393 1770 1385 359 1491 1133 325 815 814 
8 329 1521 1128 290 1328 858 256 656 668 
10 267 1259 1048 222 1083 723 215 555 582 
2000 
1500 
\ 
\ 
\ 
\ 
\ 
\ 
5o0 
o I 
2 
[]  
I I I J 
4 6 8 10 
Processors 
Fig. 5. Comparison between the predicted ( - - - )  and observed (rq) times for a no-refinement calculation. 
Fig. 6. Computed solution by the parallel algorithm at time = 3.375. 
386 W.D.  GROPP 
3000 -\  "',, 
. "",,, 
",0 
~ "\\"'. ".,, . 
\ \ \ ' "  ".. . . . .  
t000 ~ 
0 I I I I I 
2 4 6 8 40 
Processors 
Fig. 7. Timing results for experiments. The dam is from Table l ;  a,b, and c have the same meiu~ngs. The 
solid lines are the total elapsed times, the dotted lines are the maximum CPU times and the dashed lines 
are the average CPU times. 
(test 3, labeled "c"), refinements were present in as many as 5 processors at the same time. During 
all computations, n/remained relatively constant. Also, note that as a feature of LUMR, unneeded 
refinements are removed every R~ steps. The results are presented in Table 1 and graphically in 
Fig, 7. 
The data in column "a" shows the time for a no-refinement calculation on the coarse grid alone. 
This data gives an idea of the efficiency of the implementation in the absence of any adaptivity. 
Indeed, we can see that the ten processor solution takes about 1.6 times as long as the perfect 
speedup. The loss here is due to the communication time. Column "b" shows the times for a 
no-load balancing calculation and column "c" shows the times for a load balancing calculation. 
Comparing the average and maximum CPU times for these calculations, we see that a factor of 
two separates the maximum CPU times for the load balancing and no-load balancing programs. 
3000-- 
2000 
8 
S 
I 
i x . . \  
",.,\\ 
I I I I 
4 G 8 10 
Processors 
Fig. 8. Comparison of  predicted elapsed times with the observed times with load balancing. The solid line 
is the observed time. The dashed line is the predicted time w i th f  = 0.25, and the dotted line is the predicted 
time with f = 0.20. 
Local uniform mesh refinement on loosely-coupled parallel processors 387 
One feature of the algorithm is that the solution is independent of the number of processors (or 
subdivisions) used, and this was observed with our calculations. In all cases, the computed solution 
showed small error, with the mesh refinement solution showing more accuracy, similar to the 
demonstrations in [3]. 
In Fig. 8, we compare our experimental results with our predictions. We see that the estimated 
elapsed time is greater than we observed, but only by about 20%. The discrepancy is probably due 
to a combination of lack of load balancing and to items we have neglected, such as the 
computational time in handling the refinements (regriddings, updatings, data structure transfers). 
They clearly show that 10 Apollo DN300s are about all that could be used efficiently with this 
algorithm, because the interprocessor communication mechanism is too slow, and that our theory 
is an accurate model of the algorithm. 
10. CONCLUSION 
The partitioning of the problem by domain rather than by grid was the key in keeping this 
algorithm efficient. This same approach may also be applied to less regular grids, as long as there 
is some way to solve the problem on a partial domain such as using an explicit method or an 
iterative technique such as the Schwartz alternating procedure. Such a technique may work well 
with implicit integrators and with time independent problems. 
In three dimensions, the analysis is much the same, using slabs instead of slices. Because the size 
of the interface between slabs is proportional to the two-thirds power of the number of points 
(one-half in the 2-D case), the overhead will be more significant. In this case then it is probably 
better to divide the domain into cubes rather than slabs to reduce the overhead. 
It is clear from the results that bus oriented architectures will have trouble once the number of 
processors gets very large. The algorithm we have described was designed to significantly reduce 
the amount of data being moved between processors, yet even at 10 processors, we are almost at 
the minimum of the elapsed time curve (12 is the predicted minimum). Note that these processors 
are very slow;//would be larger for a system using state-of-the-art processors and communication. 
Some of the cost in the p terms was due to excessively general communications software. This 
communications overhead rapidly increases //, and as our theory shows, the speedup is very 
sensitive to changes in t/. On the other hand, our analysis hows that a linear or mesh-connected 
array of processors offers far more potential speed up, even for much larger values of//. 
REFERENCES 
1. D. Gannon and J. Van Rosendale, On the impact of communication complexity in the design of parallel numerical 
algorithms. Technical Report 84-41, ICASE (1984). 
2. E. Clementi et aL, Large scale computations on a scalar, vector, and parallel "supcrcomputer". In Structure and 
Dynamics of Nucleic Acids, Proteins, and Membranes, pp. 403-450. Plenum Press, New York (1986). 
3. M. J. Berger and J. Oliger, Adaptive mesh refinement for hyperbolic partial differential equations. Technical Report 
Manuscript NA-83-02, Stanford University (1983). 
4. J. H. Bolstad, An adaptive finite difference method for hyperbolic systems in one space dimension. Technical Report 
LBL-13287-rev, Lawrence Berkeley Laboratory (1982). 
5. W. D. Gropp, A test of mesh refinement for 2-d scalar hyperbolic problems. SIAM dl scient, statis. Comput. 1/2, 
191-197 (1980). 
6. W. D. Gropp, Local uniform mesh refinement for elliptic partial differential equations. Technical Report 
YALE/DCS/RR-278, Yale University, Department of Computer Science (1983). 
7. D. Gottlieb and S. A. Orszag, Numerical Analysis of Spectral Methods: Theory and Applications. Society for Industrial 
and Applied Mathematics (1977). 
