NASA Contractor Report 178102
ICASE Report No. 86-24
ICASE
STENCILS AND PROBLEM PARTITIONINGS:
THEIR INFLUENCE ON THE PERFORMANCE OF MULTIPLE PROCESSOR
SYSTEMS
Daniel A. Reed
Loyce M. Adams
Merrell L. Patrick
" .... ;
.. -" .. ; ,
, __ t.
,,-'-., "-r"'~ '«"..~~ .... -:.., ... "''''"' ~,,~-.. --::~;-'l~.
, ..... ""'-.,;~
Contract Nos. NAS1-17070, NAS1-18107
May 1986
INSTITUTE FOR COMPUTER APPLICATIONS IN SCIENCE AND ENGINEERING
NASA Langley Research Center, Hampton, Virginia 23665
Operated by the Universities Space Research Association
NASA
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23665

Stencils and Problem Partitionings:
Their Influence on the Performance of Multiple Processor Systems
Daniel A. Reed†
Department of Computer Science
University of Illinois
Urbana, Illinois 61801
Loyce M. Adams†
Department of Applied Mathematics
University of Washington
Seattle, Washington 98195
Merrell L. Patrick†
Department of Computer Science
Duke University
Durham, North Carolina 27706
ABSTRACT
Given a discretization stencil, partitioning the problem domain is an important first
step for the efficient solution of partial differential equations on multiple processor
systems. We derive partitions that minimize interprocessor communication when the
number of processors is known a priori and each domain partition is assigned to a different
processor. Our partitioning technique uses the stencil structure to select appropriate
partition shapes. For square problem domains, we show that non-standard partitions
(e.g., hexagons) are frequently preferable to the standard square partitions for a variety of
commonly used stencils. We conclude with a formalization of the relationship between
partition shape, stencil structure, and architecture, allowing selection of optimal partitions
for a variety of parallel systems.
†Research supported by the National Aeronautics and Space Administration under NASA
Contract Numbers NAS1-17070 and NAS1-18107 and by the U.S. Air Force Office of
Scientific Research under Contract No. AFOSR-76-2881 while the authors were in residence
at ICASE, NASA Langley Research Center, Hampton, VA 23665.
Merrell L. Patrick was also supported in part by NASA Grant Number NAG-1-466.
Daniel A. Reed was also supported in part by NSF Grant Number DCR 84-17948 and
NASA Contract Number NAG-1-613.

1. Introduction
Problem transformation has long been among the most successful solution paradigms. As
an example, consider the solution of elliptic partial differential equations [Orte85]. Given some
planar region R, the classical central difference technique covers the region R with a rectangular
grid and replaces the derivatives at each grid point with central differences. The resulting system
of linear equations is then amenable to solution via a variety of efficient algorithms. This
transformation, from partial differential equation to linear system, makes the solution both
feasible and attractive. Within this framework there remain several alternatives, both in the
choice of discretization stencil (e.g., 5-point or 9-point) and the linear system solver (e.g., direct
or iterative), and the most appropriate choices depend on the problem.
When one considers parallel solution of partial differential equations, an additional
paradigm, problem domain decomposition [Voig85], arises. If multiple processors are to
cooperate, each solving the linear equations on a portion of the grid, the selection of grid
partitions and their assignment to processors are crucial to good performance.
In this paper, we consider the parallel solution of elliptic partial differential equations over a
planar region, using both shared memory and message passing architectures. Historically, only
rectangular partitions of the discretization grid have been assigned to processors, primarily
because the resulting data structures are regular. However, triangles, squares (a special case of
rectangles), and hexagons also tessellate the plane. The effects of these partitions on inter-
processor communication and their relation to the discretization stencil are investigated.
Because partitions like hexagons have a higher area to perimeter ratio than rectangles and
potentially less interpartition communication, there is incentive to investigate their attributes.
Our results show that the efficiency of the parallel solution depends on the partitioning of
the discretization grid, its associated stencil, and the underlying architecture. Observing that the
amounts of required computation and communication are functions of a partition's area and
perimeter, respectively, we compare the performance of a variety of associated stencil/partition
pairs on both message passing and shared memory architectures. However, we begin with a
survey of related work and a formal specification of the problem.
1.1. Related Work
In a study of hypercube performance, Fox and Otto [Fox84] recently noted that the
efficiency of a parallel algorithm is determined not by the amount of communication but by the
ratio of communication to calculation. In their study, they considered the solution of Laplace's
equation over a square region using a 5-point discretization stencil. Their partitioning placed
squares of grid points on each node of the hypercube, using only nearest neighbor
communication. This choice of partitioning has a lower communication to computation ratio
than the natural alternative, partitioning the grid into an equal number of rectangular strips.
Vrsalovic, et al. [Vrsa85] have also considered the solution of Poisson's equation over a
square region using a 5-point discretization stencil. Unlike Fox and Otto, they tested triangular,
square, and hexagonal partitions. Their study used the ratio of processing time to data access
time as one performance metric when comparing the speedup of different partitions on a general
class of multiprocessor systems. Their hypothetical multiprocessor systems were assumed to
have both local memory attached to each processor and global memories accessible via an
interconnection network. Of the three partitions, hexagonal decomposition produced the largest
speedup.
In an experimental study, Saltz, et al. [Salt86] considered solution of the heat equation using
successive over-relaxation (SOR) on an Intel iPSC [Ratt85]. Rectangular strips and squares were
used as grid partitions. They observed that the Intel iPSC's high startup costs for message
transmission often favored decreasing the number of messages sent, even if that meant sending
more bytes of data. Hence, partitions of rectangular strips were often more efficient than square
partitions.
Superficially, these results by Fox and Otto, Vrsalovic, et al., and Saltz et al. seem
mutually contradictory - each favoring different partition shapes. However, these studies
considered only a small portion of the possible parameter space of stencils, partitionings, and
architectures. Moreover, their underlying assumptions differ. This paper presents a formal
method for analyzing stencil/partition/architecture triplets and applies this method to a variety
of these triplets. Section 2 begins by computing the total number of points in a partition versus
the number of points that must be communicated for several common stencils using each of the
rectangular, square, triangular, and hexagonal partitions. In section 3, these results are used to
determine those stencil/partition pairs that maximize the ratio of computation to
communication. Finally, section 4 compares the performance of an algorithm for solving
Laplace's equation over a square region using different stencil/partition pairs on both shared
memory and message passing architectures.
2. Communication Costs for Selected Stencil/Grid Partition Pairs
Elliptic partial differential equations, particularly the Laplace and Poisson equations, have
long been used as test vehicles for new solution algorithms and parallel architectures.
Consequently, our study is based on the following problem formulation.
The Problem: Consider an elliptic partial differential equation with Dirichlet
boundary conditions on some square region R. If R is discretized to
contain N = n2 points, we wish to solve the resulting linear system
using a point Jacobi iterative solver on a parallel processor
containing p processors (PEs), where $p \le N$.
One interesting question immediately arises. Suppose the grid were divided with each
partition placed in a different PE and that each PE used the point Jacobi iterative solution
technique.¹ In this scenario, each PE repeatedly updates its partition of grid points and sends
values associated with its partition boundary to logically adjacent partitions. What partition
structure would maximize the ratio of computation to communication? One immediately
observes that
• computation is a function of a partition's area,
• communication is a function of a partition's perimeter, and
• the portion of the partition's perimeter that must be sent to other partitions is a function of the stencil.
As an example, Figure 2.1 illustrates square partitions with a 5-point stencil. Each partition
communicates with four neighboring partitions, and the amount of data transferred is directly
proportional to the perimeter of the partition. Although convergence checking for an iterative
scheme also involves communication, the amount and cost of this communication are independent
of stencil type and partition shape and will not be considered. (It is interesting to note that the
communication required for the inner products of the conjugate gradient method is also
independent of stencil type and partition shape.)
In the remainder of this section, we analyze the expected amount of data that must be
transferred between partitions, given possible stencil/partition pairs. In a later section, we will
consider the influence of parallel architecture on the choice of a stencil/partition pair.
¹The iterates generated by our parallel Jacobi method are the same as those generated by the sequential Jacobi
method. We also emphasize that our analysis techniques can be applied to other point iterative solvers (e.g., multicolor SOR and conjugate gradient) as well.
[Grid-point diagram: (a) a square partition exchanging boundary values with its four neighbors; (b) the 5-point stencil and the Jacobi update $u_{i,j} = (u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1})/4$]
Figure 2.1 Square partitions with 5-point stencil
2.1. Five Point Stencil
Figure 2.1b shows the 5-point stencil and the equations for the unknowns in Laplace's
equation that arise from the standard centered difference approximation to the partial
derivatives. With an iterative solution of these equations (e.g., point Jacobi), the new value
computed at each grid point depends on the previous values from its north, south, east, and west
grid point neighbors.
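As a concrete illustration of this update rule, the sketch below applies one point-Jacobi sweep with the 5-point stencil to a small grid; the function name and grid values are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def jacobi_sweep_5pt(u):
    """One point-Jacobi sweep with the 5-point stencil: each interior
    point becomes the average of its north, south, east, and west
    neighbors from the previous iterate; the boundary rows and columns
    hold fixed Dirichlet values."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

# Example: an 8 x 8 grid with u = 1 on the boundary, 0 inside.
u = np.zeros((8, 8))
u[0, :] = u[-1, :] = u[:, 0] = u[:, -1] = 1.0
u = jacobi_sweep_5pt(u)
```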
Using this stencil, we now consider the influence of partition shape on inter-PE
communication. To ease comparison, we assume each partition contains n2/p grid points (i.e.,
each PE's computation is proportional to n2/p).
2.1.1. Rectangular Partitions
Suppose the grid of n² points were partitioned into p/r horizontal strips, and each strip were again partitioned into r rectangles; see Figure 2.2a. Assuming all rectangles are of equal size, each contains n²/p grid points, with sides of n/r and nr/p points. As illustrated in Figure 2.2b, the perimeter contains $2(n/r + nr/p) - 4$ grid points, and all are involved in data transfer. However, the four corner points in each rectangle are involved in two data transfers. Therefore, the data transferred from each interior rectangle is $2(n/r + nr/p)$.
To find an optimal value for r, the number of rectangles in each horizontal strip, we need only maximize the ratio of computation to communication in a single PE:
$$F(r) = \max_{1 \le r \le p} \frac{n^2/p}{2\left(n/r + nr/p\right)} = \max_{1 \le r \le p} \frac{nr}{2(p + r^2)}.$$
" n ,_
• " " p//r
. _ _ _ _ _ d, _ nr points
• • • • • _- p
n ? ? T ? ?
• • • 2
n
.... 1 -- points
r
1 2 r
(a) (b)
Figure 2.2 Rectangular partitions with 5-point stencil
Differentiating and setting the derivative equal to zero, we obtain $p = r^2$, or $r = \sqrt{p}$, as the optimal value of r. Therefore, squares are the optimal rectangular partitioning for the 5-point stencil, with a communicating perimeter of $4n/\sqrt{p}$. With the 5-point stencil, this result
has a simple geometric interpretation: of all rectangular partitions, the square maximizes the
area/perimeter ratio.
Finally, as an interesting special case, note that if r = 1, the grid of n² points is partitioned into p strips, each containing n²/p points. In this case, there is no communication to the east or west, and 2n − 4 values (n − 2 north and n − 2 south) are communicated from each partition.²
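The optimum r = √p is easy to check numerically. The sketch below (our own helper names) evaluates the ratio F(r) = nr / (2(p + r²)) for all candidate r:

```python
def F(n, p, r):
    """Ratio of computation (n^2/p) to serial communication
    2(n/r + nr/p) for r rectangles per horizontal strip."""
    return (n * r) / (2.0 * (p + r * r))

n, p = 256, 64
best = max(range(1, p + 1), key=lambda r: F(n, p, r))
print(best, F(n, p, best))   # best = 8 = sqrt(p), F = 8.0
```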
2.1.2. Triangular Partitions
To partition an n×n grid into p triangles, we assume $n = 2\sqrt{p}\,l$ and divide the grid into p/2 squares with sides $s = 2\sqrt{2}\,l$. Each of these p/2 squares contains $8l^2$ grid points. Each of the squares is then divided into the two "approximate" triangles shown in Figure 2.3a. Each of the p triangles contains $4l^2$ grid points and has height s and base s − 1.
Now consider the communicating perimeter of the upper triangle in Figure 2.3a, assuming a
5-point stencil. By observation, s values are sent north, s-1 values east, s values south, and 1
value to the west, for a total of 3s. Note that s - 2 of the values transmitted south are used
twice by the receiving triangle. The other triangles are reflections of this case. Because $n = 2\sqrt{p}\,l$ and $s = 2\sqrt{2}\,l$, the total number of values sent from each triangle is $3\sqrt{2}\,n/\sqrt{p}$.
²The four corner points of the partition are fixed boundary values that need not be transmitted.
Figure 2.3 Triangular partitions with 5-point stencil
2.1.3. Hexagonal Partitions
Now consider dividing the n×n grid into p hexagonal partitions. We again assume that
$n = 2\sqrt{p}\,l$, implying each partition has $n^2/p = 4l^2$ grid points. Figure 2.4 shows how this partitioning can be accomplished. Each hexagon has l + 1 grid points at the north and south edges and l grid points on each of the four remaining sides. The number of grid points in the upper or lower half of each hexagon is
$$\sum_{i=1}^{l}\left[(l+1) + 2(i-1)\right] = 2l^2,$$
for a total of $4l^2$ in each hexagon.
9• • • • • l
Q • • • • • •
@ • • • • • •
• • • • • • •
Figure 2.4 Hexagonal partitions with 5-point stencil
As Figure 2.4 shows, l + 1 values must be sent north, l + 1 values south, l northeast, l southeast, l southwest, and l northwest, a total of 6l + 2. Because $l = n/(2\sqrt{p})$, each hexagon must communicate $3n/\sqrt{p} + 2$ values.
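The three communicating perimeters derived in this section are easy to tabulate. A sketch (assuming n = 2√p·l so that all three partitionings are admissible; the helper name is ours):

```python
import math

def comm_5pt(n, p):
    """Values sent from an interior partition per iteration for the
    5-point stencil (sections 2.1.1-2.1.3)."""
    sq = math.sqrt(p)
    return {
        "square":   4 * n / sq,                 # 4n/sqrt(p)
        "triangle": 3 * math.sqrt(2) * n / sq,  # 3*sqrt(2)n/sqrt(p)
        "hexagon":  3 * n / sq + 2,             # 3n/sqrt(p) + 2
    }

print(comm_5pt(256, 64))
# hexagon (98) < square (128) < triangle (~135.8)
```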
2.2. Nine Point Stencil
The 9-point stencil, shown in Figure 2.5, is a higher order finite difference approximation to
the partial derivatives than the 5-point stencil discussed earlier. When using this stencil, the
iteration value computed at each grid point is a function of its north, northeast, east, southeast,
south, southwest, west, and northwest grid point neighbor values. In this section we examine the
amount of inter-PE communication for the same partitions discussed earlier and observe the
change in a partition's communicating perimeter as the stencil changes.
Figure 2.5 9-point star stencil
2.2.1. Rectangular Partitions
Figures 2.2 and 2.5 show that the communicating perimeter of rectangular partitions for
the 9-point stencil is nearly the same as the communicating perimeter for the 5-point stencil.
Only the four corner points of a partition are each involved in an additional communication. As
before, squares are the optimal rectangular partitioning, with a communicating perimeter of $4n/\sqrt{p} + 4$. Because there is no communication to the left or right, rectangular strips (r = 1) have the same communicating perimeter for both the 5-point and 9-point stencils.
2.2.2. Triangular Partitions
The dashed lines between grid points in Figure 2.6 highlight the additional communications
required for triangular partitions when using the 9-point stencil rather than the 5-point stencil.
The solid lines between grid points are the communicating perimeter for the 5-point stencil (3s).
The 9-point stencil requires the following additional communications: 1 to the northeast, 1 to
the southeast, 1 to the northwest, 1 to the southwest, and s-2 to the south. This yields a total
communicating "perimeter" for an interior triangular partition with the 9-point stencil of 4s + 2, or $4\sqrt{2}\,n/\sqrt{p} + 2$. "Perimeter" is perhaps a misnomer here, for the perimeter of points along
the diagonal in Figure 2.6 is "two deep" for the 9-point stencil.
Figure 2.6 Triangular partitions with 9-point star stencil
2.2.3. Hexagonal Partitions
The dashed lines in Figure 2.7 illustrate the additional communications required with
hexagonal partitions when using the 9-point stencil instead of the 5-point stencil. The solid lines
of Figure 2.7 correspond to the communicating perimeter of the 5-point stencil, shown to be
6l + 2 in section 2.1.3. The 9-point stencil requires l communications to each of the northeast, southeast, southwest, and northwest in addition to those for the 5-point stencil. This gives a total communicating perimeter, for interior hexagonal partitions, of
$$10l + 2 = \frac{5n}{\sqrt{p}} + 2,$$
where $l = n/(2\sqrt{p})$.
"0 • • • • " .
"0 @ • @ • • @" •
• • • • • • I • ..
•
Figure 2.7 Hexagonal partitions with 9-point star stencil
Note that the communicating "perimeter" is depth 2 along four of the six edges.
2.3. Other Stencils
Many stencils other than the 5-point and 9-point stencils analyzed above are frequently
used when solving partial differential equations. Figure 2.8 illustrates some of the most common.
For brevity's sake, we do not include the analysis of the communication required for their
associated partitions. However, the results of this analysis are summarized in Table 2. The
interested reader can verify these results by applying the methods discussed earlier to compute
the additional grid points involved in data transfer for each of these stencils.
[Grid-point diagrams of the 7-point, 9-point cross, and 13-point stencils]
Figure 2.8 Frequently used discretization stencils
2.4. Computation/Communication Ratios
Before summarizing the results of the previous section, we introduce the notation shown in
Table 1. Using this notation, Table 2 shows relative amounts of computation and
communication for selected stencil/partition pairs. For simplicity, the effects of boundaries on
communication have been elided. (Recall that n² is the number of grid points, and p is the
number of processors.) Table 2 also includes one quantity not discussed earlier, parallel
communication, the amount of data transfer if partition sides can communicate in parallel. This
parallel communication will later allow us to determine if the optimal stencil/partition changes
when communication to neighboring partitions can be done in parallel.
The entries of most interest in Table 2 are the ratio of computation to communication (R)
and the ratio of computation to parallel communication (PR). Table 3 illustrates the relative
magnitude of these quantities for a square grid containing 256×256 points and a parallel system
with 64 processors.
Table 1 Static scheduling notation

Quantity   Definition
Comp       n²/p, the computational complexity of a stencil/partition pair
Comm       communication complexity of a stencil/partition pair
Pcomm      parallel communication complexity of a stencil/partition pair
R          the ratio Comp/Comm
PR         the ratio Comp/Pcomm
Table 2 Summary of stencil/partition analysis

Partition             5-point       7-point        9-point star   9-point cross   13-point
Rectangular strips
  Comm:               2n            2n             2n             4n              4n
  Pcomm:              n             n              n              2n              2n
  R:                  n/(2p)        n/(2p)         n/(2p)         n/(4p)          n/(4p)
  PR:                 n/p           n/p            n/p            n/(2p)          n/(2p)
Triangle
  Comm:               3√2n/√p       3√2n/√p + 2    4√2n/√p + 2    6√2n/√p         8√2n/√p + 2
  Pcomm:              √2n/√p        2√2n/√p        2√2n/√p        2√2n/√p + 1     2√2n/√p + 1
  R:                  n/(3√2√p)     n/(3√2√p)      n/(4√2√p)      n/(6√2√p)       n/(8√2√p)
  PR:                 n/(√2√p)      n/(2√2√p)      n/(2√2√p)      n/(2√2√p)       n/(2√2√p)
Square
  Comm:               4n/√p         4n/√p + 2      4n/√p + 4      8n/√p           8n/√p
  Pcomm:              n/√p          n/√p           n/√p           2n/√p           2n/√p
  R:                  n/(4√p)       n/(4√p)        n/(4√p)        n/(8√p)         n/(8√p)
  PR:                 n/√p          n/√p           n/√p           n/(2√p)         n/(2√p)
Hexagon
  Comm:               3n/√p + 2     4n/√p + 2      5n/√p + 2      6n/√p + 4       6n/√p + 8
  Pcomm:              n/(2√p) + 1   n/√p           n/√p           n/√p + 2        n/√p + 2
  R:                  n/(3√p)       n/(4√p)        n/(5√p)        n/(6√p)         n/(6√p)
  PR:                 2n/√p         n/√p           n/√p           n/√p            n/√p

NOTE: Comp = n²/p is used in computing R and PR in all cases.
An inspection of Table 3 shows that hexagonal partitions yield the highest ratio of
computation to serial communication, except for the 9-point star stencil, where squares are
better. However, if one assumes the inter-partition communication can be done in parallel (i.e.,
all edges of a partition can be transmitted in parallel), hexagons yield the highest ratio in all
cases. With parallel communication, the improvement obtained with hexagons is even greater
(e.g., $R_{hexagon}/R_{square} = 1.33$ for the 5-point stencil, but $PR_{hexagon}/PR_{square} = 2$).
The patterns in Table 3 suggest there is some formal relation between partitions and
stencils, with certain combinations preferred. In the next section we develop techniques for
selecting optimal partition/stencil combinations.
Table 3 Ratio of computation to communication (n = 256 and p = 64)

Partition Type   5-point   7-point   9-point star   9-point cross   13-point
Rectangle
  R:             2         2         2              1               1
  PR:            4         4         4              2               2
Triangle
  R:             7.5       7.5       5.65           3.75            3.75
  PR:            22.5      11.3      11.3           11.3            11.3
Square
  R:             8         8         8              4               4
  PR:            32        32        32             16              16
Hexagon
  R:             10.66     8         6.4            5.3             5.3
  PR:            64        32        32             32              32
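The R and PR entries of Table 3 follow directly from the Table 2 formulas. As a check, the sketch below regenerates the 5-point square and hexagon entries (our own helper, with the small additive constants elided as in Table 2):

```python
import math

n, p = 256, 64
comp, sq = n * n / p, math.sqrt(p)

# Serial communication (Table 2, constants elided) and ratio R.
comm = {"square": 4 * n / sq, "hexagon": 3 * n / sq}
R = {shape: comp / c for shape, c in comm.items()}

# Parallel communication and ratio PR.
pcomm = {"square": n / sq, "hexagon": n / (2 * sq)}
PR = {shape: comp / c for shape, c in pcomm.items()}

print(R)    # square: 8.0, hexagon: ~10.67
print(PR)   # square: 32.0, hexagon: 64.0
```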
3. Determining Optimal Stencil/Partition Pairs
Using the following definition, a partition can be categorized with respect to a given stencil
by the number of partition perimeters that must be communicated.
Definition: A partition is a k-partition with respect to stencil S if k perimeters are
communicated when stencil S is used.
For example, the square is a 1-partition with respect to the 5-point, 7-point, and 9-point star
stencils but is a 2-partition with respect to the 9-point cross and 13-point stencils. The hexagon
is a 1-partition for the 5-point and a 2-partition with respect to the 9-point cross and 13-point
stencils.
Moreover, the value of k can be a fraction. The hexagon, for example, is a $1\frac{2}{6}$-partition for the 7-point stencil and a $1\frac{4}{6}$-partition with respect to the 9-point star stencil. Why? Because only some sides of the hexagon are involved in multiple data transfers. This categorization of partitions
with respect to stencils provides a ranking mechanism for stencil/partition pairs. Hence, one can
determine those stencils where l-partition hexagons are preferable to k-partition squares.
When communication from a partition to each of its neighboring partitions is done serially,
the communicating perimeter for square k-partitions is nearly $4kn/\sqrt{p}$, and the corresponding ratio of computation to serial communication is $n/(4k\sqrt{p})$. The communicating perimeter for hexagonal l-partitions is approximately $3ln/\sqrt{p}$, and the corresponding ratio of computation to serial communication is $n/(3l\sqrt{p})$. Clearly, an l-partition hexagon yields a higher ratio when
$$\frac{n}{3l\sqrt{p}} > \frac{n}{4k\sqrt{p}},$$
or when
$$k > \frac{3l}{4}. \qquad (3.1)$$
If one adopts parallel rather than serial communication, the communicating perimeter for square k-partitions is, except for a small constant, $kn/\sqrt{p}$, and the ratio of computation to parallel communication is $n/(k\sqrt{p})$. Similarly, the communicating perimeter for hexagonal l-partitions is $ln/(2\sqrt{p})$, and the corresponding ratio of computation to parallel communication is $2n/(l\sqrt{p})$. With parallel communication, l-hexagons are preferable to k-squares when
$$\frac{2n}{l\sqrt{p}} > \frac{n}{k\sqrt{p}},$$
or
$$k > \frac{l}{2}. \qquad (3.2)$$
Using inequalities (3.1) and (3.2), Table 4 shows optimal stencil/partition pairs, based on
the maximum ratio of computation to communication. Table 4 shows that square partitions are
better than hexagons in only one of the 10 cases. Note that the k and l-values for parallel
communication in Table 4 were obtained by rounding the fractional values for serial
communication up to the next largest integer (i.e., a parallel communication of $1\frac{2}{6}$ perimeters requires two transmissions). Based solely on Table 4, hexagonal partitions are superior to square partitions because they minimize the interpartition data transfer.³ Similarly, triangles are clearly
inferior.
³As we shall see, the underlying parallel architecture also influences the choice of partition shape.
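Inequalities (3.1) and (3.2) reduce the comparison to a simple decision rule; a sketch (function name ours):

```python
def better_partition(k, l, parallel=False):
    """Compare a k-partition square with an l-partition hexagon,
    using inequality (3.2) if communication is parallel and
    inequality (3.1) otherwise."""
    threshold = l / 2.0 if parallel else 3.0 * l / 4.0
    if k > threshold:
        return "hexagon"
    if k < threshold:
        return "square"
    return "equal"

# 9-point cross stencil: squares and hexagons are both 2-partitions.
print(better_partition(2, 2))                 # hexagon (2 > 3/2)
print(better_partition(2, 2, parallel=True))  # hexagon (2 > 1)
```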
Table 4 Comparison of Square and Hexagonal Partitions

Stencil         Square k-value       Hexagon l-value      Optimal partition      Optimal partition
                (serial, parallel)   (serial, parallel)   (serial: k > 3l/4)     (parallel: k > l/2)
5-point         (1, 1)               (1, 1)               hexagon (1 > 3/4)      hexagon (1 > 1/2)
7-point         (1, 1)               (1 2/6, 2)           equal (1 = 1)          equal (1 = 1)
9-point star    (1, 1)               (1 4/6, 2)           square (1 < 5/4)       equal (1 = 1)
9-point cross   (2, 2)               (2, 2)               hexagon (2 > 3/2)      hexagon (2 > 1)
13-point        (2, 2)               (2, 2)               hexagon (2 > 3/2)      hexagon (2 > 1)
4. Architecture and the Performance of Stencil/Partition Pairs
Our previous analysis did not include architectural considerations, save for the inclusion of
results for both serial and parallel communication. However, the stencil and grid partition
cannot be divorced from the processor connectivity of a message passing architecture (e.g., square
or hexagonal grid) or the storage schema used in a shared memory multiprocessor. Optimal
performance can be achieved only via judicious selection of a trio: stencil, partitioning, and
architecture.
Deriving expressions for parallel execution times and speed-ups for a
stencil/partition/architecture trio requires a model of execution. Our parallel execution time
model is a variation of one we developed earlier [Reed85] and is similar to the one used by
Vrsalovic, et al. [Vrsa85]. In this model, the parallel iteration time for evaluating one partition
of grid points is
$$t_{cycle}^{processor} = t_{comp} + t_a + t_w,$$
where $t_{comp}$ is the iteration computation time, $t_a$ is the data access/transfer time, and $t_w$ is the waiting/synchronization time.
The computation time $t_{comp}$ depends on the partition size and stencil, and is independent of the architecture except for the time, $T_{fp}$, to execute a floating point operation. Formally, $t_{comp}$ is
$$t_{comp} = E(S)\,\frac{n^2}{p}\,T_{fp},$$
where E(S) is the number of floating point operations required to update the value of a grid point, given a stencil S, $n^2/p$ is the number of grid points in a partition, and $T_{fp}$ is the time for a single floating point operation.
The speedup obtained using parallel iterations is simply
$$S_p = \frac{t_{cycle}^{uniprocessor}}{t_{cycle}^{processor}}, \qquad (4.1)$$
where the single processor iteration time is just $t_{cycle}^{uniprocessor} = E(S)\,n^2\,T_{fp}$.
Specific values for the speedup depend not only on the trio of stencil, partition, and network
chosen, but also on the technology constants (e.g., floating point operation time and packet
transmission time).
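The model can be coded directly from (4.1); a minimal sketch, where the stencil cost E(S) = 5 operations per point and the access and waiting times are illustrative values, not measurements:

```python
def speedup(E, n, p, Tfp, t_a, t_w):
    """S_p of equation (4.1): uniprocessor cycle time divided by the
    parallel cycle time t_comp + t_a + t_w."""
    t_uni = E * n * n * Tfp
    t_comp = E * (n * n / p) * Tfp
    return t_uni / (t_comp + t_a + t_w)

# Illustrative parameters: E(S) = 5, 1 microsecond per operation.
print(speedup(E=5, n=1024, p=64, Tfp=1e-6, t_a=0.01, t_w=0.0))
```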
The other components of the execution time model, $t_a$ and $t_w$, depend on the particular
combination of partitioning, stencil, and architecture and are analyzed below.
4.1. Message Passing Architectures
Among the competing classes of parallel machines, message passing architectures occupy an
important niche. The recent emergence of commercial message passing machines (e.g., the Intel hypercube [Ratt85]) has stimulated great interest in this area.
Each processor in a message passing machine contains a local memory and is connected to a
(necessarily) small number of other processors. Access to data contained in another processor's
memory requires transfer of that data via the interconnection network. Clearly, the performance
of a stencil/partition pair depends heavily on the performance of the interconnection network of
the multiprocessor system. Although a plethora of interconnection networks have been proposed
[Reed83, Witt81], Figure 4.1 shows those networks (meshes) that are directly relevant to iterative
solution of elliptic partial differential equations. Each interconnection network has an associated
"natural" partition (e.g., square partitions on a square mesh).
Consider an interior processor in one of the partition/mesh pairs. During each iteration
(cycle), two groups of data must cross each communications link, one in each direction from
neighboring processors. There are several possible interleavings of computation and remote data
access. These range from a separate request for each communicating "perimeter" grid point
when it is needed to a request for an entire "side" of the communicating "perimeter" of the
partition. These requests can, in turn, be either overlapped or non-overlapped with
computation. Similarly, the hardware support for interprocessor communication must be
specified. A simple hardware design allows only one link connected to each processor to be active
at any time, increasing the data transfer time. With additional hardware, each processor link
can be simultaneously active.
Each combination of data access patterns and hardware design alternatives leads to an
implementation with different performance characteristics. Rather than cursorily examine a wide
variety of alternatives, we have chosen to examine a smaller set in detail. Specifically, we assume
• communication links are half-duplex (i.e., data can flow along links in only one direction at
a time) and
• processors request and wait for all perimeter values before starting computation.
[Figure 4.1 Selected interconnection networks]

Currently, these assumptions correspond to all commercial hypercube implementations [Ratt85].
Whether the communication is serial or parallel, some processor $P_i$ in the interior of the network will need data from another processor $P_j$ that is some number of links $l_{ij}$ away. (See Table 5 for notation.) The amount of data to be transmitted, $d_{ij}(S, P)$, depends on both the stencil S and the grid partitioning P. Ignoring synchronization and queueing delays, the time to transmit data from $P_i$ to $P_j$, crossing $l_{ij}$ links, is
Table 5 Execution time model notation

Quantity        Definition
d_ij(S,P)       amount of data sent from i to j
l_ij            number of links between i and j
P               partition
P_i             processor i
Ps              packet size
S               discretization stencil
S_p             speedup
T_fp            time for a single floating point operation
t_a^parallel    parallel access time
t_a^serial      serial access time
t_comm          time to send a packet across one communication link
t_cycle         time for one iteration
t_forward       time (possibly zero) to interrupt an intermediate processor and forward a message
t_send(i,j)     data transmission time from processor i to j
t_startup       overhead for preparing a communication
t_w^serial      serial waiting time
$$t_{send}(i,j) = t_{startup} + \left\lceil \frac{d_{ij}(S,P)}{Ps} \right\rceil l_{ij}\, t_{comm} + (l_{ij} - 1)\, t_{forward}, \qquad (4.2)$$
where $t_{startup}$ is the fixed overhead for sending data, $t_{comm}$ is the packet transmission time, and $t_{forward}$ is the message forwarding overhead incurred at intermediate processors. The ceiling function reflects the redundant communication due to the fixed packet size Ps.
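Equation (4.2) translates directly into code; a sketch in the notation of Table 5:

```python
import math

def t_send(d_ij, l_ij, Ps, t_startup, t_comm, t_forward):
    """Transmission time of equation (4.2): startup overhead, one
    packet time per link for each of ceil(d_ij/Ps) packets, and
    forwarding overhead at each of the l_ij - 1 intermediate nodes."""
    packets = math.ceil(d_ij / Ps)
    return t_startup + packets * l_ij * t_comm + (l_ij - 1) * t_forward

# Illustrative: 4 bytes to an adjacent processor with 1K byte packets
# and an iPSC-like 6 ms packet time still costs one full packet.
print(t_send(4, 1, 1024, 0.0, 6e-3, 0.0))   # 0.006
```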
In general, data destined for other processors will encounter queueing delays, both at their
origin and at intermediate nodes. The latter is expected, but the former is counter-intuitive. As
an example of this phenomenon, consider the mapping of hexagonal partitions onto either a
square or hexagonal mesh. On a square mesh, data from the six sides of the hexagons must exit
via only four connecting links. Even with all links simultaneously active, some data will be
delayed.
With hexagonal partitions on a hexagonal mesh, each partition edge is directly connected to
its six neighboring partitions. However, each pair of neighbors must exchange data. Thus, two
transmission delays are needed on each of the six links before exchanges are complete. If all links
can simultaneously be active, two transmission delays will suffice to exchange boundary values.
Conversely, twelve delays will be needed if only one link per processor can be active at any time.
There are two general approaches to managing the interpartition communication problem.
The first relegates management of message passing and the associated queueing of messages for
available links to system software residing in each processor. With this approach, each partition
simply passes the data to be delivered to other partitions to the system software. No
consideration is given to the pattern of communication in time. As an example, each partition
might successively send boundary values on each of its links, then await receipt of boundary
values from neighboring partitions. Although this approach is attractive from a programming
standpoint, it hides the performance issues and may lead to increased contention for
communication links.
The second approach requires programming the exchange of partition boundaries in a series
of phases, each phase corresponding to a particular pattern of communication. In the example of
hexagonal partitions on a hexagonal mesh, discussed above, the communication pattern of
neighboring partitions would be alternating sends and receives. Sender and receiver would
cooperate, each expecting the action of the other. This pseudo-SIMD mode of communication
leads to regular communication patterns with minimal delays. Application of this approach is
the subject of the next section.
4.2. Message Passing Analysis
Because the range of partition and network possibilities is so large, we have opted to
present only the analysis of the 5-point stencil with square and hexagonal partitions on square
and hexagonal interconnection networks. The triangular partitions were omitted because, as a
cursory examination of Figure 2.3 shows, they require data transmission to four adjacent
partitions. Because square partitions also transmit to only four adjacent partitions and have a
higher ratio of computation to communication, they are always preferable to triangles. The
analysis for 9-point stencils is similar to that presented below; only the case analysis is more
complex.
When partitions are mapped onto an interconnection network, the processors may permit
communication on only one link or on all. In the following we consider only the serial case;
similar analysis applies to simultaneous communication on all links.
We begin with the simplest case: square partitions on a square interconnection network.
Each partition must exchange $n/\sqrt{p}$ values with each of its four neighbors. Because only one link per processor can be active, we expect the data exchange to require four phases (i.e., time proportional to $4n/\sqrt{p}$). However, this would require all processors to simultaneously send and receive. At any given instant, only half the processors can send; the other half must receive. Hence, eight phases are needed, and the total time for data exchange is
$$t_a^{serial} + t_w^{serial} = 4\,t_{startup} + 8\left\lceil \frac{n}{\sqrt{p}\,Ps} \right\rceil t_{comm}.$$
Four startup costs are needed to initiate message transmissions to neighboring processors.
Because square partitions map directly onto the square mesh, no intermediate node forwarding
costs arise. Because the square mesh can be directly embedded in the hexagonal mesh, the data
exchange delay for square partitions on a hexagonal mesh is identical to that for the square
mesh.⁴
Like square partitions on a square mesh, hexagonal partitions map directly onto a
hexagonal mesh. Recalling that the north and south sides of a hexagon contain $n/(2\sqrt{p}) + 1$ points, and the other four sides contain $n/(2\sqrt{p})$ points each, the data exchange delay is
$$t_a^{serial} + t_w^{serial} = 6\,t_{startup} + 4\left\lceil \frac{n/(2\sqrt{p}) + 1}{Ps} \right\rceil t_{comm} + 8\left\lceil \frac{n}{2\sqrt{p}\,Ps} \right\rceil t_{comm}.$$
The first ceiling term corresponds to the north/south exchange and requires four phases.
Similarly, the second term represents the exchange of data along the four diagonal connections
and requires eight phases.
Finally, hexagonal partitions can also be mapped onto a square mesh. Unlike the other
mappings, this one requires data exchange between non-adjacent processors. In this case, we
assume that rows of hexagons are mapped onto corresponding rows of the square mesh. With
this mapping, north/south connections and half the diagonal connections are realized directly.
The remaining diagonal connections require traversal of two links to "turn the corner" in the
square mesh. Hence, the total communication delay due to data exchange is
$$t_a^{serial} + t_w^{serial} = 6\,t_{startup} + 4\left\lceil \frac{n/(2\sqrt{p}) + 1}{Ps} \right\rceil t_{comm} + 4\left\lceil \frac{n}{2\sqrt{p}\,Ps} \right\rceil t_{comm} + 8\left\lceil \frac{n}{2\sqrt{p}\,Ps} \right\rceil t_{comm} + 4\,t_{forward}.$$
⁴This is only true for the 5-point stencil. With the 9-point and other stencils, the distinction between square and hexagonal meshes is important.
The first ceiling term corresponds to the north/south connections and the second to the directly
connected diagonals, each with four phases. The third term represents indirectly connected
diagonals, requiring eight phases. Half these phases require forwarding through intermediate
nodes, hence the four forwarding costs.
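The three serial-link exchange delays derived above can be compared numerically. A sketch (our own function; zero startup and forwarding costs and 16 byte packets, as assumed for Figure 4.3):

```python
import math

def exchange_delay(n, p, Ps, t_s, t_c, t_f, mapping):
    """Serial-link data exchange delays of section 4.2 for the
    5-point stencil; `mapping` names a partition/mesh pair."""
    side = n / math.sqrt(p)            # square partition side
    half = n / (2 * math.sqrt(p))      # hexagon side length l
    c = lambda d: math.ceil(d / Ps)    # packets per exchanged side
    if mapping == "square/square":     # identical on a hexagonal mesh
        return 4 * t_s + 8 * c(side) * t_c
    if mapping == "hexagon/hexagon":
        return 6 * t_s + 4 * c(half + 1) * t_c + 8 * c(half) * t_c
    if mapping == "hexagon/square":
        return (6 * t_s + 4 * c(half + 1) * t_c
                + 4 * c(half) * t_c + 8 * c(half) * t_c + 4 * t_f)
    raise ValueError(mapping)

for m in ("square/square", "hexagon/hexagon", "hexagon/square"):
    print(m, exchange_delay(1024, 64, 16, 0.0, 9.375e-5, 0.0, m))
```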
As noted earlier, similar analysis can be applied to other meshes and stencils. Table 6
shows the number of other partitions with which each partition must communicate (i.e., the
number of transmission startups). In addition, transmission delays are shown as a sum of terms.
Each term is a product of the amount of data exchanged between logically adjacent partitions
and the number of phases necessary to accomplish the exchange. In the table, the potential
effects of packet size on transmission delay are ignored, as are the times for startup and
forwarding. Table 6 suggests that hexagonal partitions are preferable for 5-point stencils, and
square partitions are more appropriate for 9-point stencils, confirming our earlier, mesh
independent analysis. As we shall see, however, both the number of message startups and
amount of data must be considered when estimating the performance of a stencil/partition/mesh
trio.
4.3. An Evaluation of Stencils, Partitions and Meshes
Equation (4.2), the delay to send data, includes parameters for startup, forwarding cost,
packet size, and packet transmission time. Because our primary interest is the effect of
transmission time, we have ignored the effects of startup and forwarding (i.e., we have assumed
those parameters are zero). When evaluating the relative performance of stencil/partition/mesh
trios, we have attempted to use values for packet size and packet transmission time based on
those for commercial message passing machines. For example, the Intel iPSC [Ratt85] sends 1K
byte packets with a measured transmission time of between 6 and 7 milliseconds.
Table 6 Message passing data exchange

Partition  Mesh     Stencil  Communicating Partitions  Phase-Data Transmission Products           O(Expected Delay)
Square     Square   5-point  4                         8(n/√p)                                    8n/√p
Square     Square   9-point  8                         8(n/√p) + 16                               8n/√p
Square     Hexagon  5-point  4                         8(n/√p)                                    8n/√p
Square     Hexagon  9-point  8                         8(n/√p) + 12                               8n/√p
Hexagon    Square   5-point  6                         4(n/(2√p) + 1) + 4(n/(2√p)) + 8(n/(2√p))   8n/√p
Hexagon    Square   9-point  6                         4(n/(2√p) + 1) + 4(n/√p) + 8(n/√p)         14n/√p
Hexagon    Hexagon  5-point  6                         4(n/(2√p) + 1) + 8(n/(2√p))                6n/√p
Hexagon    Hexagon  9-point  6                         4(n/(2√p) + 1) + 8(n/√p)                   10n/√p
Figure 4.2 shows the speedup, obtained using (4.1), of square and hexagonal partitions on
both square and hexagonal meshes, using a 5-point stencil. In the figure, 1K byte packets are
used. We see that square partitions yield significantly larger speedup than hexagonal partitions,
regardless of the underlying mesh. This is counter-intuitive and would seem to contradict Table
6. Careful inspection of (4.2), however, shows that packet size is crucial. The term
$$\left\lceil \frac{d_{ij}(S,P)}{Ps} \right\rceil l_{ij}\, t_{comm}$$
in (4.2) accounts for the discretization overhead caused by packets. If Ps, the packet size, is
large, the number of partitions that must receive data from each partition is much more
important than the total amount of data to be sent. For example, sending 4 bytes to 6 partitions
is much more expensive than sending 6 bytes to 4 partitions if the packet size is 1024 bytes. The
former requires 6 packet transmissions, the latter only 4 transmissions. Square partitions,
because they require communication with only four neighboring partitions, are preferable to
hexagonal partitions with six neighboring partitions, even though more data must be transmitted
with square partitions.

[Figure 4.2 Speedup for 5-point stencil (1024 × 1024 grid with 1024 byte packets): speedup versus number of processors for square partitions, hexagonal partitions on a square mesh, and hexagonal partitions on a hexagonal mesh. Parameters: packet size 1024 bytes; startup 0.0; forwarding 0.0; packet transmission 6×10⁻³ sec; floating point operation 1×10⁻⁶ sec.]
With a 4 byte floating point representation, a 1024X1024 grid, 1024 byte packets, and
square partitions (the assumptions of Figure 4.2), using more than 16 partitions will not decrease
the communication delay because, beyond this point, the total number of packet transmissions
does not change. Instead, the ratio of useful computation to communication begins to degrade.
As the packet size decreases, we would expect the differential in amount of transmitted
data to become more important. For small enough packets, the total amount of data accurately
reflects the delay. Figure 4.3 shows just this result. For smaller 16 byte packets used in the
figure, hexagonal partitions are preferred over square partitions.
Comparing Figures 4.2 and 4.3, we also see the effects of varying the number of processors.
For a small number of processors, the iteration is compute bound. As the processors (and
partitions) increase, the distinction between differing partition shapes becomes apparent. With
1024 processors, only one grid point resides in each partition, and the effects of packet size on
performance are striking.
Figures 4.4 and 4.5 illustrate phenomena similar to those in Figures 4.2 and 4.3. For large
packet sizes, Figure 4.4, hexagonal partitions are preferable to square partitions because the
hexagons communicate with only six other hexagons, rather than eight other squares. However,
the square partitions require less interpartition data transfer. Only when the packet size
becomes small, Figure 4.5, does the potential advantage of square partitions become apparent.
[Figure 4.3 Speedup for 5-point stencil (1024 × 1024 grid with 16 byte packets): speedup versus number of processors for square partitions, hexagonal partitions on a square mesh, and hexagonal partitions on a hexagonal mesh. Parameters: packet size 16 bytes; startup 0.0; forwarding 0.0; packet transmission 9.375×10⁻⁵ sec; floating point operation 1×10⁻⁶ sec.]
Stencil, partition, mesh, and hardware parameters interact in non-intuitive ways. For 5-
point stencils, square partitions, with their smaller number of communicating neighbors, are
appropriate for large packet sizes. Likewise, hexagonal partitions, with smaller interpartition
data transfer, are appropriate for small packet sizes. The reverse is true for 9-point stencils:
hexagonal partitions are appropriate for large packet sizes (even though they are $1\frac{4}{6}$-partitions
for the 9-point stencil), and square partitions are best for small packet sizes. The interaction of
parameters cannot be ignored when considering the performance of an algorithm on a particular
architecture.
Recognizing the interdependence of parameters, Saltz, et al. [Salt86] recently evaluated the
Intel iPSC for solution of the heat equation using Successive Over Relaxation (SOR). They
observed that performance on the iPSC, with its high transmission startup cost and large
packets, varied greatly with the size of the grid and the shape of the grid partitions. For small
grids, horizontal strips, although requiring more interpartition data transfer, were preferable to
square partitions. Only when the grid became large did the advantage of square partitions
become apparent. The reasons are precisely those observed in Figures 4.2 and 4.3: amount of
data versus number of communicating partitions. This validation of our analytic techniques
suggests that they can effectively be used to determine the appropriate combination of partition
shape and size given the architectural parameters of the underlying parallel machine.
4.4. Shared Memory Architectures
Unlike a message passing architecture, where partitions exchange values via explicit
messages, a shared memory implementation stores all partition values that must be exchanged in
global, shared memory. The values associated with all other grid points are kept in memories
local to each processor.
[Figure 4.4 Speedup for 9-point stencil (1024 × 1024 grid with 1024 byte packets): speedup versus number of processors for square and hexagonal partitions on square and hexagonal meshes. Parameters: packet size 1024 bytes; startup 0.0; forwarding 0.0; packet transmission 6×10⁻³ sec; floating point operation 1×10⁻⁶ sec.]

[Figure 4.5 Speedup for 9-point stencil (1024 × 1024 grid with 16 byte packets): speedup versus number of processors for square partitions on square and hexagonal meshes and hexagonal partitions on square and hexagonal meshes. Parameters: packet size 16 bytes; startup 0.0; forwarding 0.0; packet transmission 9.375×10⁻⁵ sec; floating point operation 1×10⁻⁶ sec.]
Just as for message passing, the iteration time for evaluating one partition of grid points is
$$t_{cycle}^{processor} = t_{comp} + t_a + t_w; \qquad (4.3)$$
only the interpretation of the access ($t_a$) and waiting ($t_w$) times differs. With a message passing
architecture, these times depend on the contention for communication links. Analogously, shared
memory delays arise from memory contention. Vrsalovic, et al. [Vrsa85] observed that the
expected waiting time for memory access takes the form
$$t_w = \begin{cases} \left(\left\lceil p/C \right\rceil - 1\right) t_a & \text{synchronous} \\ \max\left\{0,\ \left(\left\lceil p/C \right\rceil - 1\right) t_a - t_{comp}\right\} & \text{asynchronous} \end{cases} \qquad (4.4)$$
where C is the number of processors that can access shared memory simultaneously, and $t_a$ is the
memory access time. The synchronous case, where all processors simultaneously attempt to
access global memory, forces one processor to wait until all others have accessed memory. The
length of this delay depends on the number of simultaneous memory accesses supported. If the
processors operate asynchronously, allowing overlap of computation and memory access, the level
of memory contention is reduced. In the simplest case, a set of global memory modules connected to a shared bus, the number of concurrent memory accesses C is just 1. If a multistage
switching network connects processors and memories, (4.4) can be replaced with a waiting time
function [Krus83]. Whatever the interconnection network, $t_w$ reflects the effects of memory
contention. We will return to this later, but first we consider the expected amount of data
transferred to/from shared memory.
When considering a shared memory implementation of an iteration technique, two choices
arise: local copies of partition boundaries or only global storage. In the first case, each partition
not only retains a copy of its boundary values after writing them to global memory but also
copies into local memory all boundaries needed from other partitions. With only global storage,
the boundary values are accessed in global memory each time they are needed. The performance
of these two implementations differs considerably based on the stencil, partition, and memory
access technique. Hence, we consider local copies and global access for both 5-point and 9-point
stencils with rectangular strips, square, and hexagonal partitions.
For notational convenience, we let $g_a^{local\ copies}$ be the number of global memory accesses for copying boundary values to and from a partition (assuming local copies), $g_a^{no\ copies}$ be the number of global memory accesses (assuming no local copies), $t_g$ be the time to access one value from global memory, and $t_l$ be the processor overhead associated with copying one boundary value.
With this notation, the cycle time for one iteration is
$$t_{cycle}^{local\ copies} = E(S)\,\frac{n^2}{p}\,T_{fp} + g_a^{local\ copies}\,(t_g + t_l) + t_w \qquad (4.5)$$
or
$$t_{cycle}^{no\ copies} = E(S)\,\frac{n^2}{p}\,T_{fp} + g_a^{no\ copies}\,t_g + t_w, \qquad (4.6)$$
where the three terms correspond to those in (4.3). As with message passing implementations, E(S) is the number of floating point operations required to update each grid point given stencil S, p is the number of partitions, and $T_{fp}$ is the time needed for one floating point operation. When local copies are used, some processor overhead may be required to maintain the copies (e.g., copying from system buffers to user memory); this overhead is reflected by $t_l$. Finally, $t_g$ and $t_l$ are hardware parameters; only $g_a$ depends on the choice of local copies or global access. Thus, we concentrate on derivations of $g_a^{local\ copies}$ and $g_a^{no\ copies}$ for selected combinations of stencils and partitions. The results, derived below, are summarized in Table 7.
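Equations (4.5) and (4.6) are equally direct to evaluate; a sketch using the Table 7 counts for square partitions and a 5-point stencil, with an illustrative E(S) = 5:

```python
import math

def t_cycle(E, n, p, Tfp, g_a, t_g, t_l, t_w):
    """Cycle time of equations (4.5)/(4.6): computation, global
    memory traffic, and waiting time.  For the no-copies case, pass
    t_l = 0 and the g_a (no copies) access count."""
    return E * (n * n / p) * Tfp + g_a * (t_g + t_l) + t_w

n, p = 1024, 64
g_local = 8 * n / math.sqrt(p) - 4        # Table 7, local copies
g_global = 24 * (n / math.sqrt(p) - 1)    # Table 7, no local copies
print(t_cycle(5, n, p, 1e-6, g_local, 5e-6, 0.25e-6, 0.0))
print(t_cycle(5, n, p, 1e-6, g_global, 5e-6, 0.0, 0.0))
```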
Table 7 Shared memory data transfers

Stencil         Rectangular Strip   Square            Hexagon
5-point
  g_a^local     4(n − 2)            8n/√p − 4         6n/√p
  g_a^global    12(n − 2)           24(n/√p − 1)      18n/√p − 12
9-point
  g_a^local     4(n − 2)            8n/√p             10n/√p − 4
  g_a^global    20(n − 2)           40n/√p + 8        48n/√p − 44
The values of $g_a^{local\ copies}$ for both the 5-point and 9-point stencils can be easily determined from Figures 2.2, 2.4, and 2.7. For example, Figure 2.2b (with $r = \sqrt{p}$) shows that a square partition reads $n/\sqrt{p}$ global memory values from each of its four neighbors and writes its own four boundaries back to global memory.⁵ The total number of global memory accesses is then
$$4\,\frac{n}{\sqrt{p}} + 4\left(\frac{n}{\sqrt{p}} - 1\right) = \frac{8n}{\sqrt{p}} - 4.$$
The 9-point stencil is similar, requiring four extra boundary values, one from each of the
diagonally adjacent partitions.
⁵The four corner points are written only once.
The situation changes dramatically if no local copies of boundary data are maintained.
Boundary values are often used multiple times during an iteration. With a 5-point stencil and
square partitions, updating a single element on the boundary generally requires access to three
values on the partition's boundary and one access to another partition's boundary. The updated
boundary element must then be rewritten. Hence, five memory accesses are required if no copies
are maintained.
The penalty for not maintaining local copies is even more striking for 9-point stencils.
Figure 4.6 shows the number of global memory accesses for each point in an interior square
partition when the 9-point stencil is used and no local copies are maintained.

Figure 4.6 Global memory accesses for 9-point stencil (square partitions with no local copies)

9 8 7 7 7 7 8 9
8 5 3 3 3 3 5 8
7 3 0 0 0 0 3 7
7 3 0 0 0 0 3 7
7 3 0 0 0 0 3 7
7 3 0 0 0 0 3 7
8 5 3 3 3 3 5 8
9 8 7 7 7 7 8 9

The numbers are
determined by counting boundary grid points needed to update the value at each grid point. If
the grid point lies on the boundary, the count is increased by two: the old value must be read
from global memory and the new value written. Obtaining a general formula for the number of
memory accesses is straightforward, given diagrams such as Figure 4.6. For a 9-point stencil with square partitions, $40n/\sqrt{p} + 8$ memory accesses are needed, a five-fold increase over that when local copies are maintained.
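The counts of Figure 4.6 can be regenerated mechanically. The sketch below is our own reconstruction of the counting rule just described, for an m×m interior square partition:

```python
def access_counts(m):
    """Global memory accesses per point, 9-point stencil, m x m
    interior square partition, no local copies: count every stencil
    neighbor held in global memory (a point on this partition's
    boundary or in a neighboring partition), plus a read and a write
    if the updated point itself lies on the partition boundary."""
    on_boundary = lambda r, c: r in (0, m - 1) or c in (0, m - 1)
    counts = [[0] * m for _ in range(m)]
    for r in range(m):
        for c in range(m):
            total = 2 if on_boundary(r, c) else 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    nr, nc = r + dr, c + dc
                    outside = not (0 <= nr < m and 0 <= nc < m)
                    if outside or on_boundary(nr, nc):
                        total += 1
            counts[r][c] = total
    return counts

for row in access_counts(8):
    print(row)   # reproduces the 8 x 8 array of Figure 4.6
```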
4.5. Shared Memory Analysis
Given an analysis of the memory traffic required for maintaining local copies or always
accessing shared memory, only hardware parameters and some assumption about the underlying
interconnection network are needed to predict performance. As noted earlier, the memory
contention function $t_w$ can reflect a variety of interconnection strategies ranging from a single
global memory bus to a multistage interconnection network. Because the importance of local
copies is most striking when memory contention is severe, we have concentrated on the worst
case: a single bus connecting all global memories and processors.
Figure 4.7 illustrates the speedup obtained for square and hexagonal partitions on a 5-point
stencil with varying numbers of processors and local copies of partition boundaries. In the figure,
access to global memory is assumed to require five times that for a single floating point
operation. The figure confirms the analysis of Tables 6 and 7: hexagons are the preferred
partition type. For small numbers of processors, computation time dominates, and there is little
distinction between the partition types. However, as the number of processors increases, the
smaller interpartition data transfer required by hexagons makes their use attractive. Speedup
increases with the number of processors until the global memory bus becomes a bottleneck; at
that point speedup begins to decline.
[Figure 4.7 Shared memory speedup for 5-point stencil (local copies with grid size 1024 × 1024): speedup versus number of processors for square and hexagonal partitions. Parameters: global memory access time 5×10⁻⁶ sec; local processing overhead 0.25×10⁻⁶ sec; floating point operation 1×10⁻⁶ sec.]

[Figure 4.8 Shared memory speedup for 9-point stencil (local copies with grid size 1024 × 1024): speedup versus number of processors for square and hexagonal partitions. Parameters: global memory access time 5×10⁻⁶ sec; local processing overhead 0.25×10⁻⁶ sec; floating point operation 1×10⁻⁶ sec.]
Figure 4.8 shows a result similar to Figure 4.7, except for 9-point stencils. As Table 7
suggests, square partitions are preferred. A comparison of Figures 4.7 and 4.8 shows that the 9-
point stencil gives larger absolute speedup. The reason is intuitive: the greater computation cost
at each grid point more than offsets the increased communication cost for the 9-point stencil.
Hence, an equal number of processors translates into a greater speedup.
Finally, Figure 4.9 compares maintaining local copies of boundaries to continued access to
global memory. This figure also confirms what Table 7 suggests; local copies are clearly
advantageous. The argument for storing boundaries in local memories is compelling. Without
such copies, the bandwidth of the global bus quickly saturates.

[Figure 4.9 Speedup for 9-point stencil with grid size 1024 × 1024: speedup versus number of processors for squares with local copies, hexagons with local copies, squares without local copies, and hexagons without local copies. Parameters: global memory access time 5×10⁻⁶ sec; local processing overhead 0.25×10⁻⁶ sec; floating point operation 1×10⁻⁶ sec.]
5. Conclusions
The trio of iteration stencil, grid partition shape, and underlying parallel architecture must
be considered together when designing parallel algorithms for solution of elliptic partial
differential equations. Isolated evaluation of one or even two components of the trio is likely to
yield non-optimal algorithms.
We have seen, for example, that an abstract analysis of iteration stencil and partition shape
suggests that hexagonal partitions are best for 5-point stencils, whereas square partitions are
best for 9-point stencils. Further analysis shows that this is only true in a message passing
implementation if small packets are supported. For large packets, the reverse is true (i.e., square
partitions for 5-point stencils and hexagonal partitions for 9-point stencils). Likewise, the type
of interconnection network is crucial. Mapping grid partitions onto a network that does not
directly support the interpartition communication pattern markedly degrades performance.
Finally, when considering shared memory implementation of the iterations, maintaining local
copies of the partition boundaries is imperative. Without local copies, or an extremely fast
interconnection network, the observed speedups are extremely small. Consequently, only a small
number of processors can be used effectively.
In summary, stencil, partition shape, and architecture must be considered in concert when
designing an iterative solution algorithm. They interact in non-intuitive ways and ignoring one
or more of the three almost certainly leads to sub-optimal performance.
References
[Fox84] G.C. Fox and S. W. Otto, "Algorithms for Concurrent Processors," Physics Today,
Vol. 37, pp. 50-59, May 1984.
[Krus83] C. P. Kruskal, "The Performance of Multistage Interconnection Networks for
Multiprocessors," IEEE Transactions on Computers, Vol. C-32, No. 12, pp. 1091-
1098, December 1983.
[Orte85] J. Ortega and R. Voigt, "Solution of Partial Differential Equations on Vector and
Parallel Computers," SIAM Review, Vol. 27, No. 2, pp. 149-240, June 1985.
[Ratt85] J. Rattner, "Concurrent Processing: A New Direction in Scientific Computing,"
Conference Proceedings of the 1985 National Computer Conference, AFIPS Press,
Vol. 54, pp. 157-166, 1985.
[Reed83] D. A. Reed and H. D. Schwetman, "Cost-Performance Bounds on
Multimicrocomputer Networks," IEEE Transactions on Computers, Vol. C-32, No. 1,
pp. 85-93, January 1983.
[Reed85] D.A. Reed and M. L. Patrick, "Parallel, Iterative Solution of Sparse Linear Systems:
Models and Architectures," Parallel Computing, Vol. 2, pp. 45-67, 1985.
[Salt86] J. H. Saltz, V. K. Naik, and D. M. Nicol, "Reduction of the Effects of the
Communication Delays in Scientific Algorithms on Message Passing MIMD
Architectures," ICASE Report 86-4, NASA Langley Research Center, to appear in the
SIAM Journal of Scientific and Statistical Computing.
[Voig85] R.G. Voigt, "Where Are the Parallel Algorithms?" Conference Proceedings of the
1985 National Computer Conference, AFIPS Press, Vol. 54, pp. 329-334, 1985.
[Vrsa85] D. Vrsalovic, E. F. Gehringer, Z. Z. Segall, and D. P. Siewiorek, "The Influence of
Parallel Decomposition Strategies on the Performance of Multiprocessor Systems,"
Proceedings of the 12th Annual International Symposium on Computer Architecture,
ACM Sigarch Newsletter, Vol. 13, No. 3, pp. 396-405, June 1985.
[Witt81] L. D. Wittie, "Communication Structures for Large Multimicrocomputer Systems,"
IEEE Transactions on Computers, Vol. C-30, No. 4, pp. 264-273, April 1981.


