Introduction
A numerical solution to an elliptic partial differential equation (PDE) is usually constructed by modeling the continuous domain of the equation's variables with a grid of discrete points. The partial derivatives are approximated using some differencing scheme, and a linear set of equations is constructed whose unknowns are the values of the solution function at each of the grid points. During an iterative solution of these equations (e.g. point Jacobi) the value at a point is approximated by a function of values at nearby points. The amount of computational work associated with updating an interior grid point is the same throughout the grid. Furthermore, during a single iteration grid points can be updated in parallel. This high degree of regularity and potential parallelism has made the solution of PDEs a very attractive problem area for the application of parallel processing.
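To make the update rule concrete, the following minimal sketch (ours, not from the paper) performs point-Jacobi sweeps for Laplace's equation using the 5-point stencil; note that every interior point costs the same to update, and all updates within a sweep are independent.

```python
import numpy as np

def jacobi_step(u):
    """One point-Jacobi sweep for Laplace's equation: each interior
    point becomes the average of its four neighbors (5-point stencil);
    the outermost rows and columns hold fixed boundary values."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

# Example: an 8x8 grid with the top boundary held at 1.0.
u = np.zeros((8, 8))
u[0, :] = 1.0
for _ in range(100):
    u = jacobi_step(u)
```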
An elliptic PDE problem may be solved in parallel by decomposing the grid into partitions, and mapping partitions to processors. During an iteration a processor updates its grid points, and then exchanges with other processors the information necessary to compute the next iteration. As pointed out in [12], a large number of factors affect the performance of the resulting parallel computation: discretization stencil, partition shape, and parallel architecture. The analysis in [12] quantifies these relationships for a wide variety of stencils, shapes, and architectures. Their work throughout assumes that all processors in a parallel system are employed. This paper uses their framework to determine the largest possible speedup for a given problem, and to consider the behavior of that optimal speedup as a function of problem size when the number of available processors is not limited. These issues are important when we consider that users of large scientific codes will always want to solve a larger problem than the current technology supports. By focusing on the best possible speedup we are better able to assess the suitability of various architectures for scaling up to larger problems, and the effects that various problem parameters and architecture parameters have on that suitability.
We will consider both strip and square partitions; although it is well known that squares have a higher computation-to-communication ratio, situations exist where the use of strips yields better performance than squares [13]. Other authors have employed strips [7] when the number of available processors is not a power of 4 (to avoid this last problem, we show that "nearly square" partitions perform within a few percentage points of true squares).
It is a folk theorem among the parallel scientific processing community that good speedup can be achieved simply by increasing the size of the problem. In fact, our analysis shows that this is indeed true for several different types of architectures, provided that the maximal number of processors is fixed. However, by allowing the number of processors (and supporting communication network) to grow along with the problem size, it becomes clear that some architectures are better suited for large problems than others. Architectures with hypercube or grid communication networks are shown to give optimal speedup linear in the grid size n², while bus-oriented networks are shown to give optimal speedup which increases at best in the cube root of n². The effect of the relationship between fixed communication overhead costs and bus bandwidth is shown to be important. We show that banyan-type switching networks give optimal speedup which is O(n²/log(n)). From these results it is clear that bus networks are unsuited for large numerical problems of the type we consider. While hypercubes give better asymptotic optimal speedup than banyan networks, the true difference for grid sizes used in practice will not depend on the banyan network's log factor, but on the relative speeds of the communication networks.
Previous Work
Partition geometry plays a key role in determining communication costs; consequently, much of the literature related to domain decomposition concerns the partition's geometric shape. Strips, squares, triangles, and hexagons have been considered in [4, 12, 16] on both message-passing and shared memory architectures. Reed, Adams, and Patrick [12] have done a careful analysis of the relationships between discretization stencils, partition shape, parallel architecture, and data structure management.
Their model determines which stencil/partition/architecture trios are best suited for each other. We will introduce their model in the main portion of the paper. Neither the analysis in [12] nor other work concerning partition shapes has explicitly focused on optimizing the number of processors used, or on the behavior of optimal speedup as the problem size increases.
An analytic study of a conjugate gradient algorithm on the Finite Element Machine (FEM) is found in [1]. Their approach to modeling the computation is similar to ours, but is focused entirely on the FEM. The difference between the algorithm they study and the class of algorithms we study leads to different conclusions concerning asymptotic performance.
Other related work uses a more abstract model of a parallel computation. In [6], Indurkya, Stone, and Cheng consider the module assignment problem under the assumptions of random module execution times and random communication patterns. They explicitly set out to determine the optimal number of processors to use. Convenient approximations were made to make the overall execution time more tractable; some of these approximations were removed by Nicol in [9], where it is shown that Indurkya's conclusions are basically sound despite the approximations (all of Indurkya's conclusions hold rigorously if module execution times are constant). The cost function studied in that work was the sum of the execution time and the expected communication overhead. Their somewhat surprising conclusion is that the optimal assignment of modules to processors is extremal: either all modules are assigned to one processor, or the modules are distributed as evenly as possible across all available processors.
The cost model studied by Indurkya et al. and Nicol fails to capture the potential overlap of communication and computation in some architectures. Stone [15] also realized this, and gives a thorough analysis of a number of simple cost and communication models for the module assignment problem.
Several of these models allow situations where adding processors increases execution time, so that the optimal assignment need not be extremal. For computations captured by these models, finding the optimal number of processors becomes an important issue. Stone uses a parallel solution of the Poisson equation to illustrate the relationship of these models to a real problem. His discussion does not treat the relative merits of partition geometries and stencils, although he does consider partitioning domain rows into pieces. A similar abstract view of this problem is given by Cvetanovic [3]. In contrast, our goal in this paper is to show how to optimize the size of a given partition shape for a given PDE on a given architecture. We then use the optimal size to characterize the suitability of the architecture for large numerical problems.
Model Description
A square physical domain is discretized into an n×n grid of points, and constant boundary values are assumed. Depending on the algorithm used, the value at a grid point u_ij is updated according to a discretization stencil. For example, figure 1 shows a 5-point stencil and a higher-order 9-point stencil. The stencil determines how many perimeters of surrounding points a partition must obtain from its neighbors each iteration; to capture this idea, we define k(P, S) to be the number of perimeters communicated by partition P using stencil S.
Some values of k(P, S) are given below. The number of processors used is subject to memory constraints and processor availability constraints. Other constraints concern the partition's shape. Square partitions only admit values of A which are perfect squares, thereby reducing substantially the number of feasible domain decompositions (and hence freedom in choosing the number of processors). Furthermore, it is possible to assign exactly equal work to each processor only if the number of processors divides the number of grid points evenly. We will therefore relax the requirement that each partition have exactly the same number of points, and when using square partitions relax the requirement that partitions be exactly square.
It is easy to decompose the domain into strips for P processors: if n = k·P + r with 0 ≤ r < P, then r processors receive ⌊n/P⌋ + 1 contiguous rows, and the remaining P − r processors each receive ⌊n/P⌋ contiguous rows. As illustrated by figure 4, the number of communicating boundaries is the same as if all the partitions had equal work. Square partitions raise harder problems. We will approximate square partitions with nearly square rectangles which cover the domain in a regular way. The rectangles are arranged in a grid fashion as illustrated in figure 5. The domain is first divided into strips as before, then into rectangles by defining a border every mth column. We require that m divide n evenly, and call these legal rectangles. For tractability, our analysis treats partition execution and communication costs as though the partitions were squares. However, empirical studies described below show that the error introduced by this assumption is small.
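As an illustration (a sketch under our reading of the decomposition rules; strip_rows and nearly_square are hypothetical helper names), the strip and legal-rectangle constructions can be written directly:

```python
def strip_rows(n, P):
    """Rows per processor when an n x n grid is cut into P strips:
    with n = k*P + r, r processors get k+1 rows and the rest get k,
    so workloads differ by at most one row of the grid."""
    k, r = divmod(n, P)
    return [k + 1] * r + [k] * (P - r)

def nearly_square(n, A):
    """A legal rectangle with area close to a target area A: pick the
    column width m as the divisor of n nearest to sqrt(A) (so the
    rectangle is nearly square), then choose a row count from a strip
    cut so the area is as close to A as possible."""
    m = min((d for d in range(1, n + 1) if n % d == 0),
            key=lambda d: abs(d - A ** 0.5))
    rows = min(n, max(1, round(A / m)))
    return rows, m

print(strip_rows(256, 6))        # [43, 43, 43, 43, 42, 42]
print(nearly_square(256, 1638))  # (51, 32): area 1632, within 0.4% of 1638
```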
For a given n it is easy to calculate the area of each legal rectangular partition. Figure 6a shows the relative approximation error in area for a 256×256 grid when we choose the working rectangle with area closest to A; figure 6b shows the relative approximation error in perimeter. A ranges from 1024 to 16384 (every even value of A is plotted), reflecting decompositions using 4 to 64 processors. We see that the error introduced by this approximation is quite small, usually less than 3% for area and less than 6% for perimeter. Similar results were obtained for 128×128, 512×512, and 1024×1024 size grids.
We can consequently optimize partition area as though partitions are exactly square, with the assurance that the costs obtained are not far different from costs that are truly achievable. We next consider this optimization for various architecture types.

Hypercube Architectures

On a hypercube, the cost of sending a message of V floating point numbers between neighboring processors is modeled as ⌈V/packetsize⌉·α + β, where α is the per-packet transmission cost, and β is a startup cost. We assume that the problem size is fixed at n², and that if N processors are used, each partition gets n²/N points. Thus, as we allow N to increase for a fixed value of n², the number of points in a partition decreases, so that both the execution cost (t_ex) and the communication/synchronization cost (t_c) for the partition decrease. This implies that t_cycle as defined in equation (1) is a decreasing function of N over the interval [2, n²]. If only one processor is used then no communication costs are suffered; if the one-processor execution cost is still greater than the two-processor cost, then using all processors is optimal. If the one-processor cost is less than the two-processor cost, but greater than the cost of using all processors, then using all processors is again optimal. The last possibility is that the communication costs are so high that the one-processor cost is less than the cost of using all processors. In this case, using only one processor is optimal. Thus we see that t_cycle is minimized either by spreading the computation out over as many processors as possible, or by placing the whole domain on one processor. If memory limitations prohibit the latter option, then the computation should be spread maximally.
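To make the extremal argument concrete, here is a minimal sketch (our parameter values are invented; the formulas follow the cycle-time model above for square partitions) that evaluates the cycle time at the two candidate extremes:

```python
import math

# Hypothetical parameter values for the hypercube cycle-time model.
E_S, T_fl = 5, 1.0e-6         # ops per point update, time per flop
alpha, beta = 5.0e-6, 1.0e-4  # per-packet cost, message startup cost
packetsize = 16
n = 256                       # grid is n x n

def t_cycle(N):
    """Per-iteration time with N processors (square partitions)."""
    if N == 1:
        return E_S * n * n * T_fl                # no communication at all
    compute = E_S * (n * n / N) * T_fl
    V = 4 * n / math.sqrt(N)                     # boundary volume per partition
    comm = 8 * (math.ceil(V / packetsize) * alpha + beta)
    return compute + comm

# t_cycle is decreasing on [2, n*n], so only the extremes matter:
N_max = 1024
best = min([1, N_max], key=t_cycle)
print(best, t_cycle(1), t_cycle(N_max))
```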
Assume that the grid is spread across all available processors as squares, and consider the effect of increasing both n² and N in such a way that the number of grid points per processor remains constant (say F points per processor) as n² increases. This implies that the optimal cycle time is the constant

C = E(S)·F·T_fl + 8(⌈4√F/packetsize⌉·α + β).

The optimal speedup is then linear in n².
If the number of processors is fixed at N, the cycle time of a processor is then

E(S)·(n²/N)·T_fl + 8(⌈V(n²)/packetsize⌉·α + β),

where V(n²) denotes the volume of a partition's communication: V(n²) = 2n for strips and V(n²) = 4n/√N for squares. (This expression assumes that only one communication port can be active at a time in a processor, and that the communication link is half duplex.) It is easily checked that the speedup for both squares and strips approaches N as n² → ∞.
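Under the same assumed constants as before, a few lines suffice to watch the fixed-N speedup approach N as the grid grows; V(n²) is taken from the definitions above:

```python
import math

def speedup(N, n, E_S=5, T_fl=1e-6, alpha=5e-6, beta=1e-4,
            packetsize=16, shape="square"):
    """Speedup over one processor for fixed N (hypothetical constants)."""
    V = 2 * n if shape == "strip" else 4 * n / math.sqrt(N)
    t1 = E_S * n * n * T_fl
    tN = (E_S * (n * n / N) * T_fl
          + 8 * (math.ceil(V / packetsize) * alpha + beta))
    return t1 / tN

for n in (128, 512, 2048, 8192):
    print(n, round(speedup(64, n, shape="strip"), 1),
             round(speedup(64, n, shape="square"), 1))
```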
The quick analysis above fails to consider the very important activity of convergence checking.
A convergence check requires that every updated grid point value be compared with its last value.
Depending on the convergence criterion employed, another iteration is called for if the updated solution is too "different" from the last estimate. Every partition determines whether its subgrid is converged and produces either a convergence flag, or a number (e.g. the sum of squared update differences over the subgrid) which must be disseminated throughout the entire network. For small stencils like the 5-point, the additional computation required to do a convergence check can be 50% of the grid update computation. Furthermore, communication during the dissemination stage is not local, and the delay due to this stage increases with the number of processors used. Saltz, Naik, and Nicol examine this problem in
[13], and note that the communication cost for convergence checking is extremely high due to message packaging and handling costs. They then give algorithms for scheduling convergence checks; measurements taken on an Intel iPSC show that despite the potentially very high cost of convergence checking, these algorithms reduce that cost to an insignificant amount. For the sizes of hypercubes currently available, we may safely ignore convergence checking costs in hypercubes. A similar analysis applies to grid-connected machines: the computation should be spread as evenly as possible or lumped onto one machine (which makes little sense on the aforementioned machines). This type of machine often provides a global bus, and additional hardware for functions such as convergence checking. Provided that such additional hardware exists, the communication overhead of convergence checking does not appear to be as significant a concern as it is with hypercubes (although the additional computational cost may still be significant).
Adams and Crockett [1] analyze a conjugate gradient code on the FEM. Each iteration of this code requires every processor to send every other processor a number, and a processor adds together all such numbers. Eventually adding more processors to a fixed size problem causes this communication and addition to dominate performance. The result is that increasing the number of processors past a certain threshold increases the algorithm execution time. This highlights the fact that the monotonicity we claim for hypercube and grid machines depends very much on the exclusively nearest-neighbor communication pattern. In the next section we will see that in bus architectures the communication cost can actually decrease with increasing partition size, making for a more interesting optimization problem.
Bus Architectures
Shared memory bus architectures are another important class of commercially available parallel processors. Currently, several vendors offer a few tens of processors on a common bus; we denote the maximum number of processors available by N. We suppose that the architecture supports local memory and global memory, with global memory access being several times slower than local memory access (several of the commercial machines do not support this model; they do support caches which, if sufficiently large, could be viewed as local memory). We will consider both synchronous and asynchronous buses: a synchronous bus requires a processor requesting service to wait until that service is completed; an asynchronous bus admits overlapping computation and data writes to the global memory.
We will see that in both cases contention for global memory via the bus can degrade performance to the point where adding processors increases execution time. Reed et al. [12] also observe that a processor's management of boundary values makes an important difference in performance. Following their advice, we will assume that each processor copies its neighbors' boundary points into local memory at the start of an iteration, and writes its own boundary points out to memory at the iteration's end. In our experience on the FLEX/32 [8], the cost of transferring a word to or from common memory is best modeled (ignoring contention) as c + b, where c is a fixed overhead cost due to address calculation and any overhead for accessing the bus, and b is the bus cycle time. Because all communication is serialized by a bus, the relative importance of different types of communication can be compared by their volume. The cost of communicating convergence checking information on bus architectures is insignificant because it involves only one number from each processor, and is hence ignored here.
Synchronous Bus
We model a synchronous bus and the contention it imposes by assuming that if P processors are simultaneously requesting service, the effective delay seen by each processor is c + bP per floating point number. The transfer time t_c depends on the partition and on P. For strips with area A, each partition has 2n boundary points, and P = n²/A processors simultaneously require bus access. The cycle time for strips is therefore

t_cycle = E(S)·A·T_fl + 4·n·k(strip,S)·(c + b·n²/A).   (2)

The first term of (2) is convex and increasing in A, while the remaining terms are decreasing in A, making (2) the sum of a convex increasing term and a convex decreasing term. Equation (2) is consequently a convex function of A, so that the Â minimizing (2) is easily found using calculus. If the Â so determined falls outside of bounds placed by memory or processor limitations, then either the least or the largest admissible value of A optimizes performance. Â is given by

Â = √(4·n³·b·k(strip,S)/(E(S)·T_fl)).   (3)

(For our problem, this contention assumption yields the same performance as if every processor were able to retain the bus for its entire transmission. This follows since one processor will be last to receive the bus; its effective communication time is c + bP per floating point number. The model also implicitly assumes that available processors which are not participating in the computation do not significantly interfere with bus service.)
It is important to note that Â depends on most of the problem and architectural parameters assumed by the model (the overhead cost c does not affect Â). When Â is not a multiple of n, we calculate A_l = n·⌊Â/n⌋ and A_h = A_l + n. Between these two we choose the area yielding the smaller cycle time; the convexity of (2) ensures that this time is optimal among strips. Substitution of Â into (2) gives the optimized cycle time when arbitrarily many processors are available:

t_cycle(Â) = 4·n·√(n·b·k(strip,S)·E(S)·T_fl) + 4·n·k(strip,S)·c.
Here we see that for sufficiently large n (or sufficiently small c) the computation time and the communication time are essentially identical. This expression also shows what leverage we have in improving performance by improving hardware. For example, suppose that we have optimized performance for one set of architectural parameters, and wish to increase processor or bus speed. If we double the speed of the bus, the minimized cycle time decreases by a factor of 1/√2; the same improvement is achieved by doubling the speed of a floating point operation. Since the original configuration was optimized, these factors bound from above the performance gain we can achieve by doubling processor or bus speed on any subsequent partitioning of the domain. On the other hand, if c is large relative to expected problem sizes, then the overhead cost 4·n·k(strip,S)·c will dominate the communication cost, so that any speed increase in the bus will not significantly improve performance; decreasing c, however, has a linear impact on the optimized cycle time. Now suppose that t_cycle is minimized by Â = n²/P̂ for some 2 ≤ P̂ ≤ N. Then P̂ processors are employed, where
(This expression assumes that the number of boundary points a partition writes to global memory is the same as the number read in. This is not rigorously true for any stencil which uses diagonals: our expression does not count diagonal elements required by the four corner points. However, when the number of partition points is large relative to the number of processors, this approximation is reasonable.)
P̂ is a root of the first-order condition obtained by differentiating (2) with A = n²/P. Substituting this P̂ back into the equation, we find that a necessary condition on P̂ is that c/b ≤ P̂. Recalling that bus architectures typically have fewer than 30 processors, we see that this inequality tightly constrains values of b and c. Measurements taken on the FLEX/32 suggest that c/b = 1000, implying that numerical problems run on that machine should use all processors. Care in allocating processors is apparently needed more when c is less than b. Consequently, we now consider the extreme case of c = 0, and the optimal speedups that are achievable under that assumption. Note that any speedups so derived serve as upper bounds on speedups gained when c ≠ 0.
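The calculus-plus-snapping procedure above is easy to carry out numerically; the sketch below uses invented parameter values, with only the formulas taken from equation (2):

```python
import math

# Hypothetical parameter values for the synchronous-bus strip model,
# equation (2): t(A) = E_S*T_fl*A + 4*n*k*(c + b*n**2/A).
E_S, T_fl = 5, 1e-6
b, c      = 2e-7, 2e-4
k, n      = 1, 256

def t_cycle(A):
    return E_S * T_fl * A + 4 * n * k * (c + b * n * n / A)

# Closed-form optimum from setting the derivative to zero, eq. (3)...
A_hat = math.sqrt(4 * k * b * n ** 3 / (E_S * T_fl))
# ...then snap to a whole number of rows, keeping the better neighbor:
A_lo = n * max(1, math.floor(A_hat / n))
A_best = min(A_lo, A_lo + n, key=t_cycle)
print(A_hat, A_best, round(n * n / A_best))  # last value: processors used
```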
If there is no overhead associated with accessing the bus, the optimal square partition size is easily shown to be

Â = (4·n²·b·k(square,S)/(E(S)·T_fl))^(2/3).
The cycle time using Â points per partition is

t_cycle = (E(S)·T_fl)^(1/3)·(4·n²·b·k(square,S))^(2/3) + 2·(E(S)·T_fl)^(1/3)·(4·n²·b·k(square,S))^(2/3),

which shows that the communication cost is twice the computation cost. This expression also shows that we have more leverage by improving communication speed than computation speed: doubling the speed of the bus gives a cycle time which is 63% of the original; doubling the speed of a floating point computation gives a cycle time which is 79% of the original. As with strips, simple algebra shows that fewer than N processors should be used if inequality (6), the square analogue of (4), holds. Inequalities (4) and (6) show that a strip decomposition of a given problem will always call for fewer (or equal) processors than a square decomposition (provided that k(square,S) = k(strip,S)). The minimal problem size which uses all N processors is found by treating (6) as an equality and solving for n². Figure 7 plots the log (base 2) of the minimal problem size n² which gainfully uses all N processors, as a function of N. For the parameter values considered, we see that a 256×256 grid with square partitions and a 5-point stencil should be solved on 1 to 14 processors; the same grid with a 9-point stencil should use 1 to 22 processors. The higher computation-to-communication ratio of the 9-point stencil allows more parallelism in computation for the same amount of communication.
For sufficiently large n², all N processors should be employed; the speedup achieved again approaches N as n² → ∞.
Comparison of this speedup with the speedup for strips (equation (5) with c = 0) shows the clear superiority of squares using realistic parameter values and large problems.
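Both the leverage ratios quoted above and the superiority of squares can be checked numerically. This is our sketch; the constants are chosen only for illustration, and the closed forms are the c = 0 optima derived above:

```python
def square_cycle(n, ET=5e-6, b=2e-7, k=1):
    """Minimized square cycle time with c = 0: computation plus twice
    as much serialized communication."""
    return 3 * ET ** (1 / 3) * (4 * n * n * b * k) ** (2 / 3)

base = square_cycle(256)
print(square_cycle(256, b=1e-7) / base)     # ~0.63: bus speed doubled
print(square_cycle(256, ET=2.5e-6) / base)  # ~0.79: flop speed doubled

def strip_speedup(n, ET=5e-6, b=2e-7, k=1):
    """Optimal strip speedup with c = 0: sqrt(ET*n/(16*b*k))."""
    return (ET * n / (16 * b * k)) ** 0.5

def square_speedup(n, ET=5e-6, b=2e-7, k=1):
    """Optimal square speedup with c = 0: grows as n**(2/3)."""
    return (ET / (4 * b * k)) ** (2 / 3) * n ** (2 / 3) / 3

print(strip_speedup(256), square_speedup(256))  # squares win: ~20 vs ~46
```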
Supposing that E(S)·T_fl = b, N = 16, k(strip,S) = k(square,S) = 1, and n = 256, the speedup for strips is substantially lower than that for squares. We do not mean to condemn bus architectures for problems sized to these machines: the speedups we calculated for a 16-processor machine on large grids were acceptable.
However, significantly larger speedups for this same problem are possible using a (larger) hypercube.
If minimizing the computation's execution time remains the prime objective then other architectures should be considered.
Asynchronous Bus
Better performance can be expected if we are able to overlap communication and computation.
We next consider an architecture which allows asynchronous writes to global memory, but requires processors to wait for completion of their read requests. We then view an iteration as a reading phase followed by a computation phase. During the computation phase, we assume that a boundary value is written to global memory as soon as it is updated. To maximize performance, we also assume that boundary values are updated before any other points.
The time required to read the boundary points is exactly half of the transfer time t_c derived in the previous section.
During the computation phase, a boundary point is updated every E(S)·T_fl units of time until all boundary points have been updated. The time required to update all A points in a partition is E(S)·A·T_fl. If at this time the bus has managed to complete all requested writes, then the iteration is finished. Otherwise, the iteration does not terminate until the bus services its backlog of boundary value writes. If a backlog exists after all points are updated and P processors are in use, then clearly the bus is unable to service P boundary value writes in time E(S)·T_fl. Consequently, if a backlog exists, the bus has been fully utilized during the entire computation phase. We may therefore write

t_cycle = t_read + max{E(S)·A·T_fl, b·B_total}   (7)

where t_read = t_c/2 and B_total is the total load (summed over all processors) offered to the bus during the iteration. For strips with c = 0, (7) is minimized when the two arguments of the max are equal, giving Â = √(2·b·k(strip,S)·n³/(E(S)·T_fl)); the corresponding area for a synchronous bus, given by equation (3), is exactly a factor of √2 larger.
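A sketch of equation (7) for strips with c = 0 (our constants; t_read and B_total follow the definitions above):

```python
n, k, b = 256, 1, 2e-7
E_S, T_fl = 5, 1e-6

def t_cycle_async(A):
    """Equation (7) for strips with c = 0: a read phase, then the
    larger of the compute time and the bus's time to drain all writes."""
    P = n * n / A                 # processors in use
    t_read = 2 * n * k * b * P    # half the synchronous transfer time
    B_total = 2 * n * k * P       # boundary writes summed over partitions
    return t_read + max(E_S * A * T_fl, b * B_total)

# Evaluate at whole-row areas and report the best:
areas = [n * i for i in range(1, 65)]
A_best = min(areas, key=t_cycle_async)
print(A_best, t_cycle_async(A_best), n * n // A_best)
```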
As before, it is easy to show that fewer than N processors should be used if N exceeds n²/Â.
The optimal speedup is given by

Speedup = √(E(S)·T_fl·n/(8·b·k(strip,S))),

which grows as the fourth root of n².
Comparison with the synchronous bus speedup shows that the asynchronous bus speedup is a factor of √2 better.
The cycle time for a square partition with A points is

t_cycle = 4·k(square,S)·b·n²/√A + max{E(S)·A·T_fl, 4·k(square,S)·b·n²/√A}.

This is a convex function of A which is minimized when the arguments of the max function are equal:

Â = (4·k(square,S)·b·n²/(E(S)·T_fl))^(2/3).
This area is identical to that calculated for the synchronous bus case. The asynchronous bus optimal speedup is

Speedup = n²/(2Â) = (1/2)·(E(S)·T_fl/(4·b·k(square,S)))^(2/3)·(n²)^(1/3),

which is 1.5 times the synchronous bus speedup.
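The 1.5 factor can be checked directly (our arithmetic, under the c = 0 square-partition costs above): at the common optimal area Â the computation time equals the bus drain time b·B_total, so

```latex
\begin{aligned}
t_{\mathrm{sync}}  &= \underbrace{E(S)\hat{A}T_{fl}}_{\text{compute}}
                    + \underbrace{2\,E(S)\hat{A}T_{fl}}_{\text{serialized reads and writes}}
                    = 3\,E(S)\hat{A}T_{fl},\\
t_{\mathrm{async}} &= \underbrace{E(S)\hat{A}T_{fl}}_{\text{read phase}}
                    + \underbrace{E(S)\hat{A}T_{fl}}_{\max(\text{compute},\; b\,B_{\mathrm{total}})}
                    = 2\,E(S)\hat{A}T_{fl},
\end{aligned}
\qquad
\frac{\mathrm{Speedup}_{\mathrm{async}}}{\mathrm{Speedup}_{\mathrm{sync}}}
  = \frac{t_{\mathrm{sync}}}{t_{\mathrm{async}}} = \frac{3}{2}.
```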
The most interesting thing to note about our asynchronous bus results is their relationship to the synchronous bus results. For both strip and square partitions we observe that optimal asynchronous bus performance is a constant (albeit substantial) factor better than synchronous bus performance. A constant factor improvement remains even if we relax the requirement that global memory reads be synchronous (in this case we assume that half the grid points are updated in parallel with the initial read requests, the other half in parallel with the boundary writes; this gives an additional 126% improvement in speedup). The inevitable contention for communication resources, even when communication is conducted in parallel with computation and even when fixed overhead is ignored, constrains the optimal speedup to be O((n²)^(1/4)) for strips and O((n²)^(1/3)) for squares.

Switching Network Architectures

We next consider shared memory architectures in which processors reach global memory through a multistage switching network, such as a banyan network. We make the following assumptions:

(1) The number of global memory modules is equal to the number of processors;
(2) Each processor has local memory, and only boundary values are stored in global memory;
(3) The network switches are 2 by 2;
(4) The network is sufficiently fast that we can ignore contention while boundary values are asynchronously written to global memory.
Item (1) ensures that memory module contention does not grow with the number of processors; an analysis paralleling the hypercube case then yields an O(n²/log(n)) optimal speedup.
These switching network speedups differ from the hypercube speedups only by a factor of 1/log(n), a factor which arises from the growing number of stages in the switching network as the problem grows. For the size of problems treatable in the near future, this log factor will not be as significant in determining performance as switching network speed (for banyan networks) and message packaging costs (for hypercubes and grids).
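To summarize the growth rates side by side, a toy tabulation (our sketch; all constant factors are suppressed, so only the trends are meaningful):

```python
import math

def optimal_speedup(n, arch):
    """Leading-order growth of optimal speedup in the grid size n^2;
    constant factors are dropped."""
    n2 = n * n
    return {"hypercube/grid": n2,
            "banyan":         n2 / math.log2(n),
            "bus (squares)":  n2 ** (1 / 3),
            "bus (strips)":   n2 ** (1 / 4)}[arch]

for n in (128, 256, 512, 1024):
    row = [round(optimal_speedup(n, a)) for a in
           ("hypercube/grid", "banyan", "bus (squares)", "bus (strips)")]
    print(n, row)
```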
Conclusions
A number of factors influence the performance of an elliptic PDE solution on a parallel architecture. Reed et al. [12] detail the interactions of stencil, partition, and architecture; we use their framework to look at issues in processor allocation and maximum possible speedup. For various types of architectures we developed equations describing execution time; invariably these functions turned out to be convex in the number of grid points assigned to a processor. This convexity shows that the best assignment of grid points to processors either (1) uses as few processors as possible, (2) uses as many processors as possible, or (3) is a unique preferred assignment which does not use all available processors and is easily determined using calculus. We show that for any collection of model parameter values, optimal performance on hypercube, grid-like, and switching network types of architectures is achieved either by spreading the problem grid across all processors, or by forcing the grid into as few processors as possible. This result depends heavily on the fact that communication for the algorithm studied is strictly nearest neighbor; existing studies [1] provide counter-examples for other communication patterns. For our problem, both synchronous and asynchronous bus architectures allow for optimal assignments which do not use all processors. However, we showed that in order for this situation to arise, the fixed overhead cost of communicating a word on the bus must be nearly as small as the bus cycle time. Our formulas predict the smallest grid size which needs all available processors to perform optimally; they also give upper bounds on the optimal speedup possible. We noted that bus architectures can achieve acceptable speedup on reasonably sized grids, despite the potential for relatively high contention for global memory. Also, by looking at optimized execution times on bus architectures, we identify the leverage on performance given by increasing processor or network communication speed.
We also examined the suitability of these architectures for solving increasingly large problems. It is seen that for any of the aforementioned architectures with N fixed processors, the speedup approaches N as the grid size increases. More interesting is the behavior of optimal speedup when we let the architecture grow with the problem size. There we find that square partitions are strongly preferred over strip partitions; that hypercube speedups grow linearly in n², switching network speedups grow proportionally to n²/log(n), and that bus architecture speedups grow only as (n²)^(1/3), even if bus access is completely asynchronous. The fundamental limitation of a bus is that it serializes all communication, so that performance is ultimately governed by communication volume. At best, we can expect speedup to grow with the square root of the computation volume.
Allow contention proportional to total communication volume (summed over all partitions), and the optimal speedup drops to the fourth root of n². Even for squares, allowance of such contention restricts speedup to a cube root of n². The clear implication is that contention eventually causes serious performance degradation; our analysis shows how bad that degradation can be. It is also interesting to note the rather limited leverage we have on improving bus architecture performance by increasing processor or communication network speed: reducing the floating point time by a factor of k decreases optimal execution time only by about k^(-1/3); a similar reduction in bus time reduces optimal execution time by about k^(-2/3). On the other hand, for strip partitions, reducing the fixed overhead cost of communication decreases optimal execution time linearly.
One possible means for reducing contention is to use clever scheduling of accesses to communication resources. We have not yet explored this possibility, but suggest that it is important to do so given the significance of the degradation our analysis predicts. Future effort will be devoted to verifying our analysis empirically, and to investigating the aforementioned scheduling issues.
