Abstract -This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
the latency and bandwidth of the processor communication medium.
The first model is that of a "medium-scale" shared memory multiprocessor, having perhaps 2-32 processors, with each processor capable of exploiting substantial local vector parallelism. Section II of this paper gives a formal description of the shared memory model and illustrates a method of analyzing algorithms for machines of this type. Several standard, but important, numerical problems are studied and a number of alternate implementations are analyzed. In particular, it is shown that for machines which have two levels of parallelism, the performance of algorithms strongly depends on the way in which the problem is partitioned to fit on the architecture. The performance of the algorithms is given as a function of global and local memory latencies, the speed of arithmetic operations, the number of processors, and the size of the problem.
The second model is that of a highly parallel MIMD system where processors communicate through a large network and there is no shared memory. We assume here a number of processors ranging from perhaps 32 to a few thousand, but with processors of lesser power than in the shared memory model. Analysis and design of algorithms for such systems turns out to be significantly different from that for the shared memory machines. In Section III, it is shown that the techniques used in VLSI complexity analysis can be used to derive reasonable upper bounds on speed up and efficiency. The appropriate parameters for this analysis turn out to be the ratio of message transmission times to arithmetic speed, and the relation of the problem being solved to the topology of the communication network. By looking at specific algorithms, it is shown that many of the derived upper bounds are exact.
As a variant of this second architecture model, in Section IV we consider machines interconnected by packetswitched communications networks. Analysis of algorithms for such machines is similar to analysis of algorithms for other nonshared memory machines, except that communication delays play a central role. The paper concludes with a discussion of the shortcomings of the approaches described here and suggests several directions where more work needs to be done.
II. SHARED MEMORY MACHINES
One of the clearest trends in commercial systems is the trend toward multiprocessor shared memory architectures (see Fig. 1 ) where each processor has either a pipelined multitasking or vector capability. This family of multiprocessors includes the Cray X-MP, Cray-IL, the HEP-I and 0018-9340/84/1200-1180$0l.00 C 1984 IEEE HEP-IL,9 and the ETA GF10. The proposed Cedar multiprocessor [ 12] may be viewed as a machine in this class where each "processor" is, in fact, a cluster of smaller processors. In this section, we consider the design and analysis of algorithms for such machines. We begin with a list of properties shared by many machines in this class. (Of course, no list will characterize every machine, and the set of specifications below should be considered only as an approximnation to a family of architectures.) 1) There are p processors, with p roughly in the range 2 p ' 32.
2) All processors have equal access to shared memory and vectors may be of arbitrary length and stride. 2 3) Each processor also has a sizable local memory from which it can fetch vectors of arbitrary length and stride.
4) Each processor can perform vector diadic operations (or vector triadic operations where one operand is a scalar) using operands in from either the local memory or global shared memory. The execution time for a vector operation of length n is r2 (n + n42) if either operand is in global memory, while it is rK'(n + n'l2) if both operands are in local memory. Here r2-' is the asymptotic performance rate for one processor, and n1,2 is the vector length required to achieve half the asymptotic performance rate, an idea due to Hockney and Jesshope [18] . Here we assume that local memory accesses have much less latency than global memory accesses, and thus, n'II,2<<n2« The scale of this inequality depends on the machine. For a system that uses a network of log(p) stages to bring data from shared memory into local (cache) memory, one may have G L n,,2 = n112 + C * log(p) for some constant C. In general, the ratio n'l2/nt,2 might vary between 10 and 1000.3 ' The HEP I and II machines belong in this class, although the model of analysis given below does not fully describe their performance.
21n some systems, performance may be severely degraded if two processors access the same vector or an access has nonunit stride. The algorithms that follow avoid multiple accesses, but to simplify the analysis, nonunit stride performance problems have been ignored.
3An alternate assumption would be that the latencies nt,2 and n,,2 are equal, but that the computation rates for operand from global and local memory differ or that all vector operations on global data must be "cached" to local memory before execution. An analytical model can be built from any such set of assumptions.
5) The parallel execution of p tasks on p processors is denoted by pardo(i = 1,p) task(i); endpar;
Here task(i) is a procedure, block, or statement that is executed on the ith processor. No assumptions will be made about the processor synchronization or task-scheduling mechanisms other than that the execution order will be consistent with the serial data dependencies.
Algorithm Design
The most natural way to design algorithms for these systems is to employ what the mechanical and structural engineering community has called problem substructuring. In this approach, the problem is divided into a set of independent tasks, each operating on its own portion of the data structure.
To illustrate this idea we consider an example studied many times in the literature [5] , [18] , [21] , [35] , [17] , that of solving T systems of tridiagonal matrix equations, each of size n. Our notation is as follows. Vectors and arrays will be denoted with capital letters (X, Y) and problem instances will be denoted with a superscript. Scalars (and vector components) will be denoted by lower case letters (with subscripted positions). A One useful method for comparing these two implementations of this algorithm is to compute the "effective efficiency" of each. Observe that the asymptotic speed of our machine is rcOp operations per second. The set of T tridiagonals requires (8n -7)T operations. If the machine could be programmed to operate at 100 percent efficiency, the execution time would be
The effective efficiency of the simple algorithm operating from shared memory is defined by The resulting system is shown in Fig. 2 . The subsystem corresponding to rows (i -1) * n/p and i * n/p -1 for i -1, p can be solved-by the "simple" method described above, and the remaining variables can be solved by the "back solve" process. bk-solve: pardo(i = l,p) for] = (i -1) * p/n + I to i*p/n -1 do begin [ The only use of the shared memory is to solve n reduced systems of size 2p required to complete the substructured elimination. The total time to complete this is as follows.
Step 1): Download data and solve column equations:
Step 2): Do the substructured elimination and upload final results:
The alternative is to use the simple algorithm for both column and row equations. Unfortunately, this requires that the partial solution vector Z be moved back to shared memory and read back to local memory in transposed order. Based on our memory addressing assumptions, this step requires a minimum of r2 1(n/p) (n + n 1/2) seconds.
Step 2'): Transpose Z and use the simple method and upload the final results:
To determine the effective efficiency, observe that to compute B -1A -Iy requires at least n (16n -14) operations. If the machine can be programmed to run at 100 percent efficiency, the execution time would be r2--(16n -14). The asymptotic efficiency for the substructured algorithm is 50 percent (computed in the limit as n goes to infinity) and 62 percent for the transposed simple method. In the case that n is small (near or below n/2), the substructured algorithm is superior. To illustrate this, consider the special case of p = 32 processors, np42 = 1000, and n4/2 = 10. Fig. 4 depicts the efficiency as a function of n. The point at which both methods are equal is when n is approximately 12000. In general, one can show that the substructured algorithm is superior to the simple scheme when [24] , PASM [32] , Cedar [12] , and the ULTRA Computer [15] . These systems are considered in Section IV.
Our principal focus in this section is on nonshared memory machines where data movement is explicitly controlled by the processors and limited by the topology of the network. [4] , [8] , [23] , [31] , [38] ) to study the area-time tradeoff in VLSI design can be used to find lower bounds on T' (O, /3). Define a bisector B of a graph G of n nodes to be a set of arcs of G whose removal separates G into two graphs of equal size (i.e., if n is odd, one subgraph is size (n/2) + (1/2) and the other is of size (n/2) -(1/2)). Let bG be the number of arcs in the minimal bisector. If A has n inputs and n outputs, then im maps n/p inputs and n/p outputs to each of the p processors. In other words, we assume the implementation "uniformly distributes" the problem. This condition implies that the set Im-1(B) is a bisector of the input and output nodes of G (A). Let bpr be the size of the minimal bisector of G (A) for all algorithms that solve the given problem Pr. We then have
This inequality characterizes the communication "bottleneck" imposed by the network-topology-induced bandwidth constraints. Network delay can also play an important role. A function with n inputs and n outputs is said to be transitive if every output is a nontrivial function of every input. For any transitive function that has been uniformly distributed over p processors, the minimal time that information about each input can be propagated to each output is 2,3 log(p). In this case, we have TsEQ(1,O0) 3) Elln: The direct solution of the n linear equations obtained by a simple approximation (five-or nine-point star) of a second-order elliptic boundary value problem on a unit square discretized as an VA x \/n grid.
The first problem here is well known [36] , but the second two have not been studied in this context, and it is interesting to note that the same proof applies to all three problems. Lemma bFFr,= n; bTRi, = 2; bEI,l= 2VA.
Proof: Each of the problems above can be viewed as the problem of solving a matrix equation of the form Ax = y for a given n x n invertible matrix A. The basic idea is as follows: any bisector will divide the flow graph of an algorithm into two "machines" where one machine has one half the input vector, call it yi, and the other machine has the other half Y2. The bisection width of the problem is defined to be the minimal amount of "information" about Yi that machine 1 must send to machine 2 plus the "information" about Y2 that machine 2 must send to machine 1. The theorem states that, for example, no algorithm exists for directly solving elliptic boundary value problems for which the information flow between the two halves of the system falls below 2V-words for all input vectors y. The Fast Fourier Transform The FFT algorithm on a problem of size n can be defined in many ways. Here we follow [2] . The algorithm is expressed' in terms of a sequence of log(n) permutations. It can be shown that the log(p) stage butterfly network is topologically equivalent to the log(p) passes of the Shuf permutation. (The rearrangement is shown in Fig. 6 and a formal proof is given in [26] .) Consequently, on machines with Com and Shuf interconnection topologies, the execution time is approximately To emulate the permutation BflYk on a ring network requires uniform shifts of distance ± 2k-l. With each such permutation requiring 2,32k seconds, the speedup is found to be
On a mesh-connected computer, the uniform shifts of distance 2k-l can be done in time 132k-2log(p) (see [37] ) and the speedup is spFFT -P Mesh p1/2 1+r-log(n) In the case of the Ring and the Mesh, the speedup agrees with the theoretical upper bound and must be optimal'. In the other cases, we feel the algorithm is optimal and the speedup bound is too generous. Fig. 7 depicts the relative efficiencies of the FFT algorithm on the three networks described above in the case that p = 512 and r = 1.0.
Tridiagonal Systems
To To execute this algorithm on the Shuf network, we need only embed the tree structure in the target graph (as shown in Fig. 8) , and the execution time and speedup will be the same as on the tree. The tree has the property that any node can be reached in at most 2 log(p) steps from any other node; consequently, no direct embedding of the tree is possible in the ring or mesh. We do not know of an algorithm for these networks that is within a factor of log(p) of the correct communication complexity.
The tree computation above proceeds as a wave from the leaf nodes to the root. If we consider the problem of solving m tridiagonal systems of size n (m_Tri_n), we can pipeline the method above. The execution time is T`-T -'(a, ) = 17a(-+ 2(n -m -1) + 3 log(p)) + 9a + 30,B(n + log(p)).
This result will be used below.
Fast Poisson Solvers
There is a wide variety of interesting direct solvers for the Poisson equation V2x = y (see [18] , [16] , and [29] ). The method here is the easiest to explain. Consider a grid of n 112 X n112 points. The basic fast Poisson solver operates in three steps on an array of values y. 1) Apply an FFT to each row of the grid of y values. = RowJFFT's(y).) 2) Solve n systems of tridiagonal equations of size n using the columns of the grid of values as the right-hand sides.
3) Apply an inverse FFT to each row of the solution of the previous step.
With Dirichlet boundary conditions, the FFT's used here are actually real sin transforms which can be computed with the complex FFT algorithm (see [18] for details). Assume that we have p processors to execute this algorithm. Let k be a divisor of p and partition the problem so that the columns are divided into k groups and the rows are divided into p/k groups. We can then assign each processor to a block with n 1/2/k columns and n 1/2(k/p) rows. In steps 1) and 3), each row of k processors must compute n 112(klp) FFT' s of size n112.
Using the algorithms above, the minimal execution time for both steps is TFFT'S = 2n n log(n) n k) -303) log(k) (3.2) where R is Independent of k. Minimizing (3. 2) as a function of k is, in general, not easy. There are two interesting cases to consider.
Case 1: (nlp) c (17/E) (a/,3) + (15/2).
In this case, the last term in (3.2) is negative; thus, we choose k as large as possible. If p < 1/2, then we set k = p which implies that it is best to distribute the FFT's across all p processors and to solve the columns of tridiagonal systems without communication (n 12/p columns per processor.) If np12 or p is greater than (17/2)(a/,3) + (15/2), we find n1/2 < p. Setting k = n 1/2, the FFT's are again distributed and the execution time is TF S 2n log(n) The optimal value for k is found to be k = max(1, (15p -n (1 + 30 a)) ) In the case that n"2 » 15p/2, we have k = 1 and the execution time is TFPS = a(.(log(p) + 2 + 34 log(P) + 30p (n 112 + log(p)).
Setting k = 1 implies that each row of the grid is stored completely in one processor and the FFT's involve no communication. Consequently, only the solution of the tridiagonal systems involves communication and the time bound above is valid for a tree network. The resulting speedup is of optimal complexity. On the other hand, the theoretical lower bound for communication cost for an Elliptic problem is f3C (n 1/2/p 112) for a Mesh network and /3C (n lf2/p) for the Shuf network. The time estimate above suggests that this form of the fast Poisson solver is suboptimal for these networks.
Based on the upper bounds for speedup in Table I , Fig. 9 illustrates the relative performance of optimal fast Poisson solvers as a function of problem size (n) when p -512 and r -1 for the three network topologies Ring, Mesh, and Shuf.
Multigrid iterative methods [3] , [11] are structured so that each stage of the iteration reduces an n 1/2 X n 112 problem to a problem of smaller size which is solved by a direct method. Table I Now looking in detail at each of these, for the first approach, the execution time is just m times the execution time of FFT_n. That is, Tr-FFT-n = a(r) log(n) + mX,(n) log(p).
With the second approach, one will have only log(p) communication steps rather than m log(p) as in the first approach. However, each step is now a permutation on mn words of data rather than on n words as in the first approach. The execution time is thus T-FFT-n= a(2) log(n) + Xp(mn) log(p). p At first sight there appears to be little difference between the two approaches. However, the function Xp(n) satisfies the inequality X(r(mn) ' mXp (n) for all m, so the second approach is always at least as good as the first approach.
The third approach here is somewhat more complex. Two operations are involved, permuting the data, so each of the m vectors is distributed over a block of p/m processors, and then performing FFT's on these processor blocks. The FFT algorithm needed here is just the FFT for a single data vector FFT_n, already studied, except that only p/m processors are used now. The execution time to perform these FFT's on processor blocks is aF = C /m)log(n) + XP(nm) log where we have used the identity Xpim (n) = Xp (mn) which is easily derived from (4.1).
The other operation needed is permuting the m data vectors. Each data vector is originally distributed evenly over the p processors, and must be moved so it is distributed over a block of (p/m) processors. Comparing these equations, it is clear that the first algorithm is never better than the second, as already mentioned, since the impact of 'y is smaller in the second. Note that this conclusion applies only for the packet-switched networks under consideration. For networks with fixed or circuitswitched topology, these two algorithms perform identically. Fig. 10 illustrates the efficiency in the case that a = /3 = 1.0 and y = 50.0, p = 512, and n = 1024. In this case, we have plotted the performance as a function of the number of equations m. Observe that the third algorithm becomes better than the second when m ' 4. Had we included the cost of the "bit reversal" permutation in all algorithms, the third algorithm would have become better even earlier. Between the third and fourth algorithms, there is nothing to decide, since the third applies only to the case m ' p and the fourth to the case m ' p. Fig. 10 depicts these two methods as one with a transition at m p.
Although searching for optimal algorithms is interesting, the real issue here is the impact of the delay fy caused by the use of a packet-switching network. The effect of y depends on the ratio of problem size to the number of processors; on problems with a great deal of computation, y is well masked. In fact, in the three FFT algorithms for multiple data vectors found to be best, y always enters the execution time in the ratio 2mna-Although this analysis was performed only for FFT algorithms, experience suggests that the delays caused by packetswitched networks are relatively unimportant on most compute bound problems.
V. CONCLUSION
This paper has considered three basic families of multiprocessors and the analysis of communication complexity in the algorithms for these architecture classes. The principal goal here was to look at communication and its impact on algorithm performance. For large shared memory multi- processors, analyzing communication turns out to be relatively straightforward. The main issues are memory latency and finding ways to organize or substructure problems to minimize its effect. Studying algorithms on nonshared memory machines is more difficult since the topology of the communication network is a central issue. Our analysis of nonshared memory network-based machines was divided into two parts, the first covering machines with a fixed or circuit-switched topology, and the second covering machines based on packet-switched networks.
On circuit-switched machines, techniques borrowed from VLSI complexity theory provide a nice tool for obtaining lower bounds on algorithm complexity. Given an interconnection topology, one can, with relative ease, compute upper bounds on efficiency of the problem solution. An important point here is that these are upper bounds on the problem (e.g., FFT, fast Poisson solve, direct solution of tridiagonal systems), not on any particular implementation of an algorithm for solving the problem. In the cases studied, these upper bounds are apparently quite tight; in two of the three cases studied, these upper bounds are actually attained.
By contrast, analysis of algorithms on machines interconnected by a large packet-switching network is far easier, given our simple model of the behavior of packet-switching networks. Here the propagation delay parameter y, modeling the impact of packet contention, is quite important. But on most large problems, it seems to be possible to substructure the problem so that the effect of y is minor. With our model of a packet-switched network in which such a network is treated as a crossbar switch with delay, analysis of algorithms is no more difficult than for shared memory multiprocessors. (In fact, the delay y.is closely related to the value nt?2.) At 
