We consider modeling, predicting and evaluating the perfonnance of methods for solving PDEs in parallel architectures. We have developed a method for coarse grain partitioning of computations for parallel architectures and we apply it to three PDE applications:
and models of their communication. We apply this method to the three applications implemented on the FLEX/32 multicomputer. We review the architecture on the FLEX/32 and the results of applying the partitioning method to computation running on the FLEX/32. We observe that the FLEX/32 does not have any communication bottleneck and probably will not suffer substantial perfonnance degradation if the processor speeds are increased by a factor of 10. Our partitioning method works reasonably well even here where communication costs are negligible. The coarse grain structure of two of these applications is not highly parallel and we observe speedups of about k/2 for k processors. The other application is highly parallel and we observe optimal speedups for any number of processors as the problem size increases.
'loThis research supported in part by NSF grant DMC-8508684. *"'This research supponed in pan by AFOSR granl 84·0385. This paper will appear in the January 1987 issue of Parallel Computing.
Due to the lack of space, we discuss the parallelism reduction in depth elsewhere [HOUS86] , and assume here that this step is not needed. Given that steps 1 and 2 are done, step 3 is Cairly easy for the class of machines considered here. Thus, this paper concentrates on the problem of scheduling tbe application so tbat it runs efficiently and on evaluating the performance of these schedules.
Our overall plan for making performance evaluations is as follows. We first devise
fast, heuristic algoritbms wbicb carry out steps 1, 2 and 3 above. Note that the order of steps 1 and 2 is not necessarily fixed, we assume in the discussions that step 1 precedes step 2 so as to simplify the discussion but this is not an essential ingredient to our approach. OUf heuristics use communication delay models for "uniform" architectures in order to simplify the complexity of the mapping problem. Once we have the application mapped onto the machine and scheduled to provide good efficiency, we then run the program on an actual machine (we use a FLEX/32 multiprocessor) or use a simulation package, (for example, SIM:ONIFUn85]) to llexecute" a "compressed" revision of the program and to estimate its performance.
The performance of a distributed system depends very much on how well the architecture and tbe algorithms are matched IKLEIN85]. A methodology for predicting multiprocessor performance from the systems point of view and the RP3 architecture is presented in [NORT85J. The problem of mapping has been studied in [BOKH81J and [BERM84, 85] for the finite element macbine and CHlP architectures, respectively. In [WILL83J various objective criteria necessary for assigning processes to processors are examined and tested in a distributed environment. A different approach for modeling communications complexity of parallel algoritbms is presented in IGANN84J. Our approach to mapping POE compntations onto parallel macbines (i.e. step 2 of the mapping process) is described in IHOUS83j and our approach to compressing the parallelism is described in [HOUS86J.
In tbis paper we give models of three POE applications (Section 2) and discuss how to model communications for various machine architectures (Section 3). We then describe the FLEX/32 architecture and its performance (Section 4) and apply our mapping algorithm to the three POE applications (Section 5). The observed performance for these applications is discussed (Section 6) and compared with predictions Crom the model. Some conclusions are stated in Section 7.
APPLICATION MODELING WITH THREE EXAMPLES
For th~modeling of an application, we assume an initial partitioning of the computation into a numher of communicating compntational modules. If the communication paths among the modules are not known a priori (i.e. real time applications) then the partition is modeled by a stochastic AND-EOR directed graph SG(A). In [HOUS84] a stochastic a~alysis is applied to reduce SG(A} into a deterministic graph G(A) which -4comprises the input to the mapping model.
IC the communication paths among modules are known a priori then we use complexity analysis and/or simulation to obtain values for the parameters of the computation graph G(A). In the case of simulation, we use the data flow language SIMON {FUJI85! and coucurrent C aud FORTRAN languages {FLEX85) to specify the computation modules and their communication and synchronization requirements. In order to execute SIMON programs we use our non-shared memory multiprocessor simulator {FUJI85). Concurrent C or FORTRAN programs arc run on FLEX/32 multiprocessor system. Next, we consider three scientific computations to be used for the evaluation of our methodology.
Appliclltion I: Cholesky decomposition of symmetric mlltrl••s
We consider the parallelizaUon of Cholesky decomposition of symmetric matrices. • 5 -
Computation of node (i,j) in SIMON data flow language for a parallel Cholesky decomposition algorithm.
A matrix is partitioned into blocks so that the maximum degree or para.llelism is equal to the number or processors. 1£ smaller blocks are used, then the parallelism reduction would be applied but we do not consider tbis case bere. Tbi, application provides an example where the parallelism is mo,Uy less tban maximum and tbus wbere the speedup should be considerably less than tbe number or processors. See [O'LEA86] for furtber discussions. That paper sbows tbat optimal speedup is acbievable witb a speedup of N/g for N processors.
, " , Module processing times m = (24, 30, 30, 25, 27, 31, 37, 32, 27,34,38, 3g, 21, 28, 35, 45) 
Application II: Spline collocation method. for elliptic PDEs
We consider t.he numerica.l solution of an elliptic PDE problem defined by the second order elliptic operator Lu = au" + 2/1u,y +')1Iyy +6u, +, u y +(U = r, on ncm 2 subject to boundary conditions Bu = 0 on the boundary or n.
We use the line collocation method based on cuhic splines studied in lHOUS84!, [VANA85] to discretize the above PDE problem. For simplicity, we consider the The discretization (2.1) can be considered as the application of a set of equations to a data set. For example, the first block 01 equat.ions (2.la) is applied to data associated with the horizontal mesh lines (see Fig. ' 3a) and the (2.lb) equations are applied to data related to the vertical mesh lines (see Fig. 3b ). One partition technique is to divide the data sets (mesh lines) into part:;;, instead of the equations, and assume each part is assigned to a single processor. Figure 4 depict.s the parallel spline collocat.ion algorithm based on the above data set decomposition for a two processor system. The generalization is straight forward. In Section 6 we present the performance of this parallel algorithm for a. shared memory architecture.
HelmhoJtz equation with constant coefficients and an iterative method corresponding to SOR for solving the underlying linear equations. This discretization is defined by the block equations
Alx = (I-w) Alx-wA2y+wF, Q2y =(1-w)Qzy+wQIx-wG.
Application m. Large Bcale PDE computation.
This PDE application is complete from the description of a problem in a very high level language through the graphical display 01 the computed solut.ion. Figure 5 shows a schematic diagram with the steps 01 the computation, it has degree 01 parallelism 6. Not shown in Figure 5 are vertical data communication paths between corresponding nodes on the lan-in/lan-out 01 the tree structure in the center. The PDE problem solved is a general one on a Don-rectangular domain. A multifront method is used based on a nested dissection partition of the domain. Gauss elimination is used to eliminate unknowns in the interior of each domain. The numbers shown at each node are the "computation units" for a particular instance of this problem corresponding to using a 20 by 20 mesh, a finite element method with cubic basis functions, and a 40 by 40 plotting grid. The domain boundary has 3 pieces and, at the lourth step, all but one processor are working on the interior of the domain. A computational unit here is about 1000 arithmetic operations plus associated memory and control operations. -12 -
COMMUNICATION MODELING IN MULTIPROCESSOR ARCHITECTURES
Our methodology for reducing the initial interconnection graph G(A) is based on minimizing its communication cost. Thus, it is crucial to be able to predict the communication cost or G(A) running on a specific muJtiprocessor system architecture. We The delay obtained is only approximate because or the various simplifying assumptions involved in the analytical modeling of the performance characteristics (queueing, utilization) of the multiprocessor system. In all or the models we consider, two assumptions are made: (a) every processor accesses the interconnection network at the same rate, and (b} every processor accesses the various common memory modules with the same probability. These assumptions lead to a queueing model which is mathematically tractable. Intuitively, these assumptions imply that the communication reqnirements of G(A) are nearly uniform. rr G(A) is one of many applications running in the same system, these assumptions may be realistic even it the algorithm has somewhat -13-non~unirorm communication patterns. A precise investigation on this question is under way.
We consider the performance analysis of five different multiprocessor parallel architectures: (1) Siogle bus, common memory, (2) Single bus, distributed common memory modules, (3) Multiple bus, distributed common memory modules, (4) Omega intercon-nectioD, distributed memory modules, (5) Omega. interconnection processor to processor rnuJUprocessor system.
Single bUB common memory architecture
In this architecture, (Figure 7(i) ), k processors (Pi) are connected to aD external common memory (eM) via a global bus (GB). Each P; has a local memory (LM;) connected to its own local bus (LB;). The system has been analyzed in 1MAR83) as a lI mac hine repairman" problem. For given bus utilization level, say ull the average queueing delay per information transfer unit is given by D, (u,) 
Single bus distributed memory architecture
This architecture (Figure 7 (ii)), is obtained from the previous one hy distributing the common memory to each processor. The local memory of each Pi is logically divided in private memory PM j and common memory CM i and accessed by a double port.
This system has been analyzed in IMAR83]. The average queueing delay per infor-maHon transfer unit as a funelion of the bus utilization is given by
where Pk(p,) = .I;pJ (k-··)I J=O J . msg; p =average # of msgs entered by each P;/eyele time.
int.erconnection networks called Banyan networks [KRU83J. Some or these are the Omega, Delta or indirect cube interconnection networks. In our case, we have considered oxu switches where the capacity of the buffers is infinite. Then the average delay per message in traversing the interconnection network is simply its delay per stage times the number or sta.ges (Jognk) in the interconnection. 
Omega Interconnection PE to PE archItecture
In every architecture considered so Car we have d k " = 1. Figure 8 shows an architecture for which dk,Q ¢. 1. The organization of the processors and memories differs from that in Figure 7 (iv) in that each memory is associated with each processor forming a single element called the processing element (PE). Thus a PE to PE configuration i.
shown in Figure 8 . Note that this time the distance between the processors is dk,1 = 1,2. For the average queueing delay per message through tbe omega interconnection a formula is given which is the same as in architecture (3.4) but it is stated in more general terms by INOR85J and is as follows: -18net.works, delta, omega, indirect cube.
THE FLEX/32 ARCIDTECTURE AND PERFORMANCE
The FLEX/32 is a collection of computers that can each execute independent instruction streams and can separately access memory. The memory is shared and distributed in several modules as illustrated in Figure g . There are several choices of boards and they can be mixed.
The shared memory computers communicate by using the common memory. The intercommunication hardware for FLEX/32 is depicted in Figure 10 and consists of two common busses and ten local busses. Common busses are nonmultiplexed, synchronous with a clock rate of 12 Mhz, and with address width and data width of 32 bils. The communication mechanism is implemented by the Common Control Card (Gee) and the Common Access Cards (CAG). The GAG· card provides a local bus to common interfaces, contains 64 to 512 kbytes of common memory and hardware for interprocess The values of these parameters ror the Purdue FLEX/32 are given in Table 1 The FLEX/32 processors (NS 32032) were compared with VAX 780 single processor computer. It appears that each FLEX/32 computer based on NS 32032 is 1/3 of
VAX 780 Cor double precision floating point calculations and 1/2.65 Cor integer arith-
metic. This agrees closely with comparisons to the VAX 780 by other NS32032 based machines we have. The LINPACK subroutine DCHDC for Cholesky decomposition for symmetrie matriees was also used to eompare the performance of the FLEX/32-NS32032 processor and the VAX 780. Table 2 gives the performance of LINPACK double precision routiDe in C UDder FLEX/32-UNlX and MMOS. The data indicate that MMOS is slightly faster. Table 2 shows that a Fortran implementation of DCHDC is slightly more efficient than the C implementation. Table 2 also . Every computational module must fit into the memory assigned.
-Datablocks must fit into the memory "'<signed.
. Com"putations must have enough processor time.
-23·
(ii) parallelism constraint:
. Parallel computational modules can not he clustered, they are assigned~,o different processors.
(iii) artificial constraint:
. Processing time on each processor is limited to T (a parameter).
Tbe time frame parameter T is used implicitly to calculate tbe utilization (u) of tbe interconnection network of the machine M due to inlermodule communication traffic. The reduction of T forces more and more processors to be used.
We give tbe results of applying our metbod to map Applications I and 3 to tbe FLEX/32 architecture. Application 2 is not considered because there is an obvious mapping for it that gives optimal speedup.
The clustering algorithm
The solution of the reductjoD problem for a specific time frame T is achieved by the rollowing heuristic clustering strategy: 2,3,4,15,16 5,9,11,12,13 6,7,8 10, 14   5  1,2,3,4,5   6,11,16,18,21,  7,8,9,10 12 ,18,10,20,22 13,14,15 23,24, 25   6  1,2,3,4,5,   7,13,19,25,31  8,9,10,11,12   14,20,26,28,32  15,16,17,18  21,27, Table 3 . The clustering of the nodes of G(A) produced to obtain G' (A) by tbe heuristic algorithm. For k processors the nodes are numbered using the same pattern as in Figure 2 and then assigned to c1ust.ers as indicated in this tabl~.
Mapping application ill to the FLEX/32 architecture
The application of the heuristic mapping algorithm to the graph G(A) shown in Figures 5 and 6 gives the clusters shown in Figure 12 . Each node of G(A) is replaced by the cluster number to which it belongo in G' (A). The interlacing of the clusters makes the display such as in Figure 11 too complex. The time to con~pute this c1usterĩ og is 5.484 seconds. T are plotted in Figure 14 . These results were obtained using simulation on the FLEX/32. This observation is supported by our experimental results. Table 4 shows the effeet of using more processors than required for Application I (the 4 by 4, N=240 case).
On the beh avior of the mapping algorithm
Communication is 50 Cast in the FLEX/32 compared to computation that one sees little effect in actuality. We artificially increased the communication requirements by a factor of 100 and those results are shown in column 3 of Table 4 . Then the adverse effect oC having too many processors is evident.
The current implementation of the graph reduction pbase is capable, by using an iterative procedure, to identify all possible solutions and the breakpoints as in Figure   13 . Also, it is possible to estimate the elapsed time provided we are able to predict tbe 
PERFORMANCE OF APPLICATION/ARCmTECTURE PAIRS
In this section we present actual performance data for the three applications on the FLEX/32. We analyze the speedup obtained and correlate this with the computation structure and with the predicted values from our allocation algorithm.
ApplicatIon I/FLEX32 pair
Consider an N by N matrix and the Cholesky decomposition described in Section 2.1 and assume that is to be implemented using k processors. One way is to partition Table 4 . Effect 01 having more processors than required lor the parallelism 01 an al'" plication in the case of Cholesky decomposition with a 4 by 4 system and N=240 (see Figure 2 ). One should not expect speedup 01 about k (or efficiency almost 1) for thi, application. As ,een in Figure 2 , during the computation tbe parallelism of G(A) grow, steadily to its max imum and then decreases steadily back to 1. Thus a course grained partitioning such as we use should use, on the average, about balr the processors so we expect the efficiency to be about 0.5 which corresponds well with the observed results.
In order to study the effect of communication cost on the computation we have artificially increased it by a factor of 10 and 100. This increases the computational time for a 4x4 Cholesky decomposition graph N=240 and 4 active processors as follows: One of the objectives 01 our methodology is to predict the perlormance of some algorithm/architecture pairs. It appears that T pAR caD be used for this purpose. 6.2 Application II/FLEX32 pair A straight forward estimation of the computational complexity DC the spline collocation method indicates that it requires O(N 2 ) arithmetic operations per iteration where liN is the size or the partition in the x and y directions. Ideally, distributing the work among k processors reduces the computational complexity in a parallel algorithm to O(N 2 jk). The performance curve in Figure 14 indicates ideal speedup for large N. Table 7 presents the timings obtained with the FLEX/32 with a single processor and the VAX 780. These data indicate that the VAX 780 is about twice as fast as a FLEX/32 processor. This agrees with t.he data in Table 2 and supports the conjecture that the VAX 780 is not as much faster than FLEX/32 processor as the raw speeds of the processors would suggest. The speedup curves as shown in Figure 14 are typical of those seen in other applications and for other machines. This same data is redisplayed in Figure 15 , we plot the number of processors times the elapsed execution time for one itera.tion or this computation. If we had perfect speedup, then these curves would be constants. We observe that they have a slight rise, almost linear. In particular, each of these has risen about 3.5 at k =6. This suggests that
FLEX-UNIX FLEX-MMOS
Elapsed time = C,l'-'"/k + C, = 0.00135 N'/k + 0.75k which gives. Speedup = k(l + (k-1)C,/(C,N'))
Thus there is a very plausible fixed overhead or C z in the elapsed time per processor. The significance of this behavior is that C 2 is an overhead independent of N and thus ·34· one that becomes less significant as the problem size increases.
Application m/FLEX32 pair
Application m reqnire, 252.86 ,econd, to rnn on a single FLEX/32 processor. A brier examination or Figure 6 ,ngge,t, that a snbstantial ,peednp i, possible ror 6 processors, but somewhat less than 6 is expected. Tber~is a drop in parallelism in the center or the grap" (where tbe mnltirrontal metbod merges) which involve, substantial computation and the "create plot data structure" node is se.quential and sizable. When this application is run with 6 processors on the FLEX/32 using the clustering or Figure   12 , it takes 72,('.3 seconds. This is a speedup or 3.34.
An analysi, or Figure 5 shows that the maximum possible speedup i, 3.92 ir one assumes that communication is instant.aneous. Recall that our heuristic algorithm does not attempt to minimize elapsed time directly, so the speedup obtained is reasonable.
The time frame parameter T used to achieve the clustering is 66.37. This gives the predicted speedlip of 252.86/66.37 = 3.81 which is between maximum possible and the observed value.
CONCLUSIONS
We see that the heuristic partitioning algorithm works well even ror a machine like the FLEX/32 where communication costs are negligible compared to computational costs. The applications considered have a high correlation between communication and computational costs which is not surprising, this is probably rairly common. The seven processor ii"LEX/32 at Purdue exhibits no communication degradation even with aU processors continually accessing shared memory. We expect this situation to continue ir the number or processors is increased to the one cabinet limit (20 processors). Our projection is that the FLEX/32 boards can be npgrade,-: with processor te:: times raster than the current ones without adversely impacting most computations. We are surprised to see on two unrelated, arithmetic dominated applications that the FLEX/32 runs haIr as rast as a VAX 780 even though the VAX processors are 3 times raster.
Optimal speedup requires that the degree or paral1elism be maximal most or the time. Our approach only exploits coarse grain parallelism and two or the applications can use both coarse and fine grain parallelism. Even so, we observe good speedup or about k/2 ror k processors. The third application (spline collocation) naturally has high parallelism and, Cor any number oC processors, we observe optimal speedup as the prof>. lem size increases. In this case the lack of perfect speedup seems to be due to a small constant "start up cost" per processor. This small cost has a large apparent effect on traditional "speedup versus size" graphs. 180 and 220. Note that the difference trom a constant in these curves at k =6 is almost th e same.
