Abstract
Introduction
This paper investigates the time lost in a parallel computation due to different sources of inefficiency. Sequential and duplicated work, communication and control, and blocking are the major factors limiting the performance of a parallel computation. This paper is primarily concerned with control parallel algorithms and programs for MIMD systems. The computational model discussed considers P threads of control running concurrently and communicating with one another through abstractions called communication channels. Any interruption of the flow of control of any thread for communication and/or control is called an event.
The model is mapped to a particular MIMD architecture by associating threads of control with processing elements, PES, and communication channels with communication hardwarelsoftware. Events correspond to sendinglreceiving of messages for the message passing model suitable for distributed memory MIMD systems, or to access to shared data in a shared memory model for a shared memory MIMD system. Once the computational model is mapped to a MIMD architectures, one can make statements concerning the computing time, the work required to carry out the computation, the work intensity or the instruction execution rate, and about the performance of the ensemble consisting of the parallel algorithms, its implementation and the parallel architecture.
This paper discusses several experimental and theoretical questions pertinent to the performance characterization and analysis of parallel computations on MIMD systems. First it addresses the problem from the point of view of a practitioner and shows that measuring the speedup is often a difficult, if not impossible task, due to time and space limitations of the sequential computation. Furthermore, the results of a speedup measurement can be misleading. Instead of reflecting an efficient parallel computation, a large speedup may be due to an inefficient sequential computation.
The concept of relative speedup introduced in $2 reflects the fact that space and time requirements impose a minimum number of PES, Pmin > 1 for massively parallel computations, and that it only makes sense to compare the execution times of a parallel computation over a range of PES, (Pmin 5 P 5 Pmaz).
Communication complexity, the blocking model and the number of outstanding messages are suggested as means to provide an architecture independent characterization of a parallel algorithm [9] . A model of parallel execution which takes into account the effects of sequential and duplicated work, communication complexity and blocking is presented in 53. An incore 3-D FFT algorithm for a hypercube is discussed in 54 and an experiment in monitoring a Chebyshev iterative algorithm for solving a linear system of equations is presented in $5.
The Speedup and Other Measures of Performance
The speedup of a parallel computation on a multiprocessor with P identical processing elements is the ratio with T ( l ) the execution time of a serial implementation and T ( P ) the one of a parallel implementation using P processing elements. The speedup curve S = S ( P ) is the graph of the speedup as a function of P, the number of PES.
The speedup is considered a simple and expressive way to measure the overall performance of a parallel algorithm and of its implementation on a parallel architecture. The speedup appeals to the practitioner interested in a simple way to measure the performance of a parallel application and to those interested in more theoretical aspects of parallel algorithm design, parallel architectures, and performance. The speedup provides an elegant way to reason about the asymptotic behavior of a parallel computation, and upper bounds for the speedup can be derived in the framework of a theoretical model [l] , [8] . In the same time, it seems relatively easy to measure accurately the speedup of any application.
Yet the virtues of the speedup concept must be reexamined as our understanding of parallel computing and our experience in using massively parallel computers grow. The question whether the problem size should be allowed to grow when the number of PES increases or if fixed-size problems should be considered, has lead to the introduction of the concept of the scaled speedup [3] . Allowing the size for a problem to grow subject to an upper bound on the execution time, produces different results than considering space (memory) constraints as pointed out in [lo].
The scaled speedup addresses the concern that the parallel execution time, the efficiency, and the speedup for a small, fixed size problem, can only be very small when the number of PES allocated to the problem exceeds a certain range. But the converse is also true. Given large, fixed size problems suitable for a configuration with a large number of PES, the execution time of the corresponding sequential computation can only be very large. Therefore, for massively parallel computations, it seems reasonable to define a range of Pmin to P, , , processing elements and compare the execution time over that range only.
From a practical standpoint, it is very difficult and often impossible to measure the speedup of very large problems running on existing MIMD systems. MIMD systems with hundreds of PES are in use today and promising results for different applications running on such systems are reported. For example, applications requiring lo4 -lo5 seconds on a 520 node Touchstone Delta are described [ll]. The performance of such computation is given in terms of the Mflops rate rather than in terms of the speedup, simply because the execution time of a sequential implementation is prohibitively large and cannot be measured.
Even for small problems, it is often impossible to measure the execution time of the sequential implementation due to storage constraints, One of the main attractions of distributed memory MIMD systems is the large amount of storage available. It is very unlikely that an application which needs all the 8 Gbytes of storage available on 520 PES of the Touchstone Delta (or 32 Gbytes on a 1000 P E CM5), is able to run in the 16 Mbytes available on a single node of the Delta (or 32 Mbytes on a node of CM5). In Section 4, we discuss the performance of a 3-D FFT computation on an INTEL iPSC/860 with 16 Mbytes/node. The measurements show that, even for relatively modest sizes of a 3-D mesh, the computation cannot be carried out in one node only, due to storage limitations. The dimension of the mesh considered were in the 16 to 8192 range. In about 30% of the cases 70 out of determined because the problem was too large to run in a single node.
A further complication is due to the fact that the instruction execution rate (the Mflops rate) of modern RISC processors like the i860 used to build massively parallel systems, is very sensitive to the cache management. Experiments reported in [4] show that, depending upon the cache usage, the BLAS routines run at rates ranging from 10 Mflops to the peak rate of 60 Mflops. Experiments with the 3-D F F T computation, discussed in Section 4, show that the Mflops rates for a problem of constant size, but with different mesh shapes differ by a factor of as much as 2.5. Yet the speedup is based upon the implicit assumption that the sequential and the parallel implementation used to measure the speedup run on a computing engine with a fixed instruction execution rate. If the instruction rate is not constant, the speedup may appear artificially high, even though the aggregate Mflops rate is low, simply because the sequential implementation is inefficient. In Section 4, it is shown that a 3-D F F T computation of constant size (262144 nodes), runs at only 75 Mflops and exhibits a speedup of about 10 on a 16 node iPSC/S60 when the mesh shape is 16 x 16 x 1024, but runs at about 125 Mflops with a speedup of only 7 for a 64 x 64 x 64 mesh.
To account for this effect, the speedup inflation rate, i n f l ( P ) = w, is introduced and it is suggested that the eflective speedup Sef ( P ) = .
* J be used instead of the speedup. In this expression I ( P ) represents the instruction execution rate with P processing elements and I ( 1) the rate when only one PE is used.
To provide a practical measure of performance for problems of growing size, the concept of relative speedup is introduced. Let Q be the smallest number of PES that can be used to run a problem of fixed size.
The relative speedup SP,Q is defined as 210 different mesh sizes), the speedup cou I d not be Clearly SP,Q = w. The relative speedup curve is then the function While the relative speedup provides a practical measure of performance useful for a practitioner interested in performance analysis, we propose to investigate other measures of performance for the parallel algorithms, the implementation and their suitability to a particular parallel architecture.
The communication complezity relates the total number of events, E, to the number of threads of control, P; it provides an architecture independent measure of performance of a parallel algorithm. There are embarrassingly parallel algorithms with E = 0(P) Yet another characterization of a parallel algorithm important for implementation on distributed memory MIMD systems, is the number of outstanding messages. The message passing programming model is where a thread of control produces intermediate data, performs an explicit action to send these data to other thread(s) which consume the data. To avoid blocking phenomena, namely consumer threads waiting for the data to be produced, an algorithm may attempt to produce the intermediate results as soon as possible. But in this case, the intermediate data must be buffered either at the sender's site, at the consumer's site, or within the system communication network. In case of massively parallel computations, intermediate results are produced at a hi h rate, due to the large number of threads of controf, and, if their lifetime is significant, then all the storage space is exhausted and the computation deadlocks.
The 3-D FFT computation discussed in $4, avoids this type of problem and ensures that data is consumed as soon as it is produced. The strategy is to group the threads of control into pairs, to have both PES allocated to threads in the same pair first act as consumers by issuin an asynchronous receive, then synchronize and finalfy send the data. In this case, the number of outstanding messages is zero. At the other end of the spectrum are algorithms with P threads of control broadcasting data, potentially at the same computations can be rather inefficient if the 6 locking time, so that each PE may have (P -1) outstanding messages at some time and the total number of outstanding messages may be P(P -1). The maximum number of outstanding messages at any one time is v ( P ) = 0 for the 3-D FFT algorithm and v ( P ) = (?(Pa) for the algorithm when all PES broadcast possibly at the same time.
A Model of Parallel Execution and Upper Bounds of the Speedup
A model of parallel computation is discussed in this section which is able to predict the asymptotic behavior, the maximum value for the speedup, and the o p timum number of PES. The model takes into account different sources of inefficiency in parallel computing. The effects of strictly sequential and duplicated work, the effect of the communication complexity and of blocking are discussed.
Sequential and Duplicated Work
Amdahl has shown that only rarely the entire work required by a computation can be carried out in parallel. I f s denotes the fraction of the strictly sequential computation, Amdahl's law [l] shows that an upper bound for the asymptotic speedup is S 5 l / ( l + s and the maximum speedup.
In addition to the strictly sequential work, there is an additional obstacle in reaching a large speedup when parallelizing a computation, namely some work needs to be duplicated by several or all threads of control in order to reduce the amount of communication and/or blocking. Computation of the trigonometric functions for a Fourier expansion is a good example of work duplicated in parallel FFT algorithms.
In the model presented in this paper, f denotes the fraction of the strictly sequential work, plus the fraction of the work duplicated by all the threads, and W,,(P) the total amount of additional work due to duplication and strictly sequential computation. Then Wld(P) reflects the fact that each of the (P -1) threads wastes f x W(1) cycles, either by being idle when only one thread carries out the strictly sequential computation or by replicating the work of one thread. W(1) denotes the work carried out when only one thread of control is running. Then we have that only a relatively few PES are needed to ac h ieve
It follows that when only duplicated and sequential work is taken into account the total work with P threads is
W ( P ) = w ( l ) + W , d ( P ) = W ( l ) X ( l + f x ( P -l ) ) .
We assume that the relationship between the execution time, T(P , the total amount of work or computing cycles, d ( P ) , and the work intensity or instruction execution rate I is
and it follows immediately that
This is precisely the Amdhal's law, but here f represents the fraction of both sequential and duplicated w o r t . To reach a fraction q of the asymptotic speedup, S, = l/f, (0 5 q 5 1) then Pq processing elements are necessary with
For example when f = 0.01, the asymptotic speedup is S, = 100. A speedup of 80 (q = 0.8) can be reached with Po.8 = 400 processing elements.
Communication complexity and the asymptotic speedup
Consider a parallel computation with P threads of control which need to communicate with one another. Call any interruption of a thread of control for communication and control an event, and denote the total number of events by E, and the average amount of work associated with an event by 0, 6 
Following the same arguments concerning the timework relationship, it follows that P
Two more cases are considered now, namely E = k P l o g P and E = k P 2 . In both cases, the speedup reaches a maximum Sma, = S(P,,t) and then goes to zero asymptotically. The maximum speedups are, respectively, 1
The speedup S( P ) with P processing elements and the e$ciency, q(P), are related by
It follows that in the two cases examined above, the efficiencies are, respectively, The effect of the communication complexity upon the speedup curve is shown in Figure 1 .
The speedups S:;;, SkPlogp) and q;;) for the three cases E = O ( P ) , E = O ( P log P ) and E = O( P2).
The instruction execution rate and
Whenever the speedup of a particular computation is determined experimentally, it is implicitly assumed that the instruction execution rate is the same when T(1) and T ( P ) are measured. The model presented so far in this paper is based on the same reasonable assumption, namely that W(1) = IxT(1) and W ( P ) = P x I x T ( P ) for P > 1.
Yet this assumption may prove to be false and measurements reporting spectacular speedups may be misleading. To illustrate this effect a very simple experiment and its results are outlined below. The experiment consists of running a problem of fixed size on a variable number of PES. The problem is to trans form a vector of fixed length and use the saxpy BLAS routine available from the iPSC/SSO math library.
The computation is repeated for three different vector lengths. In the first case, the vector length is so large that even when it is distributed to eight PES the cache of each PE cannot hold all the data assigned to it. Then the vector length is decreased to the point where a single PE cannot hold the vector in its data cache, but when the data is distributed to eight PES, each of them can keep its data in the cache. In the third case, a vector of small length is used. Even when one PE is used the vector fits into the data cache. To keep the total amount of work the same, the product of the vector length to the number of iterations is kept constant. It follows that a model of parallel execution should assume that the instruction execution rate, I, is also a function of P and that W ( P ) = P x I ( P ) x T(P).
The effects of blocking and idle
Blocking occurs when a thread of control wastes its cycles waiting for data produced by another thread (see Figure 5 in Section 5 ) . If Tb,k(P) denotes the blocking time and Tcalc(P) the computing time with P processing elements, then threads of control With this view of a 3-D FFT, the obvious data partitioning strategy of an in-core computation is to assign to each PE one or more planes for the 2-D FFT t r a n s formation. Such a group of planes is called a slab. A slab orthogonal to the y axis is called an y-slab and one orthogonal to the t axis is called a z-slab. Figure 2 illustrates this data partitioning and shows that each PE needs to gather the slices of a slab assigned to all other PES for the second step of the algorithm.
T(P)
=
The Implementation
The kernel of a program [6] which implements these ideas, is presented in Figure 3 . The node program uses two system calls, numnodes and mynode to get the size of the partition and the id of the current PE. Then the number and the orientation of planes in a slab, the slab width, d , and d , , and size, sb, for y-slabs and tslabs, and the slice size, sl are determined. The PE performs a 2-D FFT on t i e d, planes of the y-slab assigned to it. The second for-loop distributes slices of the y-slab processed by the PE to all other PES, An algorithm with E = O(P) is now outlined.
Consider a minimum spanning tree like the one used for the broadcast/collapse mechanism in [7] . In the first phase of the algorithm, each node receives the yslab(s) processed by its children, creates a super-slab and sends it to its parent. This phase terminates when the root node receives the entire 3-D mesh. Then the root performs the transposition of the mesh, creates super-a-slabs and sends them to its children. In turn the children distribute super-t-slabs and the process terminates when the leaves receive their z-slabs. It is very likely that the blocking time of the O(P) algorithm will be larger than the one for the O(P2) algorithm. Even more important, the space requirements of this algorithm are considerably larger than those of the original algorithm analyzed in this section. There is one node (the root of the minimum spanning tree) which must have enough space for the entire 3-D mesh. Each of its children must have space for half the 3-D mesh and so on.
Measurements
An experiment performed on an INTEL iPSC/860 consists of running a problem of fixed size N given), of running the problem on a single node are reported in Table 2 and show that even though the amount of work, W(1) = k' x N x logN is the same, the i860 processor runs at different rates ranging from 7.5 to 17 Mflops. A plausible explanation in tune with results reported elsewhere [4] , is that the i860 processor is very sensitive to the cache management. When the mesh is highly asymmetric (16 x 16 x 1024) , the cache management is poor and the processor runs at a low rate, only 7.5 Mflops, but in the symmetric case 64 x 64 x 64 mesh), the cache management works we1 I and the Mflops rate increases by a factor of almost 2.5.
Therefore, it is very plausible for many computations that each PE will run at one rate (probably higher) when the problem is solved with P processing elements and data is distributed across P processors, and at a different rate, possibly a lower one, when using only one processing element and all data is stored in one node only so that the cache management is likely to be poor. If this is true, then a problem which exhibits a low Mflops rate running in one node only will show an artificially high speedup with P processing elements. Conversely a problem which runs well (at a high Mflops rate in one node will exhibit a because it does not benefit from an increase in the Mflops rate per processor.
The results shown in Table 3 confirm this suspicion. Indeed, when running on a 16 node iPSC/860, the highest speedup is observed for the problem running at the lowest Mtlops rate on 16 nodes and on one node. Table 3 : The instruction execution rate and the speedup for a 16 node iPSC/860 running the 3-D FFT computations for a mesh of constant size, but with different shapes. The instruction rate per processor is given, compare with Table 2 .
A Chebyshev Iterative Algorithm for Solving a Linear System of Equations
Iterative methods are freauentlv used in different areas of scientific computing.. For example, the phase refinement used in X-ray crystallography [ll] is based upon an iterative scheme. Such methods are used in linear algebra for solving systems of linear equations. To guarantee numerical convergence iterative methods y-slab and needs (nproc-1) slices for the (me)-th often require global synchronization after each iteraz-slab from all other nodes. */ tion. Excessive communication and blocking are limiting the performance of such methods on distributed memory MIMD systems [9] . Schemes which require less frequent synchronization are discussed in [7] .
We describe two experiments to measure and con- The detailed behavior of all threads of control was captured by recording all the events, marking changes of state for every thread. For every event a trace record, which contains the pertinent information about the event, type, time stamp, PE, amount of data transferred, etc. is created. All the measurements reported are based upon a clock with resolution Figure 4 : The expected number of events per thread of control and a 95% confidence interval for it. This data is for the two cases of the first experiment. This implementation uses broadcasting rather than multicasting for two reasons. First, the particular machine the experiment was carried on does not support efficient multicasting. Second, to multicast, each subdomain needs to know the ids of the PE where its neighboring subdomains are assigned. An algorithm which statically assigns subdomains to PES so as to preserve geometric proximity was not available, due to the complexity of the subdomain shapes. Thus, efficient multicasting was not possible even if it were available. Some events lead to blocking which is an important source of low processor utilization and speedup. The algorithmic blocking occurs when the need for data at the consumer's site precedes the actual generation of data at the producer's site. The propagation delay measures the time it takes for one bit of information to travel from the source to the destination. It depends upon the interconnection network and upon the routing method used, it may include some time needed to establish the connection. The data transmission time measures the time to send the data and it depends upon the amount of data being transmitted and upon the hardware speed.
Blocking is a major factor in the poor performance seen in the first experiment. It accounts for about 70% of the read time for 4 processors; this fraction rises to about 92% for 64 and 128 processors. Table  4 shows the expected algorithmic blocking time per read operation. We see that most of the blocking is in fact algorithmic blocking, therefore increasing the communication speed is not likely to have a significant effect upon this computation.
The second experiment [2] is for a completely new implementation based on the same underlying numerical method. The Ncube I1 is used along with a new communication scheme that eliminates most of the broadcasts. The domains are assigned to processors so that every exchange of data between domains is made between physical neighbors on the hypercube. Further, the exchanges are organized so there is never any contention in the communication. Roughly speaking, the exchanges follow the pattern: all processors send to west neighbors, then east neighbors, then south neighbors, and finally, north neighbors. A global synchronization is still required in order to test for terminating the iteration and associated computations. This is accomplished by log (P) local 4-way exchanges as described above) at ea% iteration. One E = O(P log2 P), a great improvement in the signature function compared to the first experiment. The performance observed is quite satisfactory in the second experiment, sample results are given in Table 5 .
Conclusions
This paper discusses the use of the speedup as a measure of performance of a parallel computer. It sees that t 6 e number E of events is now given by points out that execution time and the space requirements of some of the applications running on distributed memory MIMD systems like the Touchstone Delta system are so large that speedup cannot be determined simply because the sequential computation requires a prohibitively large amount of time. It also shows that the processor architecture, in particular the amount of cache available can distort the results of simple minded measurements. Superlinear and/or very large speedups may be due to an inefficient sequential computation rather than a very efficient parallel one. One of several causes of inefficiency of the sequential computation is poor cache management due to a large data space.
To preserve the virtues of a dimensionless measure of performance we propose the relative speedup which compares the execution time over a range of processing elements. This range should be determined so that the execution is carried out in comparable terms, e.g., similar instruction execution rates, measurable execution time and so on.
A model of a parallel execution which takes into account the major causes of inefficiency in a parallel computation, the sequential and duplicated work, communication and control as well as blocking is presented. This model allows the characterization of a parallel algorithm independent of the architecture. For example, an algorithm with communication complexity E = U ( P ) is more scalable than one with E = U ( P 2 ) on any architecture. This means that the maximum speedup attainable is higher and can be obtained with a larger number of PES as shown in $3.2. Yet even this high level characterization can be misleading. For example the 3-D FFT algorithm discussed in $4 has a communication complexity E = U ( P 2 ) but allows parallel communication. The set of PES is partitioned into groups of size 2, all groups communicate in parallel and there is no contention for communication channels. Such an algorithm may run more efficiently than an U ( P ) algorithm in which communication is strictly sequential.
The blocking model of a parallel algorithm is very important. An asynchronous algorithm is more efficient than one which requires global synchronization.
An algorithm in which the blocking time Talk = U ( P ) is less efficient than one with Tb1k = O(1ogP). This model also explains well the performance of the iterative algorithm described in Section 5. Last but not least the number of outstanding messages has a significant impact upon the performance of the parallel algorithm.
Multiple tradeoffs are involved in the design of a parallel algorithm. Communication, space, time, and control complexities have to be balanced in an optimal way to achieve the best performance on a given architecture. For example, the communication complexity can be reduced by increasing the space requirements as shown in $4.3. The communication complexity can be reduced by increasing the amount of the duplicated work. The number of outstanding messages can be reduced by increasing the blocking time as the scheme discussed in $4.2 shows. The scalability of an algorithm can be traded off for schemes which allow parallel and contention free communication.
