In this paper, we study the problem of scheduling parallel loops at compile time for a heterogeneous network of workstations. We consider heterogeneity in various aspects of parallel programming: program, processor, memory and network. A heterogeneous program has parallel loops with different amounts of work being done in each iteration; heterogeneous processors have different speeds; heterogeneous memory refers to the different amounts of user-available memory on the machines; and a heterogeneous network has different communication costs between processors. We propose a simple yet comprehensive model for use in compiling for a network of processors, and develop compiler algorithms for generating optimal and near-optimal schedules of loops for load balancing, communication optimizations, network contention and memory heterogeneity. Experiments show that a significant performance improvement is achieved using our techniques.
INTRODUCTION
Network-based distributed computing has attracted a lot of attention lately, due to the recent advances in high-speed networks (e.g. ATM, asynchronous transfer mode) having low latency and high bandwidth, and due to the ubiquitous presence of workstations linked over local-area networks (LANs).
With most of the machines in a network underutilized, there has been research on harnessing this power in a useful way, for example, to use the workstations to solve the so-called 'grand challenge' problems. Furthermore, such a network of machines, the 'virtual parallel machine', may consist of possibly different types of processors, along with vector and multiprocessor machines. The inherently dynamic nature of the configuration of the virtual machine, depending on which machines are available at the time of running the program, makes architecture-dependent programming almost impossible. The role of generating efficient parallel code must be filled by compilers.
The sources of heterogeneity in a network of workstations (NOW) include the processors, which run at different speeds; the memory, with different amounts of available memory on different machines; the network, with varying communication costs among pairs of processors; and the program, whose parallel loops may have varying amounts of work in each iteration. There are a number of research issues that emerge in this environment, such as:
• Program domain. Parallel applications fall into a number of categories. These programs may have regular or irregular computation and communication, or they may be composed of several subtasks with different processor-machine affinities. Other characteristics, such as the communication-to-computation ratio, could dictate the decision to parallelize and the parallelization used.
• Different parallelizations. In heterogeneous environments, problem decomposition and task placement can have dramatic effects on performance. Depending on the underlying machine architecture and other machine-specific characteristics, different parallelizations may be required for good performance on different machines.
• Machine-processor heterogeneity. A heterogeneous system may consist of various shared and distributed-memory MIMD machines, SIMD and vector machines, and sequential workstations interconnected by a network. This has a significant impact on scheduling and load balancing.
• Processor selection. Typically a large number of machines may be available for use, but we have to select the optimal subset of these machines which will give us the minimum overall execution time. We have to trade off increased computation power against the increased overhead as we increase the number of machines.
• Mapping. Both the programs and machines may have certain characteristics which require the subtasks to be mapped to specific machines, to obtain the best performance [1, 2].
• Memory. The amount of physical memory may be different for different machines. When we want to decide on the largest problem that can be performed efficiently, we must consider the available memory on each machine, and also the memory requirements of the application. We will discuss this point in more detail in Section 7.
• Network latency. Network latency is one of the primary concerns for a heterogeneous NOW. High latency can make communication extremely expensive, and restrict the scalability of the system.
• Network bandwidth. Bandwidth is also a bottleneck, especially for Ethernet LANs. Although it is easier to increase the physical bandwidth (e.g. ATM networks have a much higher bandwidth), the amount of application-level bandwidth remains a small fraction of the available physical bandwidth [3]. With different interconnection networks, the network heterogeneity can become a significant factor in the parallel performance of applications.
• Contention effects. Communication on an Ethernet LAN is more expensive due to high latency and low bandwidth. The network traffic tends to be highly bursty on LANs. Moreover, contention for the bus becomes a critical performance factor. Any performance prediction model for a heterogeneous NOW must take into account the contention that may be caused in the network. Modelling this is a very complex task.
• Load balancing. Homogeneous static load-balancing algorithms must be adapted to work for heterogeneous NOWs. For a multi-user set-up, run-time dynamic load balancing may be required [4]. We have to trade off the task-switching cost against the load-imbalance cost. There are many dynamic load-balancing schemes [5, 6], but these cannot be used for problems with subtasks of various capabilities. Since NOWs are usually loosely coupled, i.e. are connected via a nondedicated network, there is also the issue of external load on the network.
• Data coercion. Machine heterogeneity entails different methods of data representation on different machines. Although this introduces the overhead of data conversion while sending messages among the machines, it is generally not that significant [7].
• Software issues. Differences in the host operating systems, file systems, database systems, interprocess communication, compilers and languages available should be masked while dealing with heterogeneous systems.
Efficient software systems are needed which automate most of the decisions that need to be made in these environments, such as automating the data decomposition, distribution, synchronization and communication for the applications across a wide range of platforms.
In this paper, we assume an SPMD/master-slave model of computation, i.e. all processes essentially execute the same program, but on different data sets. We further assume that all the parallelism comes from 'doall' loops. Since all the tasks are similar, the problem consists of efficient data partitioning among a set of machines, taking into consideration the processor speeds and communication costs, so as to minimize the execution time of the program.
The objective of this paper is to propose compile-time techniques for scheduling parallel loops for a heterogeneous NOW. In particular, we make the following technical contributions:
• We propose a simple model for a heterogeneous network of machines. It serves as a conceptual starting point in compiling for load balancing and communication in such an environment.
• We show by experiments that the conventional ways of measuring processor speed and memory capacity are insufficient for a heterogeneous NOW. We show that normalized processor speed, which may be application dependent, gives a better estimate of processor performance, and that the resident memory size, which may also be application dependent, gives a better estimate of the memory requirement. Furthermore, we show how these two parameters taken together influence scheduling, and lead to better performance.
• We develop a set of architecture-conscious compile-time scheduling approaches for generating optimal or near-optimal scheduling of loops for load balancing and communication, for a network of heterogeneous machines.
• We present experimental results to verify that these techniques produce very good results in practice.
We show that the architecture-conscious scheduling algorithms result in much better performance than the naive architecture-oblivious scheduling approach. Examples are drawn from a mix of synthetic and real applications, from scientific computing and economics modelling [8].
The rest of this paper is organized as follows. We will briefly present the related work in the next section, before introducing our compile-time model. In Section 3, we introduce our program model, which is followed by our machine model in Section 4. In Section 5, we consider scheduling for heterogeneous programs (on homogeneous machines).
In Section 6, we look at the case of heterogeneous processors, with the same communication links, which is followed by scheduling for heterogeneous memory in Section 7. Section 8 deals with the case where the network communication links are heterogeneous, i.e. there are different communication costs between different points in the network. In Section 9, we extend our model to handle the case of scheduling for load balancing while avoiding network contention. We then present experimental results on our proposed techniques in Section 10. Finally, we conclude in Section 11.
RELATED WORK
In this section we look at some of the load-balancing schemes which have been proposed in the literature.
Static scheduling
Compile-time static loop scheduling is efficient and introduces no additional run-time overhead. For UMA (uniform memory access) parallel machines, usually loop iterations can be scheduled in a block or cyclic fashion. For NUMA (non-uniform memory access) parallel machines, loop scheduling has to take data distribution into account [9]. The simplest approach is the static-block scheduling scheme, which assigns an equal block of iterations to each of the available processors. The static interleaved scheme assigns iterations in a cyclic fashion [10].
There has been relatively little work done on static scheduling for heterogeneous NOW; it has also mainly focused on homogeneous applications. Earlier work dealing with processor heterogeneity appears in [7, 11, 12] , and the requirements for distributed computing over LANs have been analysed in [3] . This paper presents compile-time static scheduling algorithms for heterogeneous programs, processors, memory and networks. Preliminary results of this paper can be found in [13] [14] [15] .
Dynamic scheduling
When the execution time of loop iterations is not predictable at compile time, run-time dynamic scheduling can be used at the additional run-time cost of managing task allocation. The dynamic scheduling strategies fall under different models, which include schemes based on predicting the future from past loads, the task queue model and the diffusion model.
Predicting the future.
A common approach taken for load balancing on a workstation network is to predict future performance based on past information. For example, in [16], a global distributed scheme is presented, and load balancing involves periodic information exchanges. Dome [17] implements a global central and a local distributed scheme, and the load balancing involves periodic exchanges. Siegell [18] also presented a global centralized scheme, with periodic information exchanges. The main contribution of that work was the methodology for automatic generation of parallel programs with dynamic load balancing. In Phish [19], a local distributed receiver-initiated scheme is described, where the processor requesting more tasks chooses a processor at random from which to steal more work. CHARM [20] implements a local distributed receiver-initiated scheme. In [4, 21] the authors presented different strategies for dynamic load balancing in the presence of transient external load. They examined both global versus local, and centralized versus distributed schemes, and presented a hybrid compile and run-time system that automatically selects the best load-balancing scheme for a given loop/task from among the repertoire of different strategies.
PROGRAM MODEL
In this paper we will look at parallel loops, that is, loops with iterations which do not depend on one another. We have to address two issues: the amount of computation and the amount of communication in each iteration of the parallel loop. We would like to generate schedules for the loop which are optimal, both in terms of communication and computation, to achieve the best speed-up on a heterogeneous NOW.
Parallel loops
To create a static schedule with a good load balance, we have to know the amount of computation in every iteration. We consider two cases of parallel loops: homogeneous and heterogeneous. By homogeneous loops, we mean parallel loops that have the same amount of computation in each iteration. Heterogeneous loops have a varying amount of work in each iteration. We introduce a program parameter into our model: $x_i$, the number of operations in iteration i. For the homogeneous case, $x_i$ is constant; for the heterogeneous case, we assume that $x_i = ai + b$, i.e. we restrict our attention to loops where the computation is an affine function of the normalized loop index, which captures a large set of scientific programs. Examples include Cholesky, TRFD from the Perfect benchmark suite [25] and SPEC95 benchmark applications.
Communication
With the right placement of data, a parallel loop does not require any communication during execution, i.e. there is no data flow between iterations of a parallel loop. However, communication may be necessary between different loops. Therefore, there may be communication caused by each iteration because of subsequent computation. We assume that each iteration of the parallel loop contributes a precisely defined amount of data to the messages sent after the loop completes. Further, we assume that the messages are being sent after all iterations assigned to a given processor have completed, i.e. there is only one message per processor, consisting of data from all iterations, rather than one message per iteration.
The example below illustrates this type of communication.
Values Y(k) produced in some of the iterations of loop 1 will be used in more than one iteration of loop 2. Every iteration of loop 1 contributes one word to the message sent between loops 1 and 2.
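A minimal sketch of this communication pattern follows. The array contents, the neighbour access pattern in loop 2 and the helper names are hypothetical and chosen only for illustration: loop 1 produces one value `Y[k]` per iteration, loop 2 reads each value in more than one iteration, and each processor sends a single message containing the words produced by its block of loop 1.

```python
def loop1(n):
    # Parallel loop 1: iteration k produces exactly one word, Y[k].
    return [k * k for k in range(n)]

def loop2(Y):
    # Parallel loop 2: iteration j reads Y[j] and Y[j + 1] (wrapping around),
    # so every Y[k] is used by more than one iteration.
    n = len(Y)
    return [Y[j] + Y[(j + 1) % n] for j in range(n)]

def words_per_message(n, p):
    # Block-partition loop 1 over p processors; each processor packs the
    # words from all of its iterations into ONE message after the loop.
    base, extra = divmod(n, p)
    return [base + (1 if proc < extra else 0) for proc in range(p)]

Y = loop1(8)
Z = loop2(Y)
print(words_per_message(8, 3))  # message sizes in words, one per processor
```

Because every iteration contributes the same number of words, this is an instance of the homogeneous communication case.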
We introduce the following program parameter which denotes the amount of communication: $y_i$, the number of bytes that have to be sent as the result of iteration i. Communication contributed by each iteration can be constant (the homogeneous case), as in the example above, or it can vary (the heterogeneous case). In the heterogeneous case, we assume that $y_i = ci + d$, i.e. the amount of communication is an affine function of the normalized loop index.
MACHINE MODEL
The machine model must account for processor, memory and network heterogeneity. Below we present each case separately.
Processor model
In a fully detailed processor model, we would need to consider the speed of a processor in terms of the number of floating-point operations per second, and the number of integer operations per second. We also need to consider memory access time, and the interaction of these with different cache and memory sizes. Multiple instruction issue and instruction pipelining would further complicate the performance model. The multitude of machine parameters makes their use in performance prediction very difficult. Therefore, for our discussion, we will only consider a single parameter, γ, to describe the speed of a machine: $\gamma_i$ is the time for one operation on processor i. The machines may be homogeneous or heterogeneous.
There are a number of ways to calculate γ, the speed of the processor; for example, we could use the MIPS (millions of instructions per second), MFLOPS (millions of floating-point operations per second), Whetstone or the Dhrystone ratings. In modern processors different operations have a different cost, and furthermore, instruction pipelining and multiple instruction issue render it quite difficult to come up with a single figure that characterizes the performance. Therefore, while these figures may give an indication of the processor capabilities, a reliable and consistent performance measure can only be found by using the execution time of different real applications on the machines under consideration.
In our approach, we summarize the processor speeds via the notion of normalized processor speed (NPS), defined as the ratio of the time taken to execute on the processor under consideration, with respect to the time taken on a base processor. Consider Figures 1 and 2, which show the processor performance of a SPARCstation 10 and LX on different applications (see Section 10 for a description of these applications). The time is normalized against the performance of a SPARCstation 1. Our experiments indicate that machine performance varies for different applications. Since the processor speeds vary from one application to another, we approximate the speed based on small trial application runs. Alternatively, we may obtain these by compile-time performance prediction. In [26], the author describes a detailed, architecture-specific, compile-time performance-prediction framework. Porting to different architectures and compilers is quite involved, though possible. Table 1 shows the MIPS ratio and the normalized processor speeds for different applications, for the SPARC 5 versus LX. In Table 2 we show the execution times obtained on a configuration of two machines: a SPARC 5 and an LX. The second and third columns show the execution time using the MIPS ratio and NPS values to balance the workload. It is clearly seen that the normalized processor speed should be used in scheduling, since it results in a balanced computation load and hence gives better performance for the application.
Memory model
Heterogeneous NOWs have a large amount of computational power as well as a large amount of combined memory. We would like to exploit the available resources by solving as large a problem as possible, for example, solving large instances of numerical scientific applications, and other real-world applications such as weather modelling, computational dynamics and other 'grand challenge' applications. The largest data size is limited by the amount of combined memory present in the system. The amount of memory available may be different for different processors. To summarize the memory heterogeneity, we introduce the parameter $m_i$, the amount of memory available on processor i. Table 3 shows the amount of actual physical memory and the amount that is available to user applications in our configuration. We can use the above values to decide on the largest problem size we can run, by calculating when the total memory requirement of an application would exceed the user-available memory capacity on a given machine. Ways of estimating the memory requirement for an application will be discussed in Section 7.
Network model
For a network of workstations, we also have to consider the cost of communication between any two machines, i.e. we must consider the interplay of latency and bandwidth between points in the network. Furthermore, when we talk of communication between two machines we must consider the cost of packing (marshalling) the data, receiving the data and the cost of the 'real' communication, i.e. the time actually spent in the physical medium. We have two parameters for each of the above three cases: the startup time (independent of the message size) and the actual time spent in performing the action (proportional to the message size).
Rather than dealing with six or more parameters, we simplify our model and consider the startup time and the cost for the action to be the sum of the costs for all three stages, and thus have the following two parameters: $\alpha_i$, the startup time for a message on processor i, and $\beta_i$, the time to send one byte of data on processor i. The network of machines can be either homogeneous or heterogeneous. In the former case, $\alpha_i = \alpha$ and $\beta_i = \beta$ for all the machines. In the latter case, these vary with the machine. The values for the latency and bandwidth are obtained via off-line network characterization experiments.
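Taken together, the processor and network parameters give a simple per-processor cost estimate. The following is a minimal sketch, assuming (as in the program model) one message per processor sent after its iterations complete; the function name and the sample values are hypothetical.

```python
def processor_time(ops, nbytes, gamma, alpha, beta):
    """Estimated time for one processor under the simplified model.

    ops    -- total operations in the iterations assigned to the processor
    nbytes -- total bytes those iterations contribute to its single message
    gamma  -- time per operation
    alpha  -- message startup time
    beta   -- time to send one byte
    All times must be expressed in the same unit.
    """
    compute = gamma * ops
    communicate = alpha + beta * nbytes  # one startup cost, then a per-byte cost
    return compute + communicate

# Hypothetical values, all in microseconds.
print(processor_time(ops=1000, nbytes=400, gamma=0.5, alpha=100.0, beta=0.25))
```

With per-processor values of gamma, alpha and beta the same function covers the heterogeneous cases; contention is not modelled here.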
The discussion so far has assumed that messages from different machines can be sent at the same time. For many machines this is not a realistic assumption. Contention in the network adds complexity to the model. The discussion of this, more complex, case will be deferred until Section 9.
SCHEDULING FOR HETEROGENEOUS PROGRAMS
In this section we consider heterogeneous programs (parallel loops) on parallel machines with homogeneous processors and a homogeneous network. As discussed in Section 4, the following parameters describe this type of machine: p denotes the number of processors in the system, γ the time to execute one operation, α the communication initialization time and β the time to send one byte of a message; n denotes the number of iterations of the loop. As a simple introduction to loop scheduling, we first consider homogeneous parallel loops without communication. Every processor has the same speed, every iteration requires the same amount of computation, and there is no communication. With these assumptions, every processor should execute approximately the same number of iterations. If n is a multiple of p, every processor will have exactly the same number of iterations: n/p. Otherwise, some processors will execute $\lfloor n/p \rfloor$ iterations while others will have $\lfloor n/p \rfloor + 1$. In this case, it is not important which processors have one more iteration to execute. In the case of homogeneous loops with communication, every iteration causes the same number of bytes to be sent. Therefore, the static scheduling above will evenly distribute both computation and communication.
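The block distribution for homogeneous loops can be sketched as follows. The function name is hypothetical, and giving the extra iterations to the lowest-numbered processors is an arbitrary choice, since the text notes that it does not matter which processors receive them.

```python
def block_schedule(n, p):
    """Split iterations 1..n into p contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n, p)
    blocks, start = [], 1
    for proc in range(p):
        size = base + (1 if proc < extra else 0)  # first `extra` processors get one more
        blocks.append(range(start, start + size))
        start += size
    return blocks

for proc, block in enumerate(block_schedule(10, 3), start=1):
    print(proc, list(block))
```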
Heterogeneous loops: no communication
For heterogeneous loops, again, we deal with the communication-free case first. As discussed in Section 3, this type of loop is characterized by the parameter $x_i = ai + b$, for i = 1, . . . , n. Rather than solving this problem directly, we will show how to transform this loop into a homogeneous parallel loop and use the scheduling strategy for homogeneous loops presented above. This transformation is shown graphically in Figure 3.
Case I: n = 2pt, for some integer t
We first consider a special case when the number of iterations n is a multiple of 2p. We can transform this loop into a homogeneous parallel loop with n/2 iterations. Note that the sum of the work in iterations i and (n − i + 1) is a constant:
\[ x_i + x_{n-i+1} = (ai + b) + (a(n - i + 1) + b) = a(n + 1) + 2b. \]
We can therefore combine iterations i and (n − i + 1) into one iteration of a new parallel loop. This new loop is homogeneous with a(n + 1) + 2b operations in every iteration. As all processors execute exactly n/(2p) iterations of the transformed loop (that is, n/p iterations of the original loop), there is no imbalance.
Case II: n ≠ 2pt, for any integer t
In the general case, there may be imbalance. Let r = n mod (2p). If r ≠ 0, the imbalance is caused by the remaining r iterations. We can make the imbalance very small by choosing those r iterations to be very short (the loop is not homogeneous). To achieve this, we take the first r iterations if a > 0, since we have an increasing amount of computation in this case, and the last r iterations otherwise. Now, if r ≤ p, then r processors get one iteration each; otherwise the first r mod p processors get two iterations each (we can transform the 2(r mod p) consecutive iterations into a homogeneous loop), and the remaining processors take the longest 2p − r iterations. The schedule obtained in this way is close to optimal. We call this approach bitonic scheduling [13], since the iterations are assigned to processors in an increasing and decreasing fashion.
We shall illustrate this optimization with the following example. Let the number of iterations, n = 10, and the number of processors, p = 3. Let $x_i = i$, i.e. a = 1 and b = 0. To get the optimal schedule, we first compute r = 10 mod 6 = 4. Because a > 0, we take away the first four iterations. The last six iterations can be perfectly balanced with each processor getting two iterations. In our case processors 1, 2 and 3 get iterations 10, 5; 9, 6 and 8, 7 respectively. Since r > p, we compute r mod p = 4 mod 3 = 1, and thus, the first processor gets two iterations from the beginning, i.e. it gets iterations 1 and 2. The other two processors can pick up the two remaining iterations. So processors 2 and 3 get iterations 3 and 4 respectively. Figure 4b represents pictorially our discussion above; clearly, our schedule is optimal.
We will contrast our technique with another popular technique for load balancing. Often, iterations of heterogeneous loops are assigned in an interleaved fashion using round-robin scheduling. For our example above, with interleaved scheduling, processor 1 gets iterations 1, 4, 7, 10; processor 2 gets iterations 2, 5, 8 and processor 3 gets iterations 3, 6, 9. The completion time using our schedule, 19 s, is shorter than the completion time using interleaving, 22 s. Figure 4c clearly shows that this strategy is non-optimal.
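The example can be reproduced with a short sketch of both schedules. This is one possible reading of the bitonic procedure described above, not a definitive implementation; the function names are hypothetical, and the order in which pairs are handed to processors is chosen to match the example.

```python
def bitonic_schedule(n, p, a=1):
    """Assign iterations 1..n (work x_i = a*i + b) to p processors."""
    # Order iterations from cheapest to most expensive.
    order = list(range(1, n + 1)) if a > 0 else list(range(n, 0, -1))
    r = n % (2 * p)
    leftover, paired = order[:r], order[r:]   # the r cheapest are set aside
    sched = [[] for _ in range(p)]
    # Pair the k-th most expensive with the k-th cheapest: all pairs cost the same.
    m = len(paired)
    for k in range(m // 2):
        sched[k % p] += [paired[m - 1 - k], paired[k]]
    # Distribute the leftover iterations.
    if r <= p:
        for proc, it in enumerate(leftover):
            sched[proc].append(it)
    else:
        doubles = r - p                       # equals r mod p, since p < r < 2p
        pos = 0
        for proc in range(p):
            take = 2 if proc < doubles else 1
            sched[proc] += leftover[pos:pos + take]
            pos += take
    return sched

def interleaved_schedule(n, p):
    # Round-robin: processor q gets iterations q, q + p, q + 2p, ...
    return [list(range(proc + 1, n + 1, p)) for proc in range(p)]

def completion_time(sched, a, b):
    # Time of the most heavily loaded processor, in units of one operation.
    return max(sum(a * i + b for i in its) for its in sched)

print(bitonic_schedule(10, 3))
print(completion_time(bitonic_schedule(10, 3), 1, 0))      # 19
print(completion_time(interleaved_schedule(10, 3), 1, 0))  # 22
```

The per-processor loads are 18, 18 and 19 for the bitonic schedule and 22, 15 and 18 for the interleaved one.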
Heterogeneous loops: with communication
The case with communication can be handled with a slight modification of the above transformation. Every iteration of the new parallel loop will cause c(n + 1) + 2d bytes to be sent. When 2p divides n, the homogeneous loop obtained in this way can be scheduled as described in Subsubsection 5.1.1. When 2p does not divide n, we have to use an approach similar to the approach used in Subsubsection 5.1.2. We find r = n mod (2p). We can perfectly schedule n − r iterations, and we choose the r iterations such that they are the 'cheapest' in terms of the imbalance they produce, and assign them as before. The difference is that this time we do not use the sign of a to determine which iterations are the 'cheapest'. We have to use the sign of (γa + βc) (recall that $y_i = ci + d$ is the communication for iteration i), because this constant determines whether the time spent on computation and communication increases or decreases with the iteration number.
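The sign test can be checked directly from the model parameters. Writing $t_i$ (a symbol introduced here for illustration only) for the time attributable to iteration i, i.e. its computation plus its share of the message, and noting that the startup cost α is paid once per processor and does not depend on i:

```latex
\begin{align*}
t_i &= \gamma x_i + \beta y_i = (\gamma a + \beta c)\, i + (\gamma b + \beta d), \\
t_i + t_{n-i+1} &= \gamma\bigl(a(n+1) + 2b\bigr) + \beta\bigl(c(n+1) + 2d\bigr).
\end{align*}
```

The pair sum is independent of i, so the combined loop is homogeneous, and $t_i$ grows with i exactly when γa + βc > 0.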
SCHEDULING FOR HETEROGENEOUS PROCESSORS
In this section we consider parallel machines with heterogeneous processors and a homogeneous network. The following machine parameters describe these types of machines: p, the number of processors in the system; $\gamma_i$, the time for processor i to execute one operation; α, the communication initialization time; and β, the time to send one byte of a message. As usual, n denotes the number of iterations of the loop. We call the straightforward way of assigning the same amount of work to each processor the architecture-oblivious approach, and the algorithms developed in the following sections the architecture-conscious approach.
Homogeneous parallel loops: no communication
We create the schedules by trying to balance the computation on all the processors. We note that, to evenly distribute computation, every processor should have a fraction of all the work given by the following formula:
\[ w_i = \frac{1/\gamma_i}{\sum_{k=1}^{p} 1/\gamma_k}. \]
To get the optimal load balance, we should assign $z_i = w_i n$ iterations to processor i. Since $z_i$ is not necessarily an integer, we have to decide whether $\lfloor z_i \rfloor$ or $\lceil z_i \rceil$ should be used. If the iteration space is large this decision is not very critical. We break the tie in the following way. Processor i works on iterations $\lfloor \sum_{k=1}^{i-1} z_k \rfloor + 1$ through $\lfloor \sum_{k=1}^{i} z_k \rfloor$. The schedule obtained in this way is optimal. A similar approach, distributing the load proportionally to the relative speeds of the processors, has been used with success in [7].
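A minimal sketch of this speed-proportional partitioning follows. The function name is hypothetical, and rounding the prefix sums down is an assumption made here for illustration. Exact rational arithmetic is used so that the last boundary is exactly n.

```python
from fractions import Fraction
from math import floor

def proportional_schedule(n, costs):
    """Split iterations 1..n in proportion to processor speed.

    costs[i] is the time per operation on processor i (gamma_i in the text).
    Returns one (first, last) pair per processor; first > last means no work.
    """
    inv = [1 / Fraction(c) for c in costs]    # speeds 1/gamma_i
    total = sum(inv)
    bounds, prefix, prev = [], Fraction(0), 0
    for w in inv:
        prefix += w / total * n               # running sum of z_k = w_k * n
        end = floor(prefix)                   # assumption: prefix sums rounded down
        bounds.append((prev + 1, end))
        prev = end
    return bounds

# One processor twice as fast as the other two (hypothetical speeds).
print(proportional_schedule(10, [1, 2, 2]))
```

Here the ideal shares are 5, 2.5 and 2.5 iterations, and the rounded schedule assigns 5, 2 and 3.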
Homogeneous parallel loops: with communication
When there is communication, the algorithm in Subsection 6.1 will not necessarily generate an optimal schedule. Here we present an optimal solution. For the uniform case, $x_i = x$ and $y_i = y$ for i = 1, . . . , n. The communication time caused by $z_i$ iterations is $\alpha + \beta y z_i$ (recall that all objects to be sent are packed into one message and sent after the computation has completed). Hence, the total time spent by processor i on computation and communication is
\[ \gamma_i x z_i + \alpha + \beta y z_i = \alpha + g_i z_i, \]
where $g_i = \gamma_i x + \beta y$. Note that for this to work, we have to ensure that $\gamma_i x$ and $\beta y$ are in the same units, say, microseconds.
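As in the other cases, our goal is to find a set of $z_i$ that minimizes $\max_i (\alpha + g_i z_i)$, subject to $\sum_{i=1}^{p} z_i = n$. We set
\[ w_i = \frac{1/g_i}{\sum_{k=1}^{p} 1/g_k} \]
and proceed as in Subsection 6.1.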
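The weights follow from requiring all processors to finish at the same time. The derivation below is a sketch that ignores the rounding of $z_i$ to integers; T, the common finishing time, is a symbol introduced here for illustration.

```latex
\begin{align*}
\alpha + g_i z_i &= T \quad (i = 1, \dots, p)
  &&\Longrightarrow\quad z_i = \frac{T - \alpha}{g_i}, \\
\sum_{i=1}^{p} z_i &= n
  &&\Longrightarrow\quad T - \alpha = \frac{n}{\sum_{k=1}^{p} 1/g_k}, \\
z_i &= n \cdot \frac{1/g_i}{\sum_{k=1}^{p} 1/g_k} = w_i\, n .
\end{align*}
```

Because α is the same for every processor, it shifts the common finishing time but does not change how the iterations are split.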
Heterogeneous parallel loops
In this case, we can again first transform a heterogeneous loop into a homogeneous loop and then apply the methods described above. Note that, although this approach results in a schedule which is optimal for the transformed homogeneous loop, it is not necessarily optimal for the original heterogeneous loop. The possible load imbalance is, however, very small. The work assigned to each processor is different from the optimum by at most one iteration.
SCHEDULING FOR HETEROGENEOUS MEMORY
In this section we examine ways of estimating the memory requirement for an application. We then extend our scheduling algorithms to account for heterogeneous memory.
Resident memory size (RMS)
Our experiments show that using the total memory requirement is generally not a good criterion for judging the largest problem size we can run efficiently. Table 4 shows the results obtained for the matrix multiplication program on a configuration having a SPARC 5 and a SPARC LX machine. We first distributed the work among the two machines proportionally to their NPS values, using the architecture-conscious technique from the last section. This distribution causes the total memory requirement for the SPARC 5 (28.0 Mb, column 3) to exceed the user-available memory for it (24.5 Mb). We then redistributed the data among the processors so that we respect the memory constraint on the SPARC 5. But this caused an increase in the execution time (see columns 4 and 6). The reason is that the total memory requirement is a very conservative measure, and generally overestimates the memory requirement of an application. We therefore introduce a new notion, the resident memory size (RMS) for a given program segment, defined as the minimum number of pages of physical memory required to ensure that all page faults are cold misses (i.e. due to the first reference) for that segment, using a particular page replacement algorithm. We believe that this gives a better indication of the memory requirement for an application. Note that RMS is a program-level analogue of the operating system's notion of working set size (WSS) with an appropriate window size (WSS is defined as the set of pages touched by the most recent page references, where the number of references considered is the window size).
For a particular application, as we increase the data size we will reach a critical point beyond which the performance of the program degrades rapidly. This critical data size cannot simply be obtained from the total memory requirement for the application. Usually the RMS should be a good approximation of this critical point. For example consider the matrix multiplication program, MXM, which computes C = A * B, where A, B and C are N × N matrices. The total memory requirement for this program is $3N^2$. However, notice that all three matrices need not occupy the memory at the same time. If we compute the C matrix a row at a time, we need to keep only one page of C and one row of A in memory, but we must have the whole of matrix B in memory. Therefore, if we calculate the resident memory size for MXM, we get the following, approximate, formula:
The above RMS is calculated using an ideal page replacement scheme. Using the LRU (least recently used) page replacement instead, would give
We observe that if the resident memory size is less than the user-available memory then our program will not suffer from the effects of memory limitations. If, on the other hand, the program's RMS is larger than the available memory then some of the pages required will not be in memory, and we will have to take a page fault. As the input data size increases, the RMS increases, ultimately exceeding the available memory. If we attempt to run very large programs then we will cause the machines to thrash, severely degrading the performance.
We use a compile-time algorithm to approximate the RMS. We compute the number of pages contributed to the RMS by every array reference in a loop nest. We first find the stride vector [27] for a given reference and then determine the outermost loop carrying reuse. For all loops enclosed by this loop we use strides and loop bounds to calculate the number of reused pages.
Let us illustrate the algorithm with an example. Consider the following loop nest from the matrix multiply program. 
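The loop nest and its stride vectors are not reproduced in this copy. As a hedged reconstruction, the sketch below assumes the usual triple loop `c[i][j] += a[i][k] * b[k][j]` over row-major $n \times n$ arrays, which is consistent with the strides quoted in the discussion of this example. The helper name `stride_vector` is hypothetical and only serves the illustration.

```python
LOOPS = ("i", "j", "k")  # loop order, outermost to innermost (assumed)


def stride_vector(row_index, col_index, n, loops=LOOPS):
    """Stride, in array elements, of a row-major reference
    x[row_index][col_index] between consecutive iterations of each loop."""
    return tuple(n * (v == row_index) + (v == col_index) for v in loops)


def mxm(a, b, n):
    """Assumed loop nest for the matrix multiply example."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


n = 8
v_a = stride_vector("i", "k", n)  # strides of a[i][k] in loops i, j, k
v_b = stride_vector("k", "j", n)  # strides of b[k][j]
v_c = stride_vector("i", "j", n)  # strides of c[i][j]
```

Under these assumptions `v_a` is `(n, 0, 1)`, `v_b` is `(0, 1, n)` and `v_c` is `(n, 1, 0)`, read from the outermost loop to the innermost one.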
For a given reference, a stride vector has one element for every loop enclosing the reference. An element of a stride vector is equal to the memory stride between consecutive iterations of the corresponding loop. In our example, the bottom element of $v_a$ is 1, which means that accesses to array a in loop-k have unit stride. The two other elements of $v_a$ indicate that the stride in loop-j is 0 and the stride in loop-i is $n$. Stride vectors are used to describe the locality of memory accesses. Assume that a page holds $p$ array elements and that $1 < p < n$. Consider the reference to array a. We can see from the stride vector that there is temporal reuse carried by loop-j and spatial reuse carried by loop-k. The outermost loop carries no reuse.
For the reference to array a, loop-j is the outermost loop with reuse. According to our algorithm, we consider all loops enclosed by loop-j, that is, loop-k. This reference contributes $RMS_a = 1 \cdot n/p = n/p$ pages, where 1 is the stride in loop-k and $n$ is the number of iterations of that loop.
Similarly, for the reference to array b, loop-i is the outermost loop carrying reuse, and we have to consider all loops enclosed by it, i.e. loop-j and loop-k. Each of those loops has $n$ iterations and the strides are 1 and $n$ respectively. The number of pages (ignoring boundary conditions) is $RMS_b = n(n/p)$, that is, the number of iterations of loop-j multiplied by the number of pages referenced in loop-k.
The calculation of the RMS for the reference to array c is similar to that of $RMS_a$. This time the stride in the innermost loop is 0. Hence, $RMS_c = 0 \cdot n/p = 0$. Because we need at least one page to keep the current element of c in memory, we take $RMS_c = 1$.
The resident memory size for all three arrays in this example is $RMS = RMS_a + RMS_b + RMS_c$. Hence,
$$RMS = \frac{n}{p} + \frac{n^2}{p} + 1.$$
The result is the same as the formula shown earlier in this section for an ideal page replacement algorithm. The limitation of the above algorithm is that it is very conservative. While the RMS value obtained for regular problems should work well in practice, it may not be a good approximation for irregular problems.
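The page-counting step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that a loop carries reuse when its stride is smaller than the page size, and that the loops enclosed by the outermost reuse-carrying loop touch a dense region, so the page count is the number of distinct elements divided by $p$. The function names are hypothetical.

```python
def reference_pages(strides, bounds, p):
    """Pages contributed by one array reference.

    strides, bounds: stride and trip count of every enclosing loop,
    outermost loop first. p: number of array elements per page.
    """
    # Outermost loop carrying reuse: temporal (stride 0) or spatial
    # (stride smaller than a page). If none is found we fall back to the
    # innermost loop, which is an assumption of this sketch.
    outer = next((d for d, s in enumerate(strides) if abs(s) < p),
                 len(strides) - 1)
    # Distinct elements touched by the loops enclosed by that loop.
    elements = 1
    for s, trip in zip(strides[outer + 1:], bounds[outer + 1:]):
        if s != 0:
            elements *= trip
    # Round up to whole pages; at least one page holds the current element.
    return max(1, -(-elements // p))


def rms_pages(references, bounds, p):
    """Sum of the per-reference contributions for one loop nest."""
    return sum(reference_pages(s, bounds, p) for s in references)


n, p = 1000, 100
refs = [(n, 0, 1),   # a: strides in loop-i, loop-j, loop-k
        (0, 1, n),   # b
        (n, 1, 0)]   # c
total = rms_pages(refs, (n, n, n), p)
```

For these illustrative values the three references contribute $n/p = 10$, $n^2/p = 10000$ and 1 pages respectively.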
Combined effect of processor and memory heterogeneity
In this subsection we point out how to run large problem instances efficiently on a particular configuration of the NOW. We look at the interaction of the normalized processor speed and the resident memory size, both of which are application dependent, and show their combined effect on scheduling. Deciding on the largest problem instance to be solved is a subtle issue. It depends on a number of criteria, such as how long we are willing to wait or what measure of efficiency we desire. In this subsection, we will not deal with the problem of finding the largest problem instance to solve. Instead, we will look at how we might achieve good performance, i.e. minimal execution time, for program instances where the RMS value exceeds the memory available to a user application on at least one processor in the NOW. Table 5 shows the results obtained for MXM (2788 × 2788) on a configuration of SPARC 10 and SPARC 5 workstations. The SPARC 5 has approximately one quarter of the memory of the SPARC 10 (Table 3). We first ran the program by distributing the work based on the NPS values, but the RMS (28.8 Mb) exceeded the memory on the SPARC 5 (24.5 Mb) and caused the machine to thrash. We had to stop the execution. We then distributed the data so that the RMS on the SPARC 5 was equal to the available memory (see under NPS + RMS). We also used the memory ratio of the machines to schedule the work (see under MEM); however, this results in a load imbalance, as more work is assigned to the SPARC 10, and thus it takes a longer time to complete. We can clearly see that the execution time obtained by using both the NPS and RMS values is the best, while using just the NPS values we could not even run on the chosen data size.
Scheduling algorithm
We first try to distribute the data among the processors in proportion to their NPS values for the particular application under consideration, using the algorithms from the previous sections. We also calculate the RMS value for the program. Using this RMS value and the user-available memory we determine whether we exceed the memory on any processor, and redistribute the excess amount among the other processors by recursively applying the same technique. The schedule obtained in this way tries to respect the processor speed ratios, and even when memory becomes a factor, it tries to be as close to the processor speed ratios as possible, while satisfying the memory constraints. This approach should give near-optimal performance for a given
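The redistribution described in this subsection can be sketched as a capped proportional split. This is a minimal sketch under stated assumptions: `speeds` holds the NPS values, and `max_units[i]` is a hypothetical input giving the largest amount of work whose RMS still fits in the user-available memory of processor i. How that bound is derived from the RMS is not shown here.

```python
def memory_aware_schedule(total, speeds, max_units):
    """Split `total` units of work in proportion to `speeds`, never giving
    processor i more than max_units[i]; the excess is redistributed among
    the remaining processors by applying the same rule again."""
    alloc = [0.0] * len(speeds)
    active = set(range(len(speeds)))
    remaining = float(total)
    while active:
        speed_sum = sum(speeds[i] for i in active)
        share = {i: remaining * speeds[i] / speed_sum for i in active}
        over = [i for i in active if share[i] > max_units[i]]
        if not over:
            # Every remaining processor can hold its proportional share.
            for i in active:
                alloc[i] = share[i]
            return alloc
        for i in over:
            # Cap the processor at its memory limit and take it out.
            alloc[i] = float(max_units[i])
            remaining -= max_units[i]
            active.remove(i)
    if remaining > 1e-9:
        raise ValueError("problem does not fit in the memory of the configuration")
    return alloc


# Illustrative numbers only: a fast machine with ample memory and a slow
# machine that can hold at most 200 units.
schedule = memory_aware_schedule(1200, speeds=[3.0, 1.0], max_units=[1200, 200])
```

In this example the purely proportional split would be 900 and 300 units; the second processor is capped at 200 and the excess moves to the first.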
To eliminate load imbalance caused by different communication startup times, $\alpha_i$, we find the processor with the largest value of $\alpha_i$, and we add extra iterations to processors with shorter times.
Let
The number of extra iterations for a given processor is
We can now use the algorithm from Subsection 6.2 on the remaining $n - \sum_{k=1}^{p} e_k$ iterations to obtain $z_i$, and assign $z_i + e_i$ iterations to every processor.
The solution presented in this section is not necessarily optimal, but the schedule found by this algorithm is very close to the optimum. The work allocated to any processor differs by at most two iterations from the work corresponding to the perfect load balance.
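The definitions of the extra iterations are not reproduced in this copy, so the sketch below rests on an assumption: processor i receives as many extra iterations as fit into the gap between the largest startup time and its own, given a per-iteration time `iter_time[i]`. Both `iter_time` and `base_schedule` (standing in for the algorithm of Subsection 6.2) are hypothetical names.

```python
import math


def startup_padding(alpha, iter_time):
    """Extra iterations e_i that fit into the startup gap of processor i.

    alpha: communication startup time of every processor.
    iter_time: time of one iteration on every processor (assumed input).
    """
    alpha_max = max(alpha)
    return [math.floor((alpha_max - a) / t) for a, t in zip(alpha, iter_time)]


def schedule_with_padding(n, alpha, iter_time, base_schedule):
    """Give out the extra iterations first, schedule the rest with
    base_schedule, and return z_i + e_i for every processor."""
    e = startup_padding(alpha, iter_time)
    z = base_schedule(n - sum(e))
    return [zi + ei for zi, ei in zip(z, e)]


# Illustrative values only.
extra = startup_padding([10.0, 4.0, 1.0], [2.0, 2.0, 3.0])
```

The processor with the largest startup time always receives no extra iterations, since its gap is zero.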
Heterogeneous parallel loops: with communication
To schedule a heterogeneous loop with communication, we again transform it into a homogeneous parallel loop. Transforming the loop first and then applying the algorithm from Subsection 8.1 would cause iterations of the transformed loop to be used as extra iterations to balance the communication initialization cost. However, those iterations have a higher cost, both in terms of computation and communication, than any single iteration of the original loop. It is therefore desirable to eliminate the initialization imbalance first, and then to transform the remaining iterations into a homogeneous parallel loop.
Because differences between the parameters $\alpha_i$ may be small in some cases, we want to use the iterations with the smallest possible cost. Note that by cost, in this context, we mean only the time contributed by an iteration, without the communication startup cost. The cost of an iteration $i$ on a processor $j$ is given by
To ensure the use of the shortest possible iterations, we should use iterations from the beginning of the iteration space if $(\gamma_j a + \beta_j c) > 0$, since in this case $g_{i,j}$ is an increasing function of $i$, and from the end of the iteration space otherwise. The sign of $(\gamma_j a + \beta_j c)$ is machine dependent, so for some processors we should allocate iterations from the beginning of the iteration space, but for other processors, from the end. This general case can be handled by an extended version of our algorithm. In practice, however, constants $a$ and $c$ have the same sign, which implies the same sign for $(\gamma_j a + \beta_j c)$ no matter what the machine parameters are ($\gamma_j$ and $\beta_j$ are always positive). Therefore, here we will describe a solution for this simpler case only.
We will start assigning extra iterations from iteration 1 upwards if $(\gamma_j a + \beta_j c) > 0$ and from iteration $n$ downwards otherwise. Since the two cases are very similar, we will describe the first one only.
To simplify the notation, let us introduce
We will compute $e_i$ in order: $e_1, e_2, \ldots, e_p$. For a given processor $i$, we choose the maximum $e_i$ such that
does not exceed $(\alpha_j - \alpha_i)$; that is, the time taken to execute the extra iterations, $k$, on processor $i$ is less than or equal to the initialization imbalance for that processor.
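The greedy choice of the extra iterations can be sketched as follows for the case where iterations are taken from iteration 1 upwards. The cost formula is not reproduced in this copy, so the sketch takes it as a caller-supplied function `cost(k, i)`, a hypothetical name for the time of iteration k on processor i, and treats the largest startup time as the reference.

```python
def assign_extra_iterations(alpha, n, cost):
    """For processors 1..p in order, take as many consecutive iterations
    from the front of the iteration space as fit into the startup gap.

    alpha: communication startup time of every processor.
    n: number of iterations of the original loop (numbered from 1).
    cost: cost(k, i) -> time of iteration k on processor i (assumed input).
    Returns the list of counts e_i.
    """
    alpha_max = max(alpha)
    next_iter = 1          # first iteration not yet handed out
    extra = []
    for i, a in enumerate(alpha):
        budget = alpha_max - a
        spent = 0.0
        count = 0
        while (next_iter + count <= n
               and spent + cost(next_iter + count, i) <= budget):
            spent += cost(next_iter + count, i)
            count += 1
        extra.append(count)
        next_iter += count
    return extra


# Illustrative cost: iteration k takes k time units on every processor.
e = assign_extra_iterations([5.0, 0.0], 10, lambda k, i: float(k))
```

Here the first processor has no gap to fill, and the second can absorb iterations 1 and 2 (total cost 3) but not iteration 3.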
The next step transforms the parallel loop from $E_p$ to $n$ into a homogeneous loop. In effect every processor has two sets of iterations to execute:
• iterations $E_{i-1} + 1, \ldots, E_i$ of the original loop, and
As in the previous section, the schedule found here is suboptimal, although very close to the perfect balance. The difference for any processor between this schedule and the perfect balance again does not exceed two iterations.
SCHEDULING FOR CONTENTION AVOIDANCE
Sections 5, 6 and 8 considered a machine model which allowed messages sent from different machines to travel in the network at the same time, in parallel. On many existing parallel machines, for instance on a network of workstations using Ethernet as the interconnect, the performance will suffer if many messages are being sent at the same time. On such parallel multicomputers, it is desirable to schedule a parallel program in such a way that only one processor (workstation) sends a message at a given time.
We assume that the machines effectively sequentialize all messages; that is, at any given time only one message can be in transit in the physical medium, which is not accessible to the other processors until the send operation is completed. This model should be a good approximation of many bus-based multicomputers.
Programs running on machines that sequentialize communication need a different set of optimizations. In this section we will describe a method to minimize execution time of a homogeneous parallel loop on a homogeneous multicomputer.
The model
We will extend the machine model discussed in the previous sections. The following parameters describe every processor in the parallel machines considered in this section:
• $\gamma$, the time to execute one operation;
• $\alpha'$, $\beta'$, communication parameters for the part of the send operation performed locally (this is the part of communication that is not sequentialized);
• $\alpha''$, $\beta''$, communication parameters for the part of the send operation that requires access to a shared physical medium and which is sequentialized.
As before a homogeneous parallel loop is described by the following two parameters: x, the number of operations to be performed in one iteration and y, the length of the message caused by a single iteration, but as in the earlier sections messages from all iterations assigned to a given processor are combined and sent as one larger message.
There are $p$ processors. Processor $i$ works on $z_i$ iterations. If we assume that the message can be sent immediately with no contention (without waiting for other processors to free the shared communication medium), the total time to execute $z_i$ iterations and broadcast the message resulting from these iterations is
$$t'_i + t''_i,$$
where $t'_i = z_i x\gamma + \alpha' + z_i y\beta'$ is the work that can be performed locally without interference with other processors, and $t''_i = \alpha'' + z_i y\beta''$ is the part of communication that has to be sequentialized.
In reality, every processor first performs local operations for time $t'_i$, and then waits until the shared medium becomes free and sends its data in time $t''_i$. During this period $t''_i$, other processors cannot send anything.
Without loss of generality, assume that processor $i$ broadcasts its message before processor $i + 1$. This is justified, because all processors are identical and their ordering is arbitrary. With this assumption, we can give the real time that processor $i$ spends on computation and communication:
$$T_i = \max(t'_i, T_{i-1}) + t''_i,$$
where $T_0 = 0$. This formula expresses the simple fact that a processor cannot begin accessing the shared medium before its computation has completed ($t'_i$), or before its predecessor has released the communication channel ($T_{i-1}$).
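The recurrence described here (a processor starts sending only once both its local work and its predecessor's send have finished) can be simulated directly. In this sketch the parameter names `alpha_local`, `beta_local`, `alpha_seq` and `beta_seq` are hypothetical labels for the local and sequentialized communication parameters of the model.

```python
def completion_times(z, x, y, gamma, alpha_local, beta_local,
                     alpha_seq, beta_seq):
    """Completion time of every processor when sends are sequentialized.

    z: iterations assigned to each processor, in sending order.
    x, y: operations and message length per iteration.
    """
    times = []
    previous = 0  # T_0 = 0
    for zi in z:
        t_local = zi * x * gamma + alpha_local + zi * y * beta_local
        t_seq = alpha_seq + zi * y * beta_seq
        # Wait for own computation and for the predecessor's send.
        previous = max(t_local, previous) + t_seq
        times.append(previous)
    return times


# Illustrative values: an equal split over two identical processors.
equal_split = completion_times([4, 4], x=1, y=1, gamma=1,
                               alpha_local=0, beta_local=0,
                               alpha_seq=1, beta_seq=1)
```

With these illustrative values the second processor finishes its local work at time 4 but cannot start sending until time 9, which is the waiting that a contention-aware schedule tries to remove.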
Optimal schedule
A simple-minded strategy would assign the same number of iterations to every processor, since all processors have the same speed and all iterations have the same cost. This strategy would cause each processor, except the first one, to wait for the communication channel. Moreover, every processor would wait longer than its predecessor. If we define the total execution time to be the time when the last send completes, the execution time achieved by this strategy is not optimal. This fact is illustrated in Figure 5b.
We show a static scheduling strategy that is optimal in that it results in the shortest possible execution time on a given number of processors (that is, all available processors are used). 
Proof (sketch).
The total time spent on the sequentialized part of communication is the same for every schedule and is equal to $\sum_{k=1}^{p} t''_k = p\alpha'' + ny\beta''$. Consider a schedule such that $t'_i = T_{i-1}$, for $i = 2, \ldots, p$. Let us call it a contention-free schedule. We will show that any change in this schedule will increase the execution time.
Consider a new schedule, in which processor $i$ broadcasts its message before processor $i + 1$ (this can be assumed without loss of generality, because all processors are identical). Let $i$ be the first processor whose local time $t'_i$ differs from the local time under the contention-free schedule.
Note that the local time $t'_i$ and the communication time $t''_i$ are related, and any change in the number of iterations assigned to $i$ will change both times for processor $i$. There are two cases:
1. The local time under the new schedule is longer than the local time under the contention-free schedule. The sum of the remaining communication times (including processor $i$), $\sum_{k=i}^{p} t''_k$, is the same as in the contention-free schedule. In the contention-free schedule this was also the time left to the completion of the execution. In the new schedule, because $t'_i$ has increased, we have to start communication for processor $i$ later than in the contention-free schedule. The time remaining from that point is at least the same as that for the contention-free schedule, so the execution time will be longer.
2. The local time under the new schedule is shorter. We are left with some extra work that processor $i$ did in the contention-free schedule. If all processors $j$, such that $j > i$, have the same amount of work as they had under the contention-free schedule, this extra work will be left over. Hence, one of the remaining processors has to perform this additional work, increasing the execution time.
We show below an algorithm that finds a contention-free schedule if one exists. We can use Theorem 9.1 to simplify the formula for the completion time of the $i$th processor to $T_i = t'_i + t''_i$. We can use this formula to find the optimal schedule. A schedule is defined by the set of $z_i$, for
The solution to this system of equations defines the optimal schedule.
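The system of equations itself is not reproduced in this copy. As a hedged reconstruction, requiring that each processor's local work ends exactly when its predecessor's send completes gives $w z_i = v z_{i-1} + \alpha''$ for $i = 2, \ldots, p$, together with $\sum_i z_i = n$, where $w = x\gamma + y\beta'$ and $v = w + y\beta''$. The sketch below solves this by writing every $z_i$ as a linear function of $z_1$. Parameter names are hypothetical, and the result is fractional, so a real schedule would still need rounding.

```python
def contention_free_schedule(n, p, x, y, gamma, beta_local,
                             alpha_seq, beta_seq):
    """Solve w*z_i = v*z_{i-1} + alpha_seq (i = 2..p) with sum(z) = n.

    Assumes w > 0. Returns fractional iteration counts; negative entries
    mean that no contention-free schedule exists for these parameters.
    """
    w = x * gamma + y * beta_local
    v = w + y * beta_seq
    # z_i = c[i] * z_1 + d[i]
    c, d = [1.0], [0.0]
    for _ in range(p - 1):
        c.append(c[-1] * v / w)
        d.append(d[-1] * v / w + alpha_seq / w)
    z1 = (n - sum(d)) / sum(c)
    return [ci * z1 + di for ci, di in zip(c, d)]


# Illustrative values: two processors, eight iterations.
z = contention_free_schedule(8, 2, x=1.0, y=1.0, gamma=1.0,
                             beta_local=0.0, alpha_seq=1.0, beta_seq=1.0)
```

Later processors receive more iterations than earlier ones, so that each finishes computing just as the shared medium becomes free.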
Validity of the solution
We will now show that the above system of equations (1) has a unique solution. The coefficient matrix of this system of equations is
where $M_p$ is a $p \times p$ matrix. Its determinant is equal to
Note that we have not shown that the solution is always valid: a solution may contain negative $z_i$. This corresponds to a set of parameters with a very high relative startup communication cost. However, if a contention-free schedule exists, the solution to the above system of equations will describe this optimal schedule.
In practice the solution is always positive. Let us consider an example with two processors:
hence $z_1$ is negative if and only if $\alpha'' > wn = (x\gamma + y\beta')n$. This condition would be true if the startup time for a send, $\alpha''$, was longer than all the computation in the loop, $(x\gamma n)$. Clearly, we do not want to parallelize a loop like that in the first place.
It is also worth noting that, for a given solution, if $z_1 > 0$, then $z_j > 0$ for all $j > 1$. This is true because $w z_i = v z_{i-1} + \alpha''$; that is, if $z_{i-1}$ is positive, then $z_i$ is positive too.
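The two-processor example is not reproduced in this copy. Assuming it consists of $z_1 + z_2 = n$ together with the relation $w z_2 = v z_1 + \alpha''$, where $\alpha''$ denotes the startup cost of the sequentialized part of a send, $w = x\gamma + y\beta'$ and $v = w + y\beta''$, it can be worked out as follows:

```latex
\begin{align*}
  w (n - z_1) &= v z_1 + \alpha'' \\
  z_1 &= \frac{wn - \alpha''}{v + w}, \qquad
  z_2 = n - z_1 = \frac{vn + \alpha''}{v + w}.
\end{align*}
```

Since $v + w > 0$, $z_1$ is negative exactly when $\alpha'' > wn$, which matches the condition stated in the text, while $z_2$ is always positive.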
EXPERIMENTAL EVALUATION
To verify the proposed scheduling techniques, we conducted experiments and measured the execution time and the speedup of several applications. Where appropriate, we also compare our approach with straightforward scheduling. The results of our experiments are encouraging. Our techniques show significant performance improvements over traditional approaches.
The rest of this section is organized similarly to the whole paper. First we compare our approach to scheduling heterogeneous loops on homogeneous processors with the popular round-robin load-balancing technique. We did not run experiments for homogeneous loops on homogeneous processors, as scheduling those is easy and well understood. Then we give results for scheduling both homogeneous and heterogeneous loops on heterogeneous processors. The last part of this section gives results for our approach to contention avoidance. The experiments for calculating the NPS, and for scheduling in the presence of memory heterogeneity were presented in Section 4 and Subsection 7.2 respectively.
All our experiments were performed on a network of Sun workstations (SPARC 1, LX, 5 and 10), interconnected via an Ethernet LAN. PVM (parallel virtual machine) [28], a message passing software system mainly intended for network-based distributed computing, was used to parallelize the applications. The latency obtained with PVM is ∼2414.5 µs and the bandwidth is ∼0.96 Mb s$^{-1}$. We assume that there is no external load on the processors or network, i.e. the NOW is used in a dedicated user mode.
Applications
The applications used for our experiments are:
• Matrix multiply (MXM). Multiplication of two square matrices.
• 2D-FFT. Two-dimensional fast Fourier transformation.
• Cholesky factorization (CHO). Find a lower triangular matrix $L$ with positive diagonal elements such that $A = LL^T$, where $A$ is a dense symmetric positive-definite matrix.
• Spatial price equilibrium modelling (ECO). A commodity trade model [8]: for a set of supply and demand markets with given tariffs, transportation costs, supply and demand price functions, this program finds the amount of goods shipped between different markets.
• TRIANG. This is a synthetic program [13], which has a loop nest with varying computation in each iteration of the outermost loop.
• Livermore Fortran kernel (LFK). Modified loop 10 from Livermore Fortran kernels [13].
Scheduling for heterogeneous programs
Triang [15] is a program with a heterogeneous loop used in the experiments presented in this section. Figure 6 shows the speedups for two different parallelizations of Triang. The label bitonic marks the results for the parallelization from Section 5. We compare our approach with round-robin scheduling. The round-robin technique schedules a doall loop on p processors by assigning iterations 0, 0 +
In this experiment, we assume that the arrays are not distributed before and after the loop nest. Therefore, our timings include the time required to send out necessary data to all processors and to gather results from all participating processors. Because communication in a network of workstations is very expensive, the speedups are not close to the optimum. Figure 6 demonstrates that the bitonic schedule consistently outperforms the round-robin technique.
Scheduling for heterogeneous processors
We have chosen three applications to measure performance of scheduling in a heterogeneous environment. Matrix multiply and economics are examples of a homogeneous loop. Triang is an example of a heterogeneous loop.
Homogeneous loops: no communication
To find a static schedule on a network of heterogeneous computers, we use the normalized processor speeds, which are shown in Table 6, for the MXM program.
We can use normalized speeds to compute a 'speedup' for a heterogeneous machine configuration. We can define this generalized speedup to be the ratio of the uniprocessor execution time on the base processor to the execution time of the parallel program. We can also define the 'ideal' speedup for a particular configuration to be the sum of the normalized speeds of all processors in that configuration. The results for matrix multiply are given in Figure 7 and Table 7. The program multiplies two square matrices of size 600 × 600. The configuration column describes how many machines of a given type were used in the experiment. Type 1 ($T_1$) is SPARCstation 1, type 2 ($T_2$) is SPARCstation LX and type 3 ($T_3$) is SPARCstation 10. The architecture-oblivious schedule assigns the same number of iterations to every processor. The architecture-conscious schedule assigns a number proportional to the processor speed.
As expected, the results show that the architecture-conscious schedule is always better than the architecture-oblivious one. For some configurations the difference is not significant, for others it is very large. Intuitively, the slowest machine's execution time will dominate the time for the whole program. So, a configuration with many fast machines and few slow ones will suffer most from architecture-oblivious scheduling. If, on the other hand, a configuration contains mostly slow machines, architecture-conscious scheduling will not improve the execution time significantly.
There is one more interpretation for the sum of normalized speeds. It says how many base processors would be equivalent in speed to a particular configuration. Note that this number can be fractional, so for example the first configuration in Table 7 is equivalent to 2.85 base processors. We can use this observation to plot the speedup as a function of the number of processors. This approach gives a concise visualization of the parallel performance, but we should not overestimate its accuracy. In particular, there may be many different configurations with the same base processor equivalent, but their speedups may be different.
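The two speedup measures and the two schedules compared here can be written down in a few lines. This is a small sketch with hypothetical function names; the normalized speeds used in the example are made-up values, not those of Table 6.

```python
def generalized_speedup(base_uniprocessor_time, parallel_time):
    """Uniprocessor time on the base processor over the parallel time."""
    return base_uniprocessor_time / parallel_time


def ideal_speedup(counts, normalized_speeds):
    """Sum of the normalized speeds of all machines in a configuration,
    i.e. the number of equivalent base processors.

    counts[t]: number of machines of type t in the configuration.
    normalized_speeds[t]: NPS of a machine of type t.
    """
    return sum(c * s for c, s in zip(counts, normalized_speeds))


def oblivious_shares(n, machine_speeds):
    """Architecture-oblivious: the same share for every machine."""
    return [n / len(machine_speeds)] * len(machine_speeds)


def conscious_shares(n, machine_speeds):
    """Architecture-conscious: shares proportional to machine speed."""
    total = sum(machine_speeds)
    return [n * s / total for s in machine_speeds]


# Made-up normalized speeds for three machine types, one machine each.
equivalent = ideal_speedup([1, 1, 1], [1.0, 1.5, 2.0])
shares = conscious_shares(600, [1.0, 2.0, 3.0])
```

With the oblivious split the slowest machine determines the finishing time, which is why configurations with many fast machines and few slow ones benefit most from the proportional split.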
Homogeneous loops: with communication
The second example of the homogeneous loop case is a program for spatial price equilibrium modelling in economics, the ECO [8] application. This program has a set of parallel loops. However, parallelization of the program on a network of workstations is a non-trivial task, since communication is required across the loops and data has to be broadcast between the loops. Figure 8 shows the speedups obtained on a variety of heterogeneous configurations of machines. The architecture-conscious schedules consistently outperform the architecture-oblivious schedules. In spite of the large amount of communication in the program and the high cost of network communication, a satisfactory parallel performance was achieved.
Heterogeneous parallel loops
For the heterogeneous loop case, we parallelized the Triang program. We can verify the performance of the parallelized code by plotting the normalized speedups and speedups for homogeneous configurations that consist of base processors.
We can see that for bitonic scheduling (Figure 9) the performance of the heterogeneous parallelization is close to the homogeneous case.
The comparison with the homogeneous case is an interesting metric. Since in most cases, the architecture-conscious heterogeneous scheduling would be much better than the naive architecture-oblivious scheduling, the architecture-conscious homogeneous case provides an upper bound of how well a heterogeneous solution can perform.
Scheduling for contention avoidance
We show experimental results for contention avoidance scheduling on the LFK program. The outermost loop is a doall loop and it is being parallelized. We assume, however, that for the next stage of computation the array PX must be broadcast to all processors. This causes a high level of contention in our Ethernet network. We can use the algorithm developed in Section 9 to maximize the speedup by minimizing contention. Figure 10 shows the performance of two parallelizations of the modified LFK 10 nest. We can see that for a small number of processors contention is not a very big problem. But as the number of processors increases, performance of the simple parallelization deteriorates very quickly. The architecture-conscious schedule results in a significantly faster program.
For this example the speedups achieved by the architecture-conscious schedule are not very good, which is generally true about programs with excessive communication. We may expect that if a program exhibits contention, the technique presented in Section 9 will improve its performance, but the speedup will always be significantly worse than the optimum. The reasons for using this technique, even though it is inherently suboptimal, are:
• We get a better performance than the sequential program. If we need high performance at any cost, we may choose to parallelize such a code even if we know that the machine will be underutilized.
• Most real applications have many phases in the program. If most of the phases can be parallelized, then it is better for data to remain distributed. Therefore, parallelization of code fragments that do not display great parallelism/speedups is still necessary to maintain data locality and reduce communication in the later phases, since data would have to be on a single node if the code fragment were sequentialized.
CONCLUSIONS
In this paper, we looked at the general issues in heterogeneous computing, and we studied the problem of scheduling parallel loops at compile time for a heterogeneous network of machines. We proposed a simple yet comprehensive model for a network of processors. To model heterogeneous processors, we introduced the parameter, normalized processor speed, which is highly application dependent. To solve large problem instances, we need an estimate of the memory requirement of an application. We proposed a new estimate of this requirement, the resident memory size. Finally, we developed compiler algorithms for generating optimal and near-optimal schedules of loops for load balancing, communication optimizations and network contention, in the presence of program, processor, memory and network heterogeneity. Our experiments showed that the new techniques can significantly improve the performance of parallel loops over existing techniques.
