In this paper, we address the problem of optimal distribution of computational tasks on a network of heterogeneous computers when one or more tasks do not fit into the main memory of the processors and when relative speeds vary with the problem size. We propose a functional performance model of heterogeneous processors that integrates many essential features of a network of heterogeneous computers having a major impact on its performance, such as the processor heterogeneity, the heterogeneity of memory structure, and the effects of paging. Under this model, the speed of each processor is represented by a continuous function of the size of the problem, whereas traditional models use single numbers to represent the speeds of the processors. We formulate a problem of partitioning of an n-element set over p heterogeneous processors using this model and design an algorithm of complexity $O(p \times \log_2 n)$ solving the problem.
Introduction
In this paper, we deal with the problem of optimal distribution of computational tasks across heterogeneous computers when one or more tasks do not fit into the main memory of the processors and when relative speeds vary with the problem size.
A number of algorithms for the parallel solution of scientific and engineering problems on heterogeneous networks of computers (HNOCs) have been designed and implemented (Quinn 1993, 1995; Beaumont et al. 2001b; Kalinov and Lastovetsky 2001). They use different performance models of HNOCs, but all the models represent the speed of a processor by a single positive number, and computations are distributed over the processors such that their volume is proportional to this speed of the processor. Cierniak et al. (1997) use the notion of normalized processor speed (NPS) in their machine model to solve the problem of scheduling parallel loops at compile time for HNOCs. NPS is a single number and is defined as the ratio of the time taken to execute on the processor under consideration to the time taken on a base processor. In Beaumont et al. (2001b) and Petitet and Dongarra (1999), normalized cycle-times are used, i.e. application-dependent elemental computation times, which are computed via small-scale experiments (repeated several times, with an averaging of the results). Several scheduling and mapping heuristics have been proposed to map task graphs onto HNOCs (Tan et al. 1997; Maheswaran and Siegel 1998; Iverson and Ozguner 1998). These heuristics employ a model of a heterogeneous computing environment that uses a single number for the computation time of a subtask on a machine. Yan, Zhang and Song (1996) use a two-level model to study performance predictions for parallel computing on HNOCs. The model uses two parameters to capture the effects of an owner workload. These are the average execution time of the owner task on a machine and the average probability of the owner task arriving on a machine during a given time step.
Thus traditional heterogeneous parallel and distributed algorithms implicitly assume that the relative speed of the processor does not depend on the size of the computational task solved by the processor. This assumption can be quite satisfactory if the code executed by the processors fully fits into the main memory. But as soon as the restriction is relaxed, it will not be true.
First of all, the processors may have significantly different sizes of main memory and the partitioning of the problem may result in some computational tasks not fitting into the main memory of the assigned processor. In this case, solution of the computational task of any fixed size does not guarantee accurate estimation of the relative speed of the processors. The point is that beginning from some problem size, the task of the same size will still fit into the main memory of some processors and stop fitting into the main memory of others, causing the paging and visible degradation of the speed of these processors. This means that their relative speed will start significantly changing in favor of non-paging processors as soon as the problem size exceeds the critical value.
Secondly, even if the processors of different architectures have almost the same size of main memory, they may employ different paging algorithms resulting in different levels of speed degradation for the task of the same size, which again means the change of their relative speed as the problem size exceeds the threshold causing the paging.
Thus, taking account of memory heterogeneity and the effects of paging significantly complicates the design of algorithms distributing computations in proportion with the relative speed of heterogeneous processors. One approach to this problem is just to avoid the paging as it is normally done in the case of parallel computing on homogeneous multi-processors. However, avoiding paging in local and global heterogeneous networks may not make sense because in such networks it is likely that one processor with paging will be running faster than other processors without paging. It is even more difficult to avoid paging in the case of distributed computing on global networks. There may not be a server available to solve the task of the size you need without paging.
Therefore, to achieve acceptable accuracy of distribution of computations across heterogeneous processors in the possible presence of paging, a more realistic performance model of a set of heterogeneous processors is needed. In this paper, we suggest a model where the speed of each processor is represented by a continuous function of the problem size. This model is application centric in the sense that, generally speaking, different applications will characterize the speed of the processor by different functions.
The rest of the paper is organized as follows. In Section 2, we present the functional performance model. In Section 3, we investigate the simple problem of optimal partitioning of an n-element set over p heterogeneous processors with the functional model and design an algorithm of complexity $O(p \times \log_2 n)$ for its solution. We then apply this set partitioning algorithm to multiplication of large matrices on a cluster of heterogeneous computers. In Section 4, we present results of experiments with this application. Section 5 concludes the paper.
Functional Performance Model
Under the functional performance model, the speed of each processor is represented by a continuous function of the problem size.
The speed is defined as the number of computation units performed by the processor per one time unit. The model is application specific. In particular, this means that the computation unit can be defined differently for different applications. The important requirement is that the computation unit must not vary during the execution of the application. An arithmetical operation and the matrix update a = a + b × c, where a, b, and c are r × r matrices of fixed size r, are examples of computation units.
The problem size is understood as a set of one, two or more parameters characterizing the amount and layout of data stored and processed during the execution of the computational task [as compared with the notion of problem size as the number of basic computations in the best sequential algorithm to solve the problem on a single processor (Kumar et al. 1994)]. The number and semantics of the problem size parameters are problem- or even application-specific. It is assumed that the amount of stored data will increase with the increase of any of the problem size parameters.
For example, the size of the problem of multiplication of two dense square n × n matrices can be represented by one parameter, n. During solution of the problem, three matrices will be stored and processed. So the total number of elements to store and process will be $3 \times n^2$. In order to compute one element of the resulting matrix, the application uses n multiplications and (n − 1) additions. So, in total $(2 \times n - 1) \times n^2$ arithmetical operations are needed to solve the problem. If n is large enough, the number can be approximated by $2 \times n^3$. Alternatively, a combined computation unit, which is made up of one addition and one multiplication, can be used to express the volume of computation needed to multiply two large square n × n matrices. In this case, the total number of computation units will be approximately equal to $n^3$. Therefore, the speed of the processor demonstrated by the application when solving the problem of size n can be calculated as $n^3$ (or $2 \times n^3$) divided by the execution time of the application. This gives us a function from the set of natural numbers representing problem sizes into the set of nonnegative real numbers representing speeds of the processor, $f: N \to R_+$. The functional performance model of the processor is obtained by continuous extension of function $f: N \to R_+$ to function $g: R_+ \to R_+$ ($f(n) = g(n)$ for any n from N).
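The construction above can be illustrated with a minimal Python sketch. It is not the measurement procedure used in the paper: the pure-Python triple loop, the function names `measure_speed` and `continuous_extension`, and the choice of piecewise linear interpolation for the continuous extension are all assumptions made for illustration. The extension agrees with the measured values only at the sampled problem sizes.

```python
import time
from bisect import bisect_right

def naive_matmul(a, b):
    """Straightforward triple-loop product of two square matrices (lists of lists)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

def measure_speed(n):
    """Return f(n): combined computation units (about n^3) per second for size n."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    naive_matmul(a, b)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return n ** 3 / elapsed

def continuous_extension(sizes, speeds):
    """Build g on the positive reals by linear interpolation of the samples.

    `sizes` must be sorted in increasing order; outside the sampled range
    the nearest measured speed is used (an assumption of this sketch).
    """
    def g(x):
        if x <= sizes[0]:
            return speeds[0]
        if x >= sizes[-1]:
            return speeds[-1]
        j = bisect_right(sizes, x)
        x0, x1 = sizes[j - 1], sizes[j]
        y0, y1 = speeds[j - 1], speeds[j]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return g
```

In use, one would measure `measure_speed(n)` at a handful of sizes and pass the resulting samples to `continuous_extension` to obtain a callable speed function.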
Another example is the problem of multiplication of two dense rectangular n × k and k × m matrices. The size of this problem is represented by three parameters, n, k, and m. The total number of matrix elements to store and process is $(n \times k + k \times m + n \times m)$. The total number of arithmetical operations needed to solve this problem is $(2 \times k - 1) \times n \times m$. If k is large enough, the number can be approximated by $2 \times k \times n \times m$. Alternatively, a combined computation unit, which is made up of one addition and one multiplication, can be used to express this volume of computation. In this case, the total number of computation units will be approximately equal to $k \times n \times m$. Therefore, the speed of the processor exposed by the application when solving the problem of size (n, k, m) can be calculated as $k \times n \times m$ (or $2 \times k \times n \times m$) divided by the execution time of the application. This gives us a function, $f: N^3 \to R_+$, mapping problem sizes to speeds of the processor. The functional performance model of the processor is obtained by continuous extension of function $f: N^3 \to R_+$ to function $g: R_+^3 \to R_+$ ($f(n) = g(n)$ for any n from $N^3$).
Thus, under the proposed functional model, the speed of the processor is represented by a continuous function of the problem size. Moreover, we can make some further assumptions about the shape of the function. Namely, we can realistically assume that along each of the problem size variables, either the function is monotonically decreasing, or there exists a point x such that:
• On the interval [0, x], the function is
  • monotonically increasing,
  • concave, and
  • such that any straight line coming through the origin of the coordinate system intersects the graph of the function in no more than one point.
• On the interval [x, ∞), the function is monotonically decreasing.
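As an illustration only, the shape assumptions listed above can be checked on a finite set of measured points. The sketch below is not part of the paper's method: the function name `satisfies_shape_assumptions` is hypothetical, and the single-intersection condition is tested through the assumption that, for a positive sampled function, it corresponds to the ratio s(x)/x decreasing strictly from one sample to the next.

```python
def satisfies_shape_assumptions(xs, ss):
    """Check the assumed shape on samples (xs[i], ss[i]) with 0 < xs[0] < xs[1] < ...

    The largest sample is taken as the turning point x of the model.
    """
    peak = max(range(len(ss)), key=ss.__getitem__)

    # Monotonically increasing up to the peak, decreasing after it.
    rising = ss[:peak + 1]
    falling = ss[peak:]
    increasing = all(a <= b for a, b in zip(rising, rising[1:]))
    decreasing = all(a >= b for a, b in zip(falling, falling[1:]))

    # Concavity on the rising part: chord slopes must not grow.
    slopes = [
        (ss[i + 1] - ss[i]) / (xs[i + 1] - xs[i])
        for i in range(peak)
    ]
    concave = all(a >= b for a, b in zip(slopes, slopes[1:]))

    # A line through the origin has slope s(x) / x at its intersection point,
    # so distinct sample points must give distinct (here: decreasing) ratios.
    ratios = [s / x for x, s in zip(xs, ss)]
    single_crossing = all(a > b for a, b in zip(ratios, ratios[1:]))

    return increasing and concave and decreasing and single_crossing
```

A check of this kind can only reject a sampled curve; passing it on a finite sample does not prove that the underlying continuous function satisfies the assumptions.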
We have conducted numerous experiments with diverse scientific kernels and computers, and in all the experiments the speed of the processor could be approximated accurately enough by a function satisfying the above assumptions (within the accuracy of measurements). Some typical observed shapes of the speed function are given in this paper. An alternative approach is to use a piecewise constant function in order to represent the dependence of the speed of the processor on the problem size (Drozdowski and Wolniewicz 2003). There are at least two reasons behind the proposal to represent the speed of the processor by a continuous function of the problem size.
First of all, we want the model to adequately reflect the behavior of common, not very carefully designed applications. Consider the experiments with a range of applications using memory hierarchy in different ways that are presented by Lastovetsky and Twamley (2005) and shown in Figure 1. Carefully designed applications ArrayOpsF and MatrixMultATLAS, which efficiently use memory hierarchy, demonstrate quite a sharp and distinctive performance curve of dependence of the absolute speed on the problem size. For these applications, the speed of the processor can be approximated by a stepwise constant function of the problem size. At the same time, application MatrixMult, which implements a straightforward algorithm of multiplication of two dense square matrices and uses inefficient memory reference patterns, displays quite a smooth dependence of speed on the problem size. For such applications, the speed of the processor cannot be accurately approximated by a stepwise constant function. It should be approximated by a continuous function of the problem size if we want the performance model to be accurate enough.
The other main motivation is that we target common heterogeneous networks rather than dedicated high performance computer systems. A computer in such a network is persistently performing some minor routine computations and communications just as an integrated node of the network. Examples of such routine applications include e-mail clients, browsers, text editors, audio applications, etc. As a result, the computer will experience constant and stochastic fluctuations in the workload. This changing transient load will cause a fluctuation in the speed of the computer in the sense that the execution time of the same task of the same size will vary for different runs at different times. The natural way to represent the inherent fluctuations in the speed is to use a speed band rather than a speed function. The width of the band characterizes the level of fluctuation in the performance due to changes in load over time. The shape of the band makes the dependence of the speed of the computer on the problem size less distinctive and sharp even in the case of carefully designed applications efficiently using the memory hierarchy. Therefore, even for such applications the speed of the processor can be realistically approximated by a continuous function of the problem size.
Figure 2 shows experiments conducted with application MatrixMultATLAS on a set of computers whose specifications are shown in Table 1. The performance bands are obtained using the procedure given by Lastovetsky, Reddy and Higgins (2006). The application employs the level-3 BLAS routine dgemm (Dongarra et al. 1990) supplied by Automatically Tuned Linear Algebra Software (ATLAS; Whaley, Petitet and Dongarra 2000). ATLAS is a package that generates efficient code for basic linear algebra operations. The computers have varying specifications and varying levels of network integration and are representative of the range of computers typically used in networks of heterogeneous computers.
The problem of optimally scheduling divisible loads has been studied extensively and the theory is commonly referred to as divisible load theory (DLT). The main features of earlier works in DLT (Bharadwaj et al. 1996; Drozdowski and Wolniewicz 2003b) are that they assume distributed systems with a flat memory model and use a mathematical model where the speed of the processor is represented by a constant. Drozdowski and Wolniewicz (2003a) propose a new mathematical model that relaxes the above two assumptions. They study distributed systems, which have both the hierarchical memory model and a piecewise constant dependence of the speed of the processor on the problem size. However, the model they formulate is targeted mainly towards optimal distribution of arbitrary tasks for carefully designed applications on dedicated distributed multiprocessor computer systems, whereas our model is aimed towards optimal distribution of arbitrary tasks for any arbitrary application on common heterogeneous networks.
Distributing Independent Chunks
In this section, we study the problem of distributing independent chunks of computations over a unidimensional arrangement of heterogeneous processors. The form of presentation is very much inspired by that used in Beaumont et al. (2001a) to present the same problem but for heterogeneous processors whose performance is characterized by constants.
The problem is formulated as follows: Given n independent chunks of computations, each of equal size (i.e. each requiring the same amount of work), how can we assign these chunks to p (p < n) physical processors $P_1, P_2, \ldots, P_p$ of respective speeds $s_1(x), s_2(x), \ldots, s_p(x)$ so that the workload is best balanced? Here, the speed of the processor is understood as the number of computation chunks performed by the processor per one time unit. The speed depends on the number of chunks assigned to the processor and is represented by a continuous function $s: R_+ \to R_+$. How, then, do we distribute chunks to processors? Intuition says that the load $x_i$ of $P_i$ should be proportional to $s_i(x_i)$. Since the loads (i.e. numbers of chunks) on the processors must be integers, we use the following two-step algorithm to solve the problem. Let $n_i$ denote the number of chunks allocated to processor $P_i$. Then, the overall execution time obtained with allocation $(n_1, n_2, \ldots, n_p)$ is given by $\max_i \frac{n_i}{s_i(n_i)}$. The optimal solution minimizes the overall execution time.
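The objective being minimized can be written down directly. The following Python sketch assumes that each speed function is available as a callable returning chunks per time unit; the name `execution_time` and the two example speed functions are hypothetical and chosen only to show how a size-dependent speed changes the comparison between allocations.

```python
def execution_time(allocation, speeds):
    """Overall time of an allocation: the slowest processor determines the result.

    allocation[i] is n_i, speeds[i] is the callable s_i; idle processors
    (n_i == 0) contribute no time.
    """
    return max(
        (n_i / s(n_i) for n_i, s in zip(allocation, speeds) if n_i > 0),
        default=0.0,
    )

# Hypothetical speed functions: the first processor never pages, the second
# one slows down once it holds more than 10 chunks.
fast = lambda x: 100.0
paging = lambda x: 50.0 if x <= 10 else 500.0 / x

balanced = execution_time((20, 10), [fast, paging])  # both finish together
even = execution_time((15, 15), [fast, paging])      # the paging processor dominates
```

With these illustrative functions an even split is considerably slower than the split that keeps the second processor below its paging threshold, which is the effect the functional model is meant to capture.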
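Algorithm 1. Optimal distribution for n independent chunks over p processors of speeds $s_1(x), s_2(x), \ldots, s_p(x)$:
• Step 1. Initialization: We approximate the $n_i$ so that $\frac{n_1}{s_1(n_1)} \approx \frac{n_2}{s_2(n_2)} \approx \ldots \approx \frac{n_p}{s_p(n_p)}$ and $n - 2 \times p \le n_1 + n_2 + \ldots + n_p \le n$. Namely, we find $n_i$ such that either $n_i = \lfloor x_i \rfloor$ or $n_i = \lfloor x_i \rfloor - 1$ for $1 \le i \le p$, where $\frac{x_1}{s_1(x_1)} = \frac{x_2}{s_2(x_2)} = \ldots = \frac{x_p}{s_p(x_p)}$ and $x_1 + x_2 + \ldots + x_p = n$.
• Step 2. Refining: We iteratively increment some $n_i$ until $n_1 + n_2 + \ldots + n_p = n$.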
Approximation of the $n_i$ (Step 1) is not as easy as in the case of constant speeds $s_i$ of the processors, when $n_i$ can be approximated as $\left\lfloor \frac{s_i}{\sum_{k=1}^{p} s_k} \times n \right\rfloor$ (see Beaumont et al. 2001a).
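For comparison, the constant-speed case admits a closed-form first approximation. The Python sketch below assumes the usual proportional rule, in which each processor receives the integer part of its share n × s_i / (s_1 + … + s_p); the function name `constant_speed_approximation` is hypothetical. With functional speeds no such closed form is available, because each s_i itself depends on the unknown n_i.

```python
import math

def constant_speed_approximation(n, speeds):
    """First approximation for constant speeds: floor of the proportional share.

    The result sums to at most n and falls short by fewer than len(speeds)
    chunks, which a refining step then has to hand out.
    """
    total = sum(speeds)
    return [math.floor(n * s / total) for s in speeds]

shares = constant_speed_approximation(10, [3.0, 2.0, 1.0])
leftover = 10 - sum(shares)  # chunks still to be assigned by the refining step
```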
The algorithm which we propose is based on the following observation: If $\frac{x_1}{s_1(x_1)} = \frac{x_2}{s_2(x_2)} = \ldots = \frac{x_p}{s_p(x_p)}$, then all the points $(x_1, s_1(x_1)), (x_2, s_2(x_2)), \ldots, (x_p, s_p(x_p))$ lie on a straight line passing through the origin of the coordinate system, being the intersection points of this line with the graphs of the speed functions of the processors. This is shown in Figure 3. Our algorithm seeks two straight lines passing through the origin of the coordinate system so that:
• The "ideal" optimal line (that is, the line which intersects the speed graphs in points $(x_1, s_1(x_1)), (x_2, s_2(x_2)), \ldots, (x_p, s_p(x_p))$ such that $\frac{x_1}{s_1(x_1)} = \frac{x_2}{s_2(x_2)} = \ldots = \frac{x_p}{s_p(x_p)}$ and $x_1 + x_2 + \ldots + x_p = n$) lies between the two lines.
• There is no more than one point with integer x coordinate on each of these graphs between the two lines.
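Algorithm 1.1. Approximation of the $n_i$ so that either $n_i = \lfloor x_i \rfloor$ or $n_i = \lfloor x_i \rfloor - 1$ for $1 \le i \le p$, where $\frac{x_1}{s_1(x_1)} = \frac{x_2}{s_2(x_2)} = \ldots = \frac{x_p}{s_p(x_p)}$ and $x_1 + x_2 + \ldots + x_p = n$:
1. The upper line U is drawn through the points (0, 0) and $\left(\frac{n}{p}, \max_i s_i\left(\frac{n}{p}\right)\right)$, and the lower line L is drawn through the points (0, 0) and $\left(\frac{n}{p}, \min_i s_i\left(\frac{n}{p}\right)\right)$, as shown in Figure 4.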
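2. Let $x_i^{(u)}$ and $x_i^{(l)}$ be the coordinates of the intersection points of lines U and L with the function $s_i(x)$ ($1 \le i \le p$). If there exists $i \in \{1, \ldots, p\}$ such that $x_i^{(l)} - x_i^{(u)} \ge 1$, then go to step 3, else go to step 5.
3. Bisect the angle between lines U and L by the line M. Let $x_i^{(m)}$ be the coordinates of the intersection points of line M with the functions $s_i(x)$ ($1 \le i \le p$).
4. If $\sum_{i=1}^{p} x_i^{(m)} \le n$, then U = M, else L = M. Go to step 2.
5. For each $1 \le i \le p$, set $n_i = \lfloor x_i^{(u)} \rfloor$.
Proposition 1. […] Then Algorithm 1.1 finds the $n_i$ such that either $n_i = \lfloor x_i \rfloor$ or $n_i = \lfloor x_i \rfloor - 1$ for $1 \le i \le p$, where $\frac{x_1}{s_1(x_1)} = \frac{x_2}{s_2(x_2)} = \ldots = \frac{x_p}{s_p(x_p)}$ and $x_1 + x_2 + \ldots + x_p = n$.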
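Proof. First we formulate a few obvious properties of the functions $s_i(x)$. Since $s_i(x)$ are continuous and bounded, the initial lines U and L always exist. Since there is no more than one point of intersection of the line L with each of $s_i(x)$, L will make a positive angle with the x-axis. Thus, both U and L will intersect each $s_i(x)$ in exactly one point. Let $x_i^{(u)}$ and $x_i^{(l)}$ be the coordinates of the intersection points of U and L with $s_i(x)$ ($1 \le i \le p$) respectively. Then by design, $\sum_{i=1}^{p} x_i^{(u)} \le n \le \sum_{i=1}^{p} x_i^{(l)}$. This invariant will hold after each iteration of the algorithm. Indeed, if line M bisects the angle between lines U and L, then $\angle(L, X) \le \angle(M, X) \le \angle(U, X)$, and hence $\sum_{i=1}^{p} x_i^{(u)} \le \sum_{i=1}^{p} x_i^{(m)} \le \sum_{i=1}^{p} x_i^{(l)}$, where $x_i^{(m)}$ are the coordinates of the intersection points of M with $s_i(x)$.
If $\sum_{i=1}^{p} x_i^{(m)} \le n$, then $\sum_{i=1}^{p} x_i^{(u)} \le \sum_{i=1}^{p} x_i^{(m)} \le n \le \sum_{i=1}^{p} x_i^{(l)}$ and after step 4 of the algorithm $\sum_{i=1}^{p} x_i^{(u)} \le n \le \sum_{i=1}^{p} x_i^{(l)}$. If $\sum_{i=1}^{p} x_i^{(m)} \ge n$, then $\sum_{i=1}^{p} x_i^{(u)} \le n \le \sum_{i=1}^{p} x_i^{(m)} \le \sum_{i=1}^{p} x_i^{(l)}$ and after step 4 of the algorithm $\sum_{i=1}^{p} x_i^{(u)} \le n \le \sum_{i=1}^{p} x_i^{(l)}$. Thus, after each iteration of the algorithm, the "ideal" optimal line O such that $x_1 + x_2 + \ldots + x_p = n$ will be lying between lines U and L.
When the algorithm reaches step 5, we have $x_i^{(l)} - x_i^{(u)} < 1$ for all $i \in \{1, \ldots, p\}$, which means that the interval $[x_i^{(u)}, x_i^{(l)}]$ contains at most one integer value. Therefore, either $n_i = \lfloor x_i^{(u)} \rfloor = \lfloor x_i \rfloor$ or $n_i = \lfloor x_i^{(u)} \rfloor = \lfloor x_i \rfloor - 1$.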
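The bisection idea behind Algorithm 1.1, as it can be read from the proof above, can be sketched in Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the speed functions are positive callables for which s(x)/x is strictly decreasing, the initial lines are assumed to pass through (n/p, max s_i(n/p)) and (n/p, min s_i(n/p)), and the intersection of a line with a speed graph is found numerically by bisection (the paper does not prescribe how this equation is solved). The names `intersect` and `approximate_allocation` are hypothetical.

```python
import math

def intersect(s, slope, iterations=100):
    """Return x > 0 with s(x) = slope * x, assuming s(x) / x is strictly decreasing."""
    lo, hi = 0.0, 1.0
    while s(hi) > slope * hi:       # grow the bracket until the line is above the graph
        lo, hi = hi, 2.0 * hi
    for _ in range(iterations):     # plain bisection on x
        mid = (lo + hi) / 2.0
        if s(mid) > slope * mid:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def approximate_allocation(n, speeds, max_rounds=200):
    """Narrow two lines through the origin around the ideal line, then round down."""
    p = len(speeds)
    x0 = n / p
    slope_u = max(s(x0) for s in speeds) / x0   # upper line U
    slope_l = min(s(x0) for s in speeds) / x0   # lower line L
    for _ in range(max_rounds):
        xu = [intersect(s, slope_u) for s in speeds]
        xl = [intersect(s, slope_l) for s in speeds]
        if all(b - a < 1.0 for a, b in zip(xu, xl)):
            break                               # at most one integer left per processor
        # Line M bisects the angle between U and L.
        slope_m = math.tan((math.atan(slope_u) + math.atan(slope_l)) / 2.0)
        xm = [intersect(s, slope_m) for s in speeds]
        if sum(xm) <= n:
            slope_u = slope_m                   # M becomes the new upper line
        else:
            slope_l = slope_m                   # M becomes the new lower line
    return [math.floor(x) for x in xu]
```

Because the intersections are computed in floating point, a value that is mathematically an exact integer may be rounded down by one; the subsequent refining step is expected to absorb this.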
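Algorithm 1.2. Iterative incrementing of some $n_i$ until $n_1 + n_2 + \ldots + n_p = n$:
1. If $n_1 + n_2 + \ldots + n_p < n$ then go to step 2 else stop the algorithm.
2. Find $k \in \{1, \ldots, p\}$ such that $\frac{n_k + 1}{s_k(n_k + 1)} = \min_{1 \le i \le p} \frac{n_i + 1}{s_i(n_i + 1)}$.
3. $n_k = n_k + 1$. Repeat step 1.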
Note. It is worth stressing that Algorithm 1.2 cannot be used to search for the optimal solution beginning from an arbitrary approximation n i satisfying inequality n 1 + n 2 + … + n p < n, but only from the approximation found by Algorithm 1.1.
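The refining step can be sketched in Python as follows. The sketch assumes that each increment goes to the processor whose execution time after receiving one more chunk would be smallest, and it uses a binary heap to pick that processor; the heap is an implementation choice of this sketch, and the name `refine_allocation` is hypothetical. As the note above stresses, the result is only meaningful when the starting allocation comes from Algorithm 1.1.

```python
import heapq

def refine_allocation(n, allocation, speeds):
    """Increment entries of `allocation` one chunk at a time until they sum to n.

    speeds[i] is the callable s_i; the key of processor i in the heap is the
    time it would need if it were given one more chunk.
    """
    alloc = list(allocation)
    heap = [
        ((a + 1) / s(a + 1), i)
        for i, (a, s) in enumerate(zip(alloc, speeds))
    ]
    heapq.heapify(heap)
    remaining = n - sum(alloc)
    while remaining > 0:
        _, k = heapq.heappop(heap)          # processor with the cheapest increment
        alloc[k] += 1
        remaining -= 1
        next_time = (alloc[k] + 1) / speeds[k](alloc[k] + 1)
        heapq.heappush(heap, (next_time, k))
    return alloc

refined = refine_allocation(30, [19, 9], [lambda x: 100.0, lambda x: 50.0])
```

Each increment costs a logarithmic number of heap operations in p, compared with a linear scan over all processors in a naive implementation.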
Proposition 2. Let the functions $s_i(x)$ ($1 \le i \le p$) satisfy the conditions of Proposition 1. Let $(n_1, n_2, \ldots, n_p)$ be the approximation found by Algorithm 1.1. Then Algorithm 1.2 gives the optimal allocation.
Proof. The execution time obtained with allocation $(n_1, n_2, \ldots, n_p)$ is given by $\max_i \frac{n_i}{s_i(n_i)}$. The geometrical interpretation of this formula is as follows. Let $M_i$ be the straight line connecting the points (0, 0) and $(n_i, s_i(n_i))$. Then $\frac{n_i}{s_i(n_i)} = \cot \angle(M_i, X)$. Therefore, minimization of $\max_i \frac{n_i}{s_i(n_i)}$ is equivalent to maximization of $\min_i \angle(M_i, X)$. Let $\{S_1, S_2, \ldots\}$ be the set of all straight lines such that:
• $S_k$ connects (0, 0) and $(m, s_i(m))$ for some $i \in \{1, \ldots, p\}$ and some integer m,
• $S_k$ lies below $M_i$ for any $i \in \{1, \ldots, p\}$.
Let $\{S_1, S_2, \ldots\}$ be ordered in decreasing order of $\angle(S_k, X)$. The execution time of the allocation $(n_1, n_2, \ldots, n_p)$ is represented by the line $M_k$ such that $\angle(M_k, X) = \min_i \angle(M_i, X)$. Incrementing any $n_i$ means moving one more line from the set $\{S_1, S_2, \ldots\}$ into the set of lines representing the allocation. At each step of the incrementing, Algorithm 1.2 moves the line making the largest angle with the x-axis. This means that after each increment the algorithm gives the optimal allocation $(n_1, n_2, \ldots, n_p)$ under the assumption that the total number of chunks to be allocated is equal to $n_1 + n_2 + \ldots + n_p$ (any other increment gives a smaller angle and, hence, a longer execution time). Therefore, after the last increment the algorithm gives the optimal allocation $(n_1, n_2, \ldots, n_p)$ under the assumption that $n_1 + n_2 + \ldots + n_p = n$.
Complexity
In this section, we estimate the complexity of Algorithm 1. We start with the complexity of Algorithm 1.1.
Definition. The heterogeneity of the set of p physical processors $P_1, P_2, \ldots, P_p$ of respective speeds $s_1(x), s_2(x), \ldots, s_p(x)$ is bounded if and only if there exists a constant c such that $\max_{x \in R_+} \frac{s_{\max}(x)}{s_{\min}(x)} \le c$, where $s_{\max}(x) = \max_i s_i(x)$ and $s_{\min}(x) = \min_i s_i(x)$.
Proposition 3. Let the functions $s_i(x)$ ($1 \le i \le p$) satisfy the conditions of Proposition 1 and the heterogeneity of processors $P_1, P_2, \ldots, P_p$ be bounded. Then, the complexity of Algorithm 1.1 is $O(p \times \log_2 n)$.
Proof. First, we estimate the complexity of one iteration of Algorithm 1.1. At each iteration we need to find the points of intersection of p graphs y = s 1 (x), y = s 2 (x), ..., y = s p (x) and a straight line y = a × x. In other words, at each iteration we need to solve p equations of the form a × x = s i (x). As we need the same constant number of operations to solve each equation, the complexity of this part of one iteration will be O(p). The test for stopping (step 2 of the algorithm) also takes a constant number of operations per function s i (x) making the complexity of this part of one iteration O(p). Therefore, overall the complexity of one iteration of Algorithm 1.1 will be O(p).
Next, we estimate the number of iterations of this algorithm. To do it, we use the following lemma that states one important property of the initial lines U and L obtained at the step 1 of Algorithm 1.1.
Lemma 3.1. Let the functions $s_i(x)$ ($1 \le i \le p$) satisfy the conditions of Proposition 1 and the heterogeneity of processors $P_1, P_2, \ldots, P_p$ be bounded. Let O be the point (0, 0), $A_i$ be the point of intersection of the initial line U and $s_i(x)$, and $B_i$ be the point of intersection of the initial line L and $s_i(x)$. Then, there exist constants $c_1$ and $c_2$ such that $c_1 \le \frac{OB_i}{OA_i} \le c_2$ for any $i \in \{1, \ldots, p\}$.
Proof of Lemma 3.1. The full proof of Lemma 3.1 is technical and very lengthy. Here, we give a relatively compact proof of the lemma under the additional assumption that the functions $s_i(x)$ are monotonically decreasing. First, we prove that there exist constants $c_1$ and $c_2$ such that $c_1 \le \frac{OB}{OA} \le c_2$, where A is the point of intersection of the initial line U and $s_{\max}(x) = \max_i s_i(x)$, and B is the point of intersection of the initial line L and $s_{\max}(x)$ (see Figure 5). Since the heterogeneity of the processors $P_1, P_2, \ldots, P_p$ is bounded, there exists a constant c such that $\max_{x \in R_+} \frac{s_{\max}(x)}{s_{\min}(x)} \le c$. In particular, this means that […] and […]. Let us prove that […]. This proves Lemma 3.1.
Bisection of the angle at the very first iteration will divide the segment $A_iB_i$ of the graph of the function $s_i(x)$ in the proportion […] (see Figure 6). Indeed, let […] and […]. We have […]. Therefore, […] and […]. Hence, […].
This means that after this bisection, at least ∆ × 100% of the possible solutions will be excluded from consideration for each processor $P_i$. The difference in length between $OB_i$ and $OA_i$ will be getting smaller and smaller with each next iteration. Therefore, no less than ∆ × 100% of the possible solutions will be excluded from consideration after each iteration of Algorithm 1.1. The number of possible solutions in the initial set for each processor $P_i$ is obviously less than n. The constant ∆ does not depend on p or n (actually, this parameter just characterizes the heterogeneity of the set of processors). Therefore, the number of iterations k needed to arrive at the final solution can be found from the equation $(1 - \Delta)^k \times n = 1$, and we have $k = \frac{\log_2 n}{\log_2 \frac{1}{1 - \Delta}}$. Thus, the overall complexity of Algorithm 1.1 will be $O(p \times \log_2 n)$. Proposition 3 is proved.
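For reference, the iteration count follows from the stated equation by taking logarithms. The short derivation below is a worked restatement and assumes only that 0 < ∆ < 1.

```latex
\begin{align*}
(1-\Delta)^k \times n = 1
  &\iff k \,\log_2 (1-\Delta) = -\log_2 n \\
  &\iff k = \frac{\log_2 n}{\log_2 \dfrac{1}{1-\Delta}} .
\end{align*}
```

Since ∆ depends neither on p nor on n, the denominator is a positive constant, so k grows as $\log_2 n$; multiplied by the O(p) cost of one iteration, this gives the stated bound.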
Note. The low complexity of Algorithm 1.1 is mainly due to the bounded heterogeneity of the processors. This very property guarantees that each bisection will reduce the space of possible solutions by a fraction bounded from below by some finite positive number independent of n.
The assumption of bounded heterogeneity will be inaccurate if the speed of some processors becomes too slow for large n, effectively approaching 0. One approach to this problem is to use a relaxed functional model where the speed of the processor is represented by a continuous function until some given size of the problem and by zero for all sizes greater than this one. Data partitioning algorithms with that model are presented by Lastovetsky and Reddy (2005). The other approach is to use algorithms not sensitive to the shape of performance functions, such as the algorithm of complexity $O(p^2 \times \log_2 n)$ presented in Lastovetsky and Reddy (2004).
Proposition 4. Let the functions $s_i(x)$ ($1 \le i \le p$) satisfy the conditions of Proposition 1 and the heterogeneity of processors $P_1, P_2, \ldots, P_p$ be bounded. Then, the complexity of Algorithm 1 is $O(p \times \log_2 n)$.
Proof. If $(n_1, n_2, \ldots, n_p)$ is the approximation found by Algorithm 1.1, then $n - 2 \times p \le n_1 + n_2 + \ldots + n_p \le n$ and Algorithm 1.2 gives the optimal allocation in at most 2 × p steps of increment, so that the complexity of Algorithm 1.2 is $O(p^2)$. This complexity is given by a naïve implementation of Algorithm 1.2. The complexity of this algorithm can be reduced down to $O(p \times \log_2 p)$ by using ad hoc data structures (Beaumont et al. 2001a). Thus, overall the complexity of Algorithm 1 will be $O(p \times \log_2 p + p \times \log_2 n) = O(p \times \log_2 (p \times n))$. Since p < n, we have $\log_2 (p \times n) < \log_2 (n \times n) = \log_2 (n^2) = 2 \times \log_2 n$. Thus, the overall complexity of Algorithm 1 will be bounded by $O(2 \times p \times \log_2 n) = O(p \times \log_2 n)$.
Application of the Partitioning Algorithm
In this section, we apply the set partitioning algorithm to a matrix multiplication application using horizontal striped partitioning of matrices on a network of p heterogeneous computers. Our main aim is not to show how matrices can be efficiently multiplied but to explain in simple terms how the set partitioning algorithm using the functional model can be applied to optimally schedule computational tasks on networks of heterogeneous computers.
The matrix multiplication application shown in Figure 7(a) multiplies matrix A and matrix B, i.e. implements the matrix operation C = A × B, where A, B, and C are dense square n × n matrices. The matrices A and C are horizontally sliced such that the number of elements in a slice is proportional to the speed of the processor owning the slice. All the processors contain all the elements of matrix B. We assume a one-process-per-processor configuration.
For this application, the absolute speed of the processor is obtained based on multiplication of two dense matrices of size $n_1 \times n$ and $n \times n$ respectively to obtain a resultant matrix of size $n_1 \times n$, as shown in Figure 7(b). The size of the problem is represented by two parameters, $n_1$ and n. The total number of matrix elements to store and process is $(2 \times n_1 \times n + n \times n)$. We use a combined computation unit, which is made up of one addition and one multiplication, to express the volume of computation. If n is large enough, the total number of computation units needed to solve this problem will be approximately equal to $n_1 \times n \times n$. Therefore, the speed of the processor exposed by the application when solving the problem of size $(n_1, n)$ can be calculated as $n_1 \times n \times n$ divided by the execution time of the application. This gives us a function, $f: N^2 \to R_+$, mapping problem sizes to speeds of the processor. The functional performance model of the processor is obtained by continuous extension of function $f: N^2 \to R_+$ to function $g: R_+^2 \to R_+$ ($f(n, m) = g(n, m)$ for any (n, m) from $N^2$).
A practical procedure to build the functional performance model of a processor is given by Lastovetsky, Reddy and Higgins (2006). In brief, the algorithm presented there exploits historic records of workload fluctuations of the processor in order to minimize the number of experimental points needed to accurately approximate the performance band by a piecewise linear function fitting within the band.
The speed function is geometrically represented by a surface, as shown in Figures 8(a) and 8(b) for two processors X1 and X5 used in the experiments, whose specifications are shown in Table 2. Figure 8(c) shows the geometrical representation of the relative speed of these two processors calculated as the ratio of their absolute speeds. One can see that the relative speed varies significantly depending on the value of variables $n_1$ and n.
When partitioning a square n × n matrix, we use the fact that the width of partitions is fixed and equal to n. Firstly, we section the surfaces representing the absolute speeds of the processors by the plane that is parallel to the axis representing the parameter $n_1$ and to the axis representing the absolute speed of the processor, and that has an intercept of n on the axis representing the parameter n. This is illustrated in Figure 8(d) for two surfaces representing the absolute speeds of the processors X1 and X5. In this way we obtain a set of p curves on this plane that represent the absolute speeds of the processors against variable $n_1$, given that parameter n is fixed. Then we apply the set partitioning algorithm to this set of p curves to obtain the optimal distribution of slices in matrices A and C.
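The sectioning step can be sketched in Python. The sketch assumes each speed surface is available as a callable g(n1, n) returning combined computation units per time unit, and it treats one matrix row as one chunk, so that a row costs n × n units and the surface value is divided by n × n to obtain rows per time unit. The function name `section_surfaces` and the two synthetic surfaces are hypothetical and are not measurements from the paper.

```python
def section_surfaces(surfaces, n):
    """Fix the matrix width n and return one speed curve per processor.

    Each returned curve maps a number of rows n1 to a speed in rows per
    time unit, which is the form expected by a set partitioning routine.
    """
    def make_curve(g):
        return lambda n1: g(n1, n) / (n * n)
    return [make_curve(g) for g in surfaces]

# Synthetic surfaces for illustration: the second processor slows down once
# the stored elements (2 * n1 * n + n * n) exceed a made-up memory limit.
MEMORY_LIMIT = 3_000_000

def g_fast(n1, n):
    return 2.0e8

def g_paging(n1, n):
    elements = 2 * n1 * n + n * n
    if elements <= MEMORY_LIMIT:
        return 1.0e8
    return 1.0e8 * MEMORY_LIMIT / elements

curves = section_surfaces([g_fast, g_paging], 1000)
```

The list `curves` can then be passed, together with the number of rows n, to an implementation of the set partitioning algorithm. Dividing every curve by the same constant n × n does not change which allocation equalizes the ratios $n_i / s_i(n_i)$.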
Experimental results
A small heterogeneous local network of 8 different Solaris and Linux workstations shown in Table 2 is used in the experiments. The network is based on 100 Mbit Ethernet with a switch enabling parallel communications between the computers. The amount of memory, which is the difference between the main memory and free main memory shown in the tables, is used by the operating system processes and a few other user application processes that perform routine computations and communications such as e-mail clients, browsers, text editors, audio applications etc. These processes use a constant percentage of CPU. Figure 9 shows the speedup of using the network of heterogeneous computers shown in Table 2 over the most powerful computer X1. The speedup calculated is the ratio of execution time of the serial matrix multiplication application using X1 over the execution time of the parallel matrix multiplication application, described in Section 3.2, using the functional model on the network of heterogeneous computers. Figure 10 shows the speedup of the matrix multiplication application, described in Section 3.2, executed on this network using the functional model over the matrix multiplication using the single number model. In the figures, for each problem size, the speedup calculated is the ratio of the execution time of the application using the single number model over the execution time of the application using the functional model.
We consider three cases for comparing the functional model with the single number model in the range (1000, 10 000) of matrix sizes. For the first case the single number model uses speed obtained by multiplying matrices of sizes 500 × 1000 and 1000 × 1000. This case covers the range of small sized matrices. The single number model for the second case uses speed based on multiplication of matrices of sizes 2500 × 5000 and 5000 × 5000. This case covers the range of medium sized matrices. For the third case, speed obtained by multiplying matrices of sizes 4000 × 8000 and 8000 × 8000 is used. This case covers the large sized matrices. The ratios of speeds of the most powerful computer X1 and the least powerful computer X5 in these cases are 13.5, 3.75, and 57.0 respectively.
It can be seen from the figure that the single number model in the first case performs poorly in the range of medium sized to large sized matrices. In the second case the single number model does not perform well for small sized and large sized matrices. In the third case the single number model does not perform well in the range of small sized and medium sized matrices and for large sized matrices with problem size greater than (8000, 8000). Therefore the functional model performs better than the single number model for a network of heterogeneous computers when one or more tasks do not fit into the main memory of the processors and when relative speeds vary with the problem size. It can be concluded that our set partitioning algorithm using the functional model performs better for all sizes of matrices.
Fig. 9 Speedup of matrix multiplication application using the network of heterogeneous computers shown in Table 2 over the matrix multiplication application using the most powerful computer X1.
Conclusion
In this paper, we addressed the problem of optimal distribution of computational tasks on a network of heterogeneous computers when one or more tasks do not fit into the main memory of the processors and when relative speeds vary with the problem size. We have proposed and analyzed the functional performance model of heterogeneous processors. This model integrates many essential features of a network of heterogeneous computers having a major impact on its performance, such as the processor heterogeneity, the heterogeneity of memory structure, and the effects of paging. Under this model, the speed of each processor is represented by a continuous function of the size of the problem, whereas traditional models use single numbers to represent the speeds of the processors. We have formulated a problem of partitioning of an n-element set over p heterogeneous processors using this model and designed an algorithm of complexity $O(p \times \log_2 n)$ solving the problem. Some early results on the functional model and data partitioning with this model were presented by Lastovetsky and Reddy (2004).
Fig. 10 The speedup of the parallel matrix multiplication application using the functional model over the single number model on the network of heterogeneous computers shown in Table 2. The speeds used in the single number model in the three curves for comparison are obtained using serial matrix multiplication of matrices of problem sizes $(n_1, n)$ = (500, 1000), (2500, 5000), and (4000, 8000) respectively.
