Abstract-The problem of optimal divisible load distribution in distributed bus networks employing a heterogeneous cluster of processors is addressed. The objective is to minimize the total processing time of the entire load subject to the communication and computation delays. In the mathematical model we adopt, both the granularity of the load fractions and all the associated overheads (also referred to as start-up costs) in the process of communication and computation are considered explicitly in the problem formulation. We introduce a directed flow graph model for representing the load distribution process. This representation is novel to this literature. With this model, we first derive a closed-form solution for an optimal processing time. We propose an integer approximation algorithm and derive ultimate performance bounds for the class of homogeneous networks. We then extend the problem to a special class of application problems in which the data partitioning is restricted to a finite number of partitions. For this case, we present a recursive procedure to obtain the optimal processing time. We then present two different integer approximation algorithms, PIA and IIA, that generate integer load fractions and yield suboptimal solutions. The choice between these algorithms is also analyzed. All the results are extended to a class of homogeneous networks to obtain ultimate performance bounds. Several illustrative examples are provided for ease of explanation.
Suboptimal Solutions Using Integer Approximation Techniques for Scheduling Divisible Loads on Distributed Bus Networks
Bharadwaj Veeravalli, Member, IEEE, and N. Viswanadham, Fellow, IEEE
Index Terms-Communication delays, directed graph, divisible loads, granularity, integer approximation, processing time.
I. INTRODUCTION

Manuscript received June 20, 1999; revised September 23, 2000. This work was supported in part by the Department of Electrical and Computer Engineering, The National University of Singapore, under research Grant R-263-000-073-112. This paper was recommended by Associate Editor K. Pattipati.
B. Veeravalli is with the Department of Electrical and Computer Engineering, The National University of Singapore, Singapore 119260 (e-mail: elebv@nus.edu.sg).
N. Viswanadham is with the Department of Mechanical and Production Engineering, The National University of Singapore, Singapore 119260 (e-mail: mpenv@nus.edu.sg).
Publisher Item Identifier S 1083-4427(00)10875-6.

RESEARCH in the area of divisible load theory (DLT) has had a significant impact, both in its contributions to the parallel and distributed scheduling literature and in the applicability of its results to a wide range of research disciplines [2]-[8], [10], [11]. This multidisciplinary research has attracted a large pool of researchers since 1988, beginning with the first introductory work on bus-based multiprocessor systems by Cheng and Robertazzi [2]. Its main attraction is the simplicity of formulating and modeling the problem in a linear fashion while still producing a rich set of results that reveals many interesting properties of network-based scheduling schemes. The communication and computation delays are assumed to be proportional to the size of the load that the link carries and the processor computes, respectively. The processing load in this domain is assumed to be arbitrarily divisible, with no precedence relationships between the partitioned fractions. Such loads are also computationally intensive; hence, partitioning the entire load and scheduling it among the available processors on the network minimizes the overall finish time. However, owing to the communication and computation delay constraints on the links and the processors, the partitioning schemes usually practiced in the parallel processing literature do not yield optimal, or even acceptable, performance in scheduling these divisible loads. The performance metric [13] considered in these minimization problems is the total processing time of the entire load. The primary objective of this theory is to decide how to partition the available load and schedule it among the processors so that the processing time is minimized.
We now present a brief summary of some of the relevant recent literature in DLT. So far, the above-mentioned problem has been studied for several architectures, such as single-level tree networks [3], [7], linear networks [4], [14], [17], meshes [5], [6], hypercubes [6], and bus networks [2], [7], [8]. A compilation of the results until 1995 appears in [7]. Issues related to multiprocessor systems, such as fault tolerance [16] and scheduling loads subject to the availability of processors (referred to as release times) [19], have also been studied. The time-varying nature of the processor and channel speeds is considered by Jeeho et al. [19] in the design and analysis of load distribution strategies. In [20], [21], using additive overhead factors that influence the communication and computation times, a closed-form solution for an optimal processing time is derived and the effect of load sequencing is rigorously analyzed. Also in [20], [21], an integer approximation technique is proposed and an ultimate time performance bound is derived. We present that algorithm in Section III of this paper for the purpose of continuity. More recently, the trend in this domain has been toward tackling realistic applications that are often found in practice. For example, the problem of matrix-vector product computation for very large matrices is analyzed by Ghose et al. [8] using the strategies developed in DLT. As an alternative to processing time minimization, Jeeho et al. [15] considered the minimization problem from a monetary cost perspective, namely minimizing the total monetary cost incurred in scheduling divisible loads on bus networks.
A. Research Contributions
Our research contributions in this paper are multifold and are novel to the DLT literature. This research is an attempt to bring the results of the theory so far closer to practice by tuning the model and by proposing a load distribution strategy and solution that can be realized in practice. This contribution gives an estimate of how far the solution proposed by the theory lies from the realistic situation. To this end, we tackle two different but closely related problems. First, we introduce a directed flow graph (DFG) representation to describe the load distribution process. We believe that this directed flow graph is a generalized representation, and this is the first time that such a representation is introduced in this domain. It provides complete flexibility in representing any complex schedule and in analyzing its performance. Apart from serving as an alternative tool to the timing diagram representation [7], this representation is simple to use and makes deriving an optimal solution less cumbersome. We use this representation throughout this paper. We derive a closed-form solution for an optimal processing time using a mathematical model that accounts for all the overhead components that penalize the performance. We show the impact of these overheads, which are present in reality, on scheduling divisible loads. Though this model was considered earlier in the literature [6], closed-form solutions, load sharing conditions, and the impact of these overheads on the performance were not analyzed. Secondly, for a class of applications that demand the load to be divided into a finite number of partitions, we first derive the real-valued optimal load distribution and the processing time.
Thirdly, from an applications perspective, the processing load may not, in general, be truly arbitrarily divisible. An example of this scenario is computing a large matrix-vector product on parallel and distributed systems [8], or any image processing application [13]. A large matrix or vector may be partitioned for ease of computation. In this case, the arbitrarily divisible property no longer holds, as the load distributed among the processors is expressed as a number of rows or columns and not as a fraction of the rows or columns. Thus, if our model can be tuned to fit this requirement, then the strategies proposed by this theory would suit any application. For instance, in the above-mentioned applications, the load assigned to a processor is usually a certain number of rows or columns; hence, the load assigned to a processor is an integral multiple of some fundamental quantity. This fundamental quantity is referred to as the divisibility factor and is defined as the minimum possible granularity of any load fraction that can be assigned to a processor. In practice, the choice of an appropriate grain size (granularity) at runtime often becomes an important factor, as the effective speed of the network and the processors may vary due to several effects. Thus, one of the main concerns of this paper is to propose integer approximation techniques that generate integer-valued load sizes. One may employ a simple rounding-off procedure; however, in that case there is no guarantee that the performance will be within acceptable limits, and it is also difficult to quantify the performance by means of bounds.
Our proposed algorithms follow a systematic way of generating the integer-valued load fractions and guarantee that the time performance converges within an acceptable radius from the optimal solution for the case of homogeneous bus networks. This is the first time that the effect of integer approximation techniques is studied in the DLT literature.
The paper is organized as follows. Section II introduces the problem and describes the load distribution strategy. It also introduces the DFG representation for scheduling and derives a closed-form solution for the processing time. Section III proposes an integer approximation algorithm and obtains an ultimate time performance bound. Section IV tackles another closely related problem in which partitioning of the load is restricted to a finite number of partitions. In that section, we also propose two different integer approximation algorithms and obtain ultimate performance bounds. Section V discusses the results obtained and Section VI concludes the paper.
II. PROBLEM SETTING AND SOME REMARKS
We consider a bus network architecture as shown in Fig. 1. The network may or may not have a dedicated control processor or a bus controller unit (BCU) [7] to distribute the load among the processors. In this paper, we present a rigorous analysis of the time performance of the system for the case when the network has a dedicated BCU to distribute the entire load. It is worth mentioning at this stage that a bus network is equivalent to a single-level tree network or a star network when all the links have identical speeds [3], [7]. Thus, all the research contributions in this paper also hold for this special class of single-level tree networks.
A. Load Distribution Strategy and Some Definitions
The load distribution strategy is described as follows. The divisible load is assumed to originate at the BCU. The BCU divides the load into load fractions and distributes them among all the processors in a particular sequence, one after the other. Upon receiving its load fraction, each processor starts computing it. The problem is then to determine the optimal sizes of the load fractions assigned to the processors such that the total processing time is a minimum. This strategy is referred to as the single installment strategy in the literature [7]. We now introduce some notation that will be used throughout the paper.
The notation covers the following quantities: the fraction of the load assigned to a processor; the inverse of the computation speed of a processor; the time taken by the standard processor to process a unit load; an additive computation overhead component that includes the sum of all delays associated with the computation process; the inverse of the communication speed of the link; the time taken by the communication link to transmit a unit load; and an additive communication overhead component that includes the sum of all delays associated with the communication process. Throughout the paper, shorthand symbols denote two products: the inverse communication speed times the unit transmission time, and the inverse computation speed times the unit processing time. With this notation, the communication time of a load fraction is the communication overhead plus the fraction multiplied by the first product, and the computation time of this load fraction is the computation overhead plus the fraction multiplied by the second product [2]-[19].
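The delay model described above can be written compactly as follows. This is a sketch under assumed notation, since the symbols themselves are not reproduced in this text: $\alpha_i$ for the load fraction of processor $p_i$, $w_i$ and $z$ for the inverse computation and communication speeds, $T_{cp}$ and $T_{cm}$ for the unit processing and transmission times, $\theta_{cp}$ and $\theta_{cm}$ for the additive overheads, and $E_i$, $C$ for the two products.

```latex
% Assumed shorthand for the two products (symbol names are illustrative)
C = z\,T_{cm}, \qquad E_i = w_i\,T_{cp}

% Communication and computation time of a load fraction \alpha_i
T_{\mathrm{comm}}(\alpha_i) = \theta_{cm} + \alpha_i\,C, \qquad
T_{\mathrm{comp}}(\alpha_i) = \theta_{cp} + \alpha_i\,E_i
```

Both delays are affine in the load fraction: a fixed start-up cost plus a part proportional to the size of the fraction.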
B. Directed Flow Graph
The load distribution process is described by means of a directed flow graph as shown in Fig. 2. The figure shows two types of nodes. The first level of nodes are referred to as communication nodes and the other nodes are referred to as computation nodes. The weight of a communication node is the communication time of the load fraction it carries, and the weight of a computation node is the computation time of that load fraction. The directed arrow between two adjacent communication nodes represents the fact that the communication of a load fraction to a processor starts only after the communication of the preceding fraction to the preceding processor is completed. Similarly, the directed arrow between a communication node and a computation node denotes the fact that a processor starts computing its load fraction only after receiving the entire fraction from the BCU. These directed arrows represent the causal precedence relationships between the events. Thus, this directed flow graph represents the load distribution process completely. The main advantage of this representation is its simplicity. In DLT, as a principle of optimality, an optimal solution is obtained when all the participating processors stop computing at the same time. This fact is captured in the flow graph by equating the finish time paths (defined in Section II-C) and solving the resulting recursive equations. The timing diagram representation is elegant and easy to visualize only for simple load distribution strategies. Using a timing diagram to represent an optimal schedule for complex strategies, such as the multi-installment strategy, is extremely difficult and time consuming. As mentioned earlier, we use the directed flow graph representation to derive the optimal processing time throughout the paper.
C. Some Definitions
We shall now define the following.
i) Load distribution: an ordered tuple of load fractions, one per processor, in which every fraction lies between zero and the total load and the fractions sum to the total load. The latter equation is referred to as the normalization equation. The set of all such tuples is the space of all possible load distributions.
ii) Finish time path of a processor: the sum of the weights of the nodes starting from communication node 1 up to the computation node of that processor, along the directed arrows.
iii) Critical path: the maximum of the finish time paths over all processors. This is the longest finish time path in the graph and represents the time at which the entire load is processed.
iv) Optimal path: the minimum processing time to finish processing the entire load, i.e., the minimum of the critical path over the space of all possible load distributions.
It has been rigorously proved in the literature [7] that for the optimal processing time all the finish time paths of the processors must be equal. Here too, we shall use this optimality criterion to analyze the processing time performance of the system.
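The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the affine delay model of Section II-A; the parameter names `C`, `E`, `theta_cm`, and `theta_cp` are hypothetical stand-ins for the per-unit communication time, the per-unit computation times, and the two overheads, and the numbers in the usage line are made up.

```python
from itertools import accumulate


def finish_time_paths(alpha, C, E, theta_cm, theta_cp):
    """Return the finish time path of every processor for a load distribution.

    alpha    : load fractions, in the order of distribution
    C        : communication time per unit load (assumed name)
    E        : computation time per unit load of each processor (assumed name)
    theta_cm : additive communication overhead
    theta_cp : additive computation overhead
    """
    # Weight of each communication node: overhead plus proportional part.
    comm_weights = [theta_cm + a * C for a in alpha]
    # Time at which each processor has received its whole load fraction.
    receive_done = list(accumulate(comm_weights))
    # Add the weight of the processor's own computation node.
    return [done + theta_cp + a * e
            for done, a, e in zip(receive_done, alpha, E)]


def critical_path(alpha, C, E, theta_cm, theta_cp):
    """Longest finish time path, i.e. the processing time of the schedule."""
    return max(finish_time_paths(alpha, C, E, theta_cm, theta_cp))


# Hypothetical two-processor instance in which both paths happen to be equal.
paths = finish_time_paths([2.0, 1.0], C=1.0, E=[2.0, 2.0],
                          theta_cm=1.0, theta_cp=1.0)
```

When all entries of `paths` coincide, as in this instance, the distribution satisfies the optimality criterion quoted above.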
D. Closed-Form Solutions for the Optimal Processing Time
In this section, we derive a closed-form expression for the optimal processing time, assuming that the load is distributed from the first processor to the last, in that order. Throughout this section, this sequence of load distribution will be referred to as a fixed sequence. Further, we show that the overheads, which are inherently present in any realistic system, drastically affect the time performance and may lead to different design decisions. This is crucial whenever the optimality of the solution and the related trade-off studies are important in the design. A detailed discussion is presented in Section V.
From Fig. 2, for an optimal solution, equating the finish time paths of adjacent processors, we obtain the recursive equations (1).
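Since the equations themselves are not reproduced in this text, the following sketch shows the form that the recursion takes under the affine delay model. It assumes the notation $\alpha_i$, $C$, $E_i$, $\theta_{cm}$, $\theta_{cp}$ and $m$ processors (symbol names are assumptions), and it may differ from the paper's own arrangement of (1) and (2).

```latex
% Finish time path of processor p_i
T(i) = \sum_{j=1}^{i} \left(\theta_{cm} + \alpha_j C\right) + \theta_{cp} + \alpha_i E_i

% Equating adjacent paths, T(i) = T(i+1), for i = 1, \dots, m-1
\alpha_i E_i = \theta_{cm} + \alpha_{i+1}\left(C + E_{i+1}\right)

% Solved for the next load fraction
\alpha_{i+1} = \frac{E_i}{C + E_{i+1}}\,\alpha_i - \frac{\theta_{cm}}{C + E_{i+1}}
```

The computation overhead $\theta_{cp}$ cancels when adjacent paths are equated, so only the communication overhead appears in the recursion.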
We rewrite (1) in the form (2), whose coefficients are defined for all the processors in the sequence. Expressing each of the load fractions in terms of a single reference load fraction, we obtain (3).
Thus, (3) gives one linear equation fewer than the number of unknown load fractions, and together with the normalization equation the number of equations equals the number of unknowns. These equations can be solved to obtain the individual load fractions. Using (3) in the normalization equation, we obtain the reference load fraction as (6).
Substituting (6) in (3), we obtain the individual load fractions. From Fig. 2, we obtain the expression for the optimal path as (9).
Using (3) and substituting in (9), we obtain the optimal processing time (10), where all the terms are as defined above. Thus, in the above analysis, we have obtained an optimal solution involving all the processors by solving the set of recursive equations shown above. Following the same steps, we can derive the optimal processing time for a system of processors with a BCU without overheads (both overhead components set to zero). The optimal processing time in this case is given by (11), where the terms are as defined above. It is worth mentioning at this juncture that, with the inclusion of all the overheads, an optimal solution may not exist when one attempts to utilize all the processors of a given system. This behavior can be seen from Fig. 3(a), whose performance curves were generated for a fixed set of speed and overhead parameters. Here, we observe that as we increase the number of processors, the processing time decreases. We also show the influence of the term given by (7), which we will refer to as the overhead factor, in Fig. 3(b). From this figure, we see that the overhead factor increases as the number of processors increases. We also observe that, for the speed and overhead parameters chosen, there is a maximum number of processors that can be utilized with the given sequence of load distribution. Beyond this number, an optimal solution ceases to exist. This is because the value of the overhead factor becomes greater than 1, and hence there is no gain in the time performance even if we attempt to utilize more processors. Therefore, as long as the value of the overhead factor is less than 1, we can utilize all the processors to process the entire load in a minimum amount of time with this fixed sequence.
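The solution procedure described above (express every load fraction in terms of one reference fraction, then apply the normalization equation) can be sketched as follows. This is an illustrative sketch, not the paper's closed form (10): it assumes the affine delay model, takes the first load fraction as the reference, and uses the hypothetical names `L`, `C`, `E`, `theta_cm`, and `theta_cp`.

```python
def optimal_fractions(L, C, E, theta_cm):
    """Real-valued load fractions that equalize all finish time paths.

    L is the total load, C the communication time per unit load, E the list
    of per-unit computation times, theta_cm the communication overhead.
    The result is a valid schedule only if every fraction is positive.
    """
    # Write alpha_i = a[i] * alpha_1 + b[i] using the adjacent-path recursion
    # alpha_{i+1} = (alpha_i * E_i - theta_cm) / (C + E_{i+1}).
    a, b = [1.0], [0.0]
    for i in range(len(E) - 1):
        denom = C + E[i + 1]
        a.append(a[i] * E[i] / denom)
        b.append((b[i] * E[i] - theta_cm) / denom)
    # Normalization equation: the fractions must sum to the total load.
    alpha_1 = (L - sum(b)) / sum(a)
    return [ai * alpha_1 + bi for ai, bi in zip(a, b)]


def optimal_time(L, C, E, theta_cm, theta_cp):
    """Processing time when all finish time paths are equal."""
    alpha = optimal_fractions(L, C, E, theta_cm)
    # All paths are equal, so the first processor's path gives the value.
    return theta_cm + alpha[0] * C + theta_cp + alpha[0] * E[0]


# Hypothetical instance: two identical processors, three units of load.
alpha = optimal_fractions(3.0, 1.0, [2.0, 2.0], 1.0)
```

Setting both overheads to zero in this sketch gives the overhead-free case mentioned in the text.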
Thus, a necessary and sufficient condition to obtain an optimal processing time using all the processors is given by (12). This condition is also referred to as the load sharing condition in the literature. What we observe here is a restricted monotonic behavior of the processing time, as opposed to the case when no overheads are considered in the problem formulation [7]. Given a fixed sequence, whenever an optimal solution using all the processors ceases to exist, we may utilize a maximal subset of the processors, in the same sequence, to obtain the optimal processing time. Of course, condition (12) has to be checked again for this subset of processors.
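A simple way to illustrate the search for a maximal usable subset is to test feasibility directly: solve the equal-finish-time equations for a prefix of the sequence and check that every load fraction is positive. The sketch below does this; it is a stand-in for condition (12), whose exact form is not reproduced here, and all names and numbers are hypothetical.

```python
def optimal_fractions(L, C, E, theta_cm):
    """Load fractions equalizing all finish time paths (may be negative)."""
    a, b = [1.0], [0.0]
    for i in range(len(E) - 1):
        denom = C + E[i + 1]
        a.append(a[i] * E[i] / denom)
        b.append((b[i] * E[i] - theta_cm) / denom)
    alpha_1 = (L - sum(b)) / sum(a)
    return [ai * alpha_1 + bi for ai, bi in zip(a, b)]


def max_usable_processors(L, C, E, theta_cm):
    """Largest prefix of the fixed sequence with all load fractions positive.

    Feasibility is tested directly instead of through a closed-form
    load sharing condition; this is an assumption of the sketch.
    """
    for k in range(len(E), 0, -1):
        if all(x > 0 for x in optimal_fractions(L, C, E[:k], theta_cm)):
            return k
    return 0


# Hypothetical instance: a small load and a noticeable start-up cost, so
# that the fourth processor would have to receive a negative load fraction.
k = max_usable_processors(3.0, 1.0, [2.0, 2.0, 2.0, 2.0], 1.0)
```

In this instance only the first three processors of the sequence can share the load.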
Another observation from Fig. 3(a) concerns the rate at which the processing time decreases. Beyond a certain number of processors, the rate of decrease of the processing time is not significant; hence, one can utilize that many processors (for example, 8) without having to utilize all the processors up to the maximum. This reduces the overhead processing by the system considerably, as can be seen from the increase in the value of the overhead factor for larger numbers of processors. In [20], [21], a rigorous analysis of the influence of this overhead factor is carried out and the effect of load sequencing is also analyzed.
Remarks: As mentioned in Section I, from a practical perspective, a divisible load may not, in general, be truly arbitrarily divisible. An example of this scenario is computing a large matrix-vector product on parallel and distributed systems [12], [13], or any image processing application [9]. For instance, in the above-mentioned applications, the load assigned to a processor is usually a certain number of rows or columns. Hence, the load assigned is an integral multiple of some fundamental quantity, referred to as the divisibility factor and defined as the minimum possible granularity of any load fraction that can be assigned to a processor. Further, we assume that the total load is an integer multiple of the divisibility factor, the multiplier being a variable and the divisibility factor a constant. Without loss of generality, all the subsequent results presented in this paper assume a divisibility factor of one.
III. INTEGER APPROXIMATION ALGORITHM FOR THE SINGLE INSTALLMENT STRATEGY
In this section, for the strategy explained in Section II-D, we propose an algorithm that generates integer load fractions, and we show that this algorithm guarantees a solution that lies within acceptable limits of time performance when compared with the optimal solution. Recall that the optimal solution obtained in Section II-D assumes that the load is arbitrarily divisible and that all the processors stop computing at the same time. The algorithm presented in this section restricts the division of the load to integer values. Hence, once the integer approximation is applied, the processors are not expected to stop computing at the same time instant. The proposed algorithm is similar to a rounding-off procedure in generating integer load fractions, but it takes a fixed number of steps to converge. An interesting feature of the algorithm is that it keeps track of the amount of load scheduled so far and the amount of load that remains to be scheduled. This feature avoids frequent back-tracking and aids in keeping the time performance within acceptable limits.
We take the optimal solution obtained in Section II-D as the initial load distribution of the algorithm. The load fraction of the first processor is set to an integer, and the difference between the real and the integer values of the load fraction assigned to that processor is recorded; the magnitude of this difference is less than one. Next, the load fraction of the following processor is set by one rounding rule provided a condition on the recorded difference holds; if this condition fails to hold, the other rounding rule is used. This process is repeated for all the processors and results in a schedule in which every load fraction is an integer. The central idea of the algorithm lies in approximating the real values of the load fractions by integer values, using either the smallest integer greater than the current real value or the greatest integer smaller than the current real value of the load fraction. This choice is made at every iteration, and the idea is either to "push" the excess load to, or to "accept" the excess load from, the adjacent processors. The algorithm is described below.
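The algorithm listing is not reproduced in this text, so the following is only a sketch in the same spirit, not the paper's algorithm: a cumulative rounding scheme that carries the difference between the real-valued and the integer-valued load scheduled so far to the next processor, and rounds the adjusted value up or down to the nearest integer. The function name and the sample values are hypothetical.

```python
import math


def integer_approximation(alpha):
    """Round a real-valued load distribution to integers, one pass.

    The carry is the real load scheduled so far minus the integer load
    scheduled so far. It stays within [-0.5, 0.5], so it never reaches
    one unit in magnitude, and it returns to zero at the last processor
    whenever the total load is an integer.
    """
    integer_fractions = []
    carry = 0.0
    for a in alpha:
        target = a + carry                 # current real value after push/accept
        k = math.floor(target + 0.5)       # nearest integer: ceiling or floor
        integer_fractions.append(k)
        carry = target - k                 # excess passed on to the next processor
    return integer_fractions


# Hypothetical real-valued distribution of 10 units over three processors.
ints = integer_approximation([4.6, 3.3, 2.1])
```

Because the carry is updated at every step, the scheme needs exactly one step per processor and no back-tracking.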
A. Performance Bounds
In this section, we derive some important properties that lead to the derivation of the performance bounds. One fundamental issue to explore is whether the entire load gets processed at the termination of the algorithm, which is the convergence property of the algorithm. We show below that the algorithm indeed guarantees this.
Lemma 1: In the above algorithm, the difference recorded at every step lies strictly between -1 and 1.
Proof: The proof of the lemma is evident from the working of the above algorithm. Q.E.D.
Lemma 2: Let the differences be defined as in the above algorithm. Then the difference remaining after the last processor is zero.
Proof: We prove this by contradiction. Suppose the above claim is not true, so that the final difference is nonzero. We know that the real-valued load fractions sum to the total load, which is an integer. Rewriting this sum in terms of the integer load fractions and the final difference, and noting that the sum of the integer load fractions and the total load are both integers, the final difference must also be an integer. From Lemma 1, the final difference must then be 0, since 0 is the only integer that lies strictly between -1 and 1. This contradicts our assumption, thus proving the lemma. Q.E.D.
The above lemma justifies the design of the algorithm by showing that, at the completion of the algorithm, the entire load is processed. Now, for a homogeneous system, we show that the processing time obtained by the above algorithm exceeds the optimal processing time by no more than the sum of the communication and computation times of the smallest unit of load. We assume that (12) is satisfied for this sequence of processors, and hence that all the processors participate in processing the load.
Theorem 1: Consider a homogeneous bus network. Take the optimal load distribution under the infinite divisibility assumption, which gives the optimal processing time, and the load distribution generated by the above algorithm, which results in a critical path. Then the critical path of the integer load distribution exceeds the optimal processing time by at most the sum of the communication and computation times of the smallest unit load.
Proof: From Fig. 2, we obtain the finish time paths of the optimal distribution as (13). Similarly, for the integer distribution, we obtain (14). From (13) and (14), we obtain (15) and (16). From Lemma 1, we immediately observe (17) and (18). Hence the proof. Q.E.D.
The above theorem establishes the "near-optimality" of the solution obtained by the above algorithm: it differs from the solution generated under the infinite divisibility assumption by at most the sum of the communication and computation times of the smallest unit load. The following example demonstrates the above algorithm and the theorem.
Example 1: Consider a homogeneous system of processors with given speed and overhead parameters and a load of integer size. Using (10), we obtain the optimal processing time and the corresponding real-valued load distribution. Using the integer approximation algorithm, we obtain the integer load distribution. We observe that the results of Lemma 1 and Lemma 2 hold. From (13), we obtain the finish time of each processor, and the processing time after integer approximation is the largest of these finish times. We see that this processing time satisfies the bound, thus verifying Theorem 1.
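The kind of check carried out in Example 1 can be sketched numerically. The parameters below are hypothetical (they are not those of Example 1, which are not reproduced in this text), the rounding step is the cumulative rounding sketch rather than the paper's algorithm, and the bound is read as `C + E`, the per-unit communication plus computation time without overheads, which is an assumption about Theorem 1.

```python
import math
from itertools import accumulate


def optimal_fractions(L, C, E, theta_cm, m):
    """Equal-finish-time fractions for m identical processors."""
    a, b = [1.0], [0.0]
    for i in range(m - 1):
        a.append(a[i] * E / (C + E))
        b.append((b[i] * E - theta_cm) / (C + E))
    alpha_1 = (L - sum(b)) / sum(a)
    return [ai * alpha_1 + bi for ai, bi in zip(a, b)]


def critical_path(alpha, C, E, theta_cm, theta_cp):
    """Longest finish time path of a distribution on identical processors."""
    received = accumulate(theta_cm + x * C for x in alpha)
    return max(r + theta_cp + x * E for r, x in zip(received, alpha))


def integer_approximation(alpha):
    """Cumulative rounding: carry the rounding error to the next processor."""
    out, carry = [], 0.0
    for x in alpha:
        k = math.floor(x + carry + 0.5)
        carry = x + carry - k
        out.append(k)
    return out


# Hypothetical homogeneous instance: 4 processors, 100 units of load.
L, C, E, theta_cm, theta_cp, m = 100, 1.0, 3.0, 0.5, 0.2, 4
alpha = optimal_fractions(L, C, E, theta_cm, m)
ints = integer_approximation(alpha)
t_opt = critical_path(alpha, C, E, theta_cm, theta_cp)
t_int = critical_path(ints, C, E, theta_cm, theta_cp)
within_bound = t_int <= t_opt + C + E
```

For this instance all real-valued fractions are positive, so the load sharing requirement assumed by the theorem is met.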
IV. CONSTRAINED PARTITIONING
In this section, we tackle another closely related problem of scheduling divisible loads. As mentioned in Section I, we consider the problem of scheduling a divisible load under the constraint that the entire load cannot be divided into more than a given number of partitions. Applications that fall into this category mostly belong to the image and computer vision data processing domain. A typical image processing application of this class is human facial feature detection using edge counting [12]. In this application, the image data cannot be divided into arbitrarily small fractions, since the template used in the feature detection process counts the edge pixels in three distinct rectangular areas of the image simultaneously to find a global maximum. These three rectangular areas are separated by a distance that is dynamically altered according to the size of the image and the density of the edge pixels around the feature areas. The minimum size of each rectangular area should be greater than the maximum possible size of a feature area (eye, mouth, or nose) of the given facial image. Thus, the image demands restricted partitioning for concurrent processing on different processors. The results also hold whenever any restriction on the level of partitioning is imposed. For instance, when the granularity of the data becomes an important factor in processing the data on a distributed system, the approach presented in this paper becomes the natural choice and provides a systematic way to construct an optimal schedule. In reality, the selection of an appropriate grain size (granularity) at runtime often becomes an important factor, as the effective speed of the network and the processors may vary due to other loading effects. We address the problem of scheduling such loads, which cannot be partitioned in a purely arbitrary fashion.
A. Load Distribution Strategy and Recursive Equations
Formally, we state the problem as follows. Let the BCU distribute the processing load from the first processor to the last in at most a given number of installments, i.e., in successive rounds of load distribution. More specifically, we pose the following question.
Given a distributed bus network with homogeneous processors, and given the maximum number of fractions into which the processing load can be divided, what is the optimal load distribution that minimizes the processing time, taking into account all the overhead components that penalize the time performance?
Note that, here too, the sizes of the partitioned load fractions may assume any value between zero and the total amount of load. The entire load distribution process by the BCU is represented in the form of a directed flow graph, as shown in Fig. 4, which contains one communication node per load fraction. The load distribution process takes place in the following manner. Each load fraction is assigned to a processor in a particular installment. Thus, with three processors and five partitions, the first processor is assigned two load fractions in two installments, the second processor is assigned two load fractions in two installments, and the third processor is assigned one load fraction in the first installment. For this case, for ease of understanding, we re-denote the finish time path with an index for the communication node through which it passes. Note that the definition of the finish time path is the same, i.e., the sum of the weights of the nodes starting from communication node 1 up to the last computation node, through the given communication node, along the directed arrows. The load distribution now has one entry per communication node. In order to obtain the optimal processing time, we equate all the finish time paths, each of which can be obtained from Fig. 4 as (19). Note that we have assumed that a processor receives some of the installments, and hence that its last computation node corresponds to the last installment it receives. By equating the finish time paths, we obtain a set of equations with one equation fewer than the number of unknowns. Using the fact that the load fractions sum to the total load, the number of equations equals the number of unknowns. These equations may be solved to obtain all the individual load fractions, as explained in the previous section. However, for computational ease, the above set of path equations can be represented in matrix form. We refer to this matrix as the path matrix. Thus, in matrix notation, we represent (19) as shown in the first equation at the bottom of the page, where each row denotes one finish time path. Now, by equating successive finish time paths, as per the principle of optimality [7], we obtain recursive equations involving the load fractions.
With the normalization equation, we can solve all the equations to obtain the individual load fractions. The differences between successive finish time paths can be represented in matrix form, referred to as a difference matrix. For the same case, the difference matrix is given by the second equation shown at the bottom of the page. Note that each row of this matrix except the last results from the difference of two successive finish time paths, and the last row is the normalization equation. The difference matrix can be generated for arbitrary numbers of processors and partitions, as described in the procedure presented in Table I.
It may be noted that the procedure presented in Table I makes it computationally easy to generate the equations in a matrix form for a given and values. Further, the generated matrix can be reused when either or values are altered. Thus, the solution to our problem lies in simply finding the inverse of the difference matrix and multiplying with the transpose of the RHS column vector. However, it is quite possible that these equations do not yield feasible (nonnegative) values for the load fractions. This means that these equations are not solvable to yield an optimal solution with partitions and using processors in the system. Thus, if an optimal solution does not exist for this particular value of , then a lower value of is to be attempted by following the above mentioned recursive procedure. Note, however, that the difference matrix for lower values of can be extracted directly from the already generated matrix.
Example 2: Consider a homogeneous network consisting of three processors. Let and . Also, let . Further, let the maximum allowed number of partitions be . This is a typical example of an image processing application in which we consider an image of size , and partition the entire image in terms of number of rows. Hence, we let . Following the above mentioned procedure, we obtain the optimal processing time as units, and the optimal load distribution is given by (491.4592, 428.6214, 501.3595, 363.3011, 263.2588), respectively. Thus, for this case any lower value of will only result in more processing time. However, if the processors are much faster, say , then following the above procedure, we see that for and , the optimal solution does not exist, as some of the values of the load fractions are infeasible. Thus, the maximum allowed number of partitions is , for which the processing time is given by, 2048 units, and the optimal load distribution is given by (2027.7, 20.1, 0.2), respectively.
This result is of course intuitively true as the processing speeds are much faster than the communication speeds, and the gain achieved in sharing the total load with more than one processor is very small.
It may be observed that the difference matrix generated for case can be reused for case, by considering the first two rows and three columns of case and appending a string of ones as the last (third) row. Further, it may be noted that the values we have chosen for the parameters are purely arbitrary and this choice was just for the purpose of demonstration.
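The fallback procedure described above (solve the difference-matrix system, check feasibility, retry with fewer unknowns) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the procedure of Table I: the function names `solve_linear`, `reduce_system`, and `largest_feasible`, the parameter `total_load`, and the 3x3 toy matrix are all hypothetical, and the toy values were chosen only so that the largest case is infeasible. The reduction step follows the rule stated in the text: keep the leading rows and columns and append a row of ones as the normalization equation.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [A | b] as floats.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def reduce_system(D, rhs, m, total_load):
    """Extract the m-unknown system from a larger difference matrix:
    first m-1 rows, first m columns, plus a normalization row of ones."""
    rows = [list(D[i][:m]) for i in range(m - 1)]
    rows.append([1.0] * m)
    return rows, list(rhs[:m - 1]) + [total_load]


def largest_feasible(D, rhs, total_load, tol=1e-9):
    """Try m = n, n-1, ..., 1 and return the first nonnegative solution."""
    for m in range(len(D), 0, -1):
        A, b = reduce_system(D, rhs, m, total_load)
        x = solve_linear(A, b)
        if all(v >= -tol for v in x):
            return m, x
    raise ValueError("no feasible distribution found")


# Hypothetical toy system (not taken from the paper): with three unknowns
# one load fraction comes out negative, so the search falls back to two.
D = [[1, -1, 0], [0, 1, -1], [1, 1, 1]]
rhs = [5, 5, 6]
m, fractions = largest_feasible(D, rhs, total_load=6)
print(m, fractions)
```

Reusing the leading block of the already generated matrix avoids rebuilding the path equations for every candidate size, which is the reuse property the text points out.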
A very important observation to make at this juncture is that, in general, it may be possible that an optimal solution may not exist for . This is mainly due to the combined effect of the overhead components becoming extremely large, resulting in infeasible values. Thus, the best possible solution may not be due to partitioning the load into installments.
V. SUBOPTIMAL PROCESSING TIME SOLUTIONS
Based on the optimal processing time obtained in the previous section, we first propose two algorithms, namely PIA and IIA, that generate integer-valued load fractions for the individual processors. Both algorithms use the optimal load distribution as the initial schedule, which means that the number of communication nodes used in the following algorithms remains the same as that of the optimal solution. Let us denote this optimal solution as . We present two different types of algorithms for carrying out integer approximations.
A. Algorithm PIA
In this algorithm, referred to as PIA (Processor based Integer Approximation), we carry out the integer approximation at each installment for every processor starting from its first installment to its last installment. In doing so, we accumulate the "residue" or the "carry" generated in each iteration and propagate it till node is reached. In other words, an excess load (as defined in the algorithm) at an installment at the processor is carried to the same processor in the immediate next installment. For the ease of developing this method, we now adopt the following notations. Let each processor be assigned number of installments, where , such that . Then, we denote the load fraction assigned to processor in th installment, as . Note that these are obtained from the optimal schedule generated after solving the difference matrix described in the previous section. In the above algorithm the is an accumulated carry at any ( )th iteration. At this juncture, an interesting point to verify is whether the entire load gets processed at the completion of the algorithm or not. We show below that the algorithm indeed guarantees this aspect.
Lemma 3: At the end of any iteration , .
Proof: We shall prove this by mathematical induction. When , the above assertion holds trivially. As an induction hypothesis step, suppose it holds at ( ). Further, let us denote the at th iteration as , and let . From the algorithm, the expression for the sum at step ( ) is given by and by the above hypothesis, we have, . Now, we have the following two cases to consider. . Since the first term on the LHS and the term on the RHS are integers, must also be an integer. From Lemma 3, we observe that as the only integer between and , thus contradicting our assumption. Hence, the proof.
Q.E.D. This lemma justifies the fact that there is no load left without processing.
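The carry mechanism can be illustrated with a short sketch. The algorithm listing itself is not reproduced in this text, so what follows is only an assumed form of PIA: each processor's installments are visited in order, each fraction plus the running carry is rounded to the nearest integer (the rounding rule is an assumption), and the residue is carried forward, first to that processor's next installment and then on to the next processor. The function name `pia_round` and the assignment of the Example 2 fractions to processors (node order p1, p2, p3, p1, p2) are likewise assumptions.

```python
import math


def pia_round(per_processor):
    """Round each processor's installments in turn, carrying the residue.

    per_processor[i] is the list of optimal (real-valued) load fractions
    assigned to processor i, in installment order.
    """
    carry = 0.0
    rounded = []
    for installments in per_processor:
        row = []
        for alpha in installments:
            x = alpha + carry
            n = math.floor(x + 0.5)   # assumed rule: round to nearest
            row.append(n)
            carry = x - n             # residue passed to the next step
        rounded.append(row)
    return rounded, carry


# Example 2 fractions, grouped per processor under the assumed node order.
optimal = [[491.4592, 363.3011], [428.6214, 263.2588], [501.3595]]
integer_loads, final_carry = pia_round(optimal)
total = sum(sum(row) for row in integer_loads)
print(integer_loads, total, final_carry)
```

Because the optimal fractions sum to an integer total load, the final carry vanishes up to floating-point error and the integer loads add up to the same total, so no load is left unprocessed.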
Lemma 5: For any . Proof: From Lemma 3, and Subtracting first inequality from the second, we prove the lemma.
Q.E.D. We use this result in the following theorem. Now, for a homogeneous system, we shall show that the above algorithm yields a solution that is no greater than the OS by an amount ( ). Theorem 2: Consider a homogeneous system of processors. Let be the under infinite divisibility assumption that gives the optimal solution, denoted as . Let be the schedule generated by Algorithm PIA and let the processing time be . Then, . Proof: For and schedules, the difference in the processing times can be obtained from Fig. 4 as, (20) since, by the definition of . Using Lemma 3 and Lemma 5, the result follows.
Q.E.D.
It may be noted that the bound obtained in this case depends only on the system parameters and and on the number of processors .
B. Algorithm IIA
In this algorithm, referred to as IIA (Installment based Integer Approximation), we carry out the integer approximation starting from node 1 to (see Fig. 4) in the order in which these load fractions are distributed. In other words, the "excess" load at a particular processor is carried to its successor in the order of the load distribution. Here too, we assume that the schedule under the infinite divisibility assumption is our initial distribution of the load fractions. Since this schedule is known a priori, we denote the number of processors involved in th installment, as . We now present the algorithm. ; } while ( );
In the above algorithm the is an accumulated carry at th iteration. It may be noted that Lemmas 3 and 4 hold for this case as well.
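A sketch of the installment-ordered variant is given here for comparison. As with PIA, the listing is not recoverable from this text, so the code is an assumed form only: fractions are visited in the order in which they are distributed (node 1 onward), rounding to the nearest integer is an assumption, and the name `iia_round` is hypothetical. The input is the Example 2 distribution, taken to be listed in node order.

```python
import math


def iia_round(fractions_in_node_order):
    """Round load fractions in distribution order, passing each residue
    on to the next node that receives load."""
    carry = 0.0
    integer_loads = []
    for alpha in fractions_in_node_order:
        x = alpha + carry
        n = math.floor(x + 0.5)   # assumed rule: round to nearest
        integer_loads.append(n)
        carry = x - n
    return integer_loads, carry


optimal = [491.4592, 428.6214, 501.3595, 363.3011, 263.2588]
integer_loads, final_carry = iia_round(optimal)

# Deviation of each integer load from its optimal fraction.
deviations = [n - a for n, a in zip(integer_loads, optimal)]
print(integer_loads, sum(integer_loads))
print(max(abs(d) for d in deviations))
```

Under this rounding rule the carry always lies in [-0.5, 0.5), so each integer load differs from its optimal fraction by less than one load unit, while the total is preserved. Only the traversal order differs from the processor-based sketch, which is why the two variants can assign different integers to the same node.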
Lemma 6: . Proof: The result follows by applying Lemma 5, and noting that each . Q.E.D. Now, for a homogeneous system, we shall show that the above algorithm yields a solution that is no greater than the OS by an amount ( ). Theorem 3: Consider a homogeneous system of processors. Let be the under infinite divisibility assumption that gives the optimal solution, denoted as . Let be the schedule generated by Algorithm IIA and let the processing time be . Then, . Proof: The result follows immediately by a line of argument similar to that of Theorem 2, except that we use Lemmas 3 and 6 in this case. We omit the details.
Q.E.D. It may be noted that this bound depends only on the parameters , and and not on . The following illustrative example describes all the above results.
Example 3: Consider the first part of Example 2. The optimal processing time is given by, . We apply the algorithm PIA to obtain the integer load distribution as , and the processing time of the entire load is given by, , contributed by the finish time path (by processor ). By Theorem 2, we see that . When we apply the algorithm IIA, we obtain the integer load distribution as, , and the processing time of the entire load is given by, , contributed by the finish time path (by processor ). By Theorem 3, we see that . Also, the results of all the lemmas can be immediately verified.
The following theorem suggests a way to choose between the above algorithms. Theorem 4: If , then , where and are the suboptimal solutions generated by the respective algorithms.
Proof: The proof follows immediately by obtaining the difference between the suboptimal solutions generated by the algorithms given by Theorems 2 and 3 and using the condition given in the theorem.
Q.E.D. Thus, in our Example 3 above, we see that . Hence, for this system, algorithm PIA gives a tighter bound on the time performance.
VI. DISCUSSION OF THE RESULTS
The research contributions in this paper are significant to the domain of DLT. This paper attempts to bridge the gap between theory and practice by tuning the model and demonstrating the performance of the strategies under realistic situations. Since the effect of the overhead components is embedded in the closed-form solution for the processing time, the choice of the number of processors to be utilized under heavily and lightly loaded network conditions can be made immediately. Fig. 3 is a good example of this fact. Secondly, any practical implementation based on the strategies proposed in the DLT literature demands quantum load dissemination in terms of either number of rows or columns, and hence the need to generate an integer load distribution arises. It may be noted that the results of this paper remain unaffected whether column-wise or row-wise striping is carried out in the load distribution process. In Example 2, we carry out row striping in partitioning the load. In Section III, we proposed an integer approximation algorithm to generate these integer-valued loads in a systematic way such that the suboptimal solution recommended by this algorithm can be well quantified. Through this procedure, we have obtained an ultimate performance bound on the solution generated by this algorithm.
One of the useful contributions is the directed flow graph representation. With the aid of such a representation, for instance, a multi-installment strategy [20], [21], [7] discussed in the literature can easily be constructed and the recursive equations can be generated by simply writing the path equations. We believe that this representation is elegant in capturing all the features of the timing diagram and also serves as a generalized representation to construct and analyze complex scheduling strategies in this domain.
We also tackle a closely related problem that is highly useful for a specific class of applications in the image processing and computer vision areas. The results also hold whenever any restriction on the level of partitioning is imposed. For instance, when the granularity of the data becomes an important factor in processing the data on a distributed system, the approach presented in this paper becomes the natural choice and provides a systematic way to construct an optimal schedule. In reality, the selection of an appropriate grain size (granularity) at runtime is often an important factor, as the effective speed of the network and the processors may vary due to other loading effects. As a consequence, the model must be tuned to take care of this size-dependent behavior. Since, from a practical perspective, generating integer load fractions is mandatory, we propose two different algorithms to generate integer load fractions, namely IIA and PIA. The algorithms are shown to converge and to guarantee that the time performance is well within acceptable limits. The choice between these algorithms, as shown in Theorem 4, is very important for time-critical applications, as the suboptimal solution is influenced by the speed parameters of the system and the allowed number of partitions. The integer approximation algorithm proposed in Section III has the complexity . The complexity of the algorithms PIA and IIA is . As a final remark, we would like to point out the following. In reality, the quantities and are random, and we can only discuss the performance with respect to the average values of these quantities. The average values can be either estimated or obtained as a result of an empirical study. If we consider the average values for these quantities in Theorems 2-4, then these results are much more meaningful in giving an estimate of the suboptimal processing time solution under average performance.
VII. CONCLUSIONS
The problem of scheduling divisible loads on bus networks is considered. The research contributions in this paper have attempted to bring the theory closer to practice by employing a realistic model and analyzing the performance under practical conditions. This paper has the following major contributions. A directed flow graph representation was introduced to describe the load distribution process. This flow graph representation is elegant in describing all the features of the load distribution strategy and can readily be altered depending upon the strategy. The mathematical model considered in this paper included all the overhead components that penalize the time performance in determining the optimal solution. Using this model, we have derived a closed-form solution for the optimal processing time. The closed-form solution clearly reflects the influence of the overheads and gives rise to an important necessary and sufficient condition for the existence of the optimal processing time by employing all the processors in the network. We have demonstrated the influence of these overhead factors on the time performance and its importance in the choice of the number of processors that can be utilized in the network. We extended our study to suit a special class of problems in which partitioning of the data is restricted to a finite number. We have proposed a systematic way to generate the optimal solution.
From a practical perspective, we have proposed different integer approximation algorithms that generate a suboptimal solution from the optimal solution obtained through the earlier analysis. The behavior of these algorithms is analyzed and, for the class of homogeneous networks, we have derived the ultimate performance bounds. These bounds serve as an excellent estimate of the time performance of the system under different network conditions and indicate how far the solution (in practice) lies from the ideal optimal solution. As mentioned in Section VI, these solutions may very well reflect the average time performance when the speed parameters are replaced by their average values.
Extensions to this research can be devoted to analyzing the behavior of the strategies and the integer approximation algorithms on different architectures. It would be interesting to derive similar performance bounds for these architectures. Also, the effect of solution back propagation, which is beyond the scope of this paper, can provide a complete understanding of the behavior of the system.
