Abstract: Optimal distribution of divisible loads in bus networks is considered in this paper. The problem of minimizing the processing time is investigated by including all the overhead components that could penalize the performance of the system, in addition to the inherent communication and computation delays. These overheads are considered to be constant additive factors to the respective communication and computation components. A closed-form solution for the processing time is derived and the influence of overheads on the optimal processing time is analyzed. We derive a necessary and sufficient condition for the existence of the optimal processing time. We then study the effect of changing the load distribution sequence on the time performance. Through rigorous analysis, an optimal sequence to distribute the load among the processors is identified, whenever it exists. In case such an optimal sequence fails to exist, we present a greedy algorithm to obtain a suboptimal sequence based on some important properties of the overhead factors. Then, the effect of granularity of the data that is divisible is considered in the analysis for the case of homogeneous networks. An integer approximation algorithm capable of generating integer values of the load fractions in time $O(m)$, where $m$ is the number of processors in the network, is proposed. We then show that the upper bound on the suboptimal solution generated by our algorithm lies within a radius given by the sum of the computation and communication delays. Several numerical examples are presented to illustrate the concepts.
Index Terms: Divisible loads, communication delay, processing time, optimal sequence, bus networks.
INTRODUCTION
Processing time minimization of the jobs arriving at any computer site in a distributed computing system (DCS) is one of the major objectives in the research area of computing and communications. The primary thrust in such studies is to design efficient scheduling algorithms that minimize the total processing time of the jobs [12], [17], [29], [36]. This is achieved by optimally distributing the jobs among the set of processors available in the system. There is a vast amount of literature in the domain of load sharing for indivisible [36] and modularly divisible [26], [37] loads. An excellent compilation of these results can be found in [36]. The domain of scheduling divisible loads in multiprocessor systems is of recent origin (from 1988) [15] and has stimulated considerable interest among researchers in the fields of scheduling and computer networks. The main motivation for research in this area came from the need to process large volumes of data arriving at a distributed intelligent sensor network in military surveillance systems. However, as the theory of divisible load scheduling continued to evolve, a wide spectrum of application domains benefited. Examples include image processing applications using the Hough transform [22], intensive computations in military and space programs involving matrix-vector products of very large sizes [20], [21], computer vision data processing [25], [30], query processing in database systems [1], distributed biomedical image processing [24], and distance learning. Recent works [14], [35] also show the necessity of considering economic models in distributing the load among the processors. One of the primary objectives is to assign optimal fractions of the total load among several sensors/processors such that the entire load is processed in a minimal amount of time. In [9], some specific applications in which such loads can be encountered are discussed.
This domain of research is commonly referred to as divisible load theory (DLT) in the literature.
A divisible load can be divided into any number of fractions, which can be processed independently on the processors, as the fractions bear no mutual causal precedence relationships. The theory of divisible load scheduling [9] involves mathematical modeling of the network parameters, the processor speed parameters, and the sizes of the loads. The relationships between these quantities are assumed to be linear. In the mathematical model adopted, the communication time over a link is assumed to be proportional to the amount of load transferred over that link, and the computation time of a load on any processor is proportional to the amount of load assigned to it. This kind of modeling has been assumed in the literature while analyzing a variety of scheduling problems on multiprocessor systems [1], [20], [21], [25]. Several load distribution strategies have been designed to optimally divide and assign the load fractions to the processors. The research reported in this paper is a contribution to the DLT domain. It considers a bus network architecture with improved, more realistic modeling and shows how this model leads to a richer set of nontrivial solutions that play a crucial role in the design of such divisible load schedulers. We now briefly present our research contributions in this paper.
Research Contributions
This paper proposes a revised model for the theory of divisible load scheduling and demonstrates its implications in scheduling divisible loads. We consider a bus network architecture for the load distribution analysis. In any realistic data communication and computing environment, one cannot simply circumvent the overhead delays associated with the processing. In the case of the communication process, these overheads ($\theta_{cm}$) manifest in more than one form: protocol processing delays, especially when the data to be packed is very large, delays due to the unavailability of certain internal and external communication resources, queuing delays at intermediate sites, etc. ([1], [5]). As mentioned in [5], depending on the system and its function, one or more of the above delays may be neglected. In fact, all the earlier studies in divisible load theory [9] considered only the transmission time of the load in designing the load distribution strategies. Similarly, during the computation phase of the load, these overheads ($\theta_{cp}$) manifest as layered protocol delays, delays while depacketizing and extracting the data, processor initialization, etc. In practice, these overheads are almost constant quantities [1] and naturally form additive components to the respective processing components (communication and computation times). This model was considered earlier in the literature by some researchers [1], [11], [13], [16], [18]; however, their studies were focused on some specific cases. In [1], these additive overheads are categorized and considered for query processing and image processing applications. In [11], recursive equations for several network architectures were presented, but the analysis was not extended to identify certain nontrivial properties that are extremely important while conducting trade-off studies. Another related work using this model can be found in [25].
This study is in the domain of static scheduling of computer vision data on distributed memory multiprocessors.
Our contributions in this paper are the following. With the above-mentioned linear model with overheads, we first derive closed-form solutions for the optimal processing time, demonstrate the effect of the overheads on the processing time, and identify certain important properties. Second, we show that sequencing, an important property so far exhibited only by heterogeneous tree networks [6], [9], arises in bus networks too. By changing the order in which the load is distributed among the processors, we observe that the processing time is affected. In the DLT literature, it has been proven for bus networks that the processing time is independent of the sequence in which the load is distributed among the processors. However, with the inclusion of overheads, the results of this paper demonstrate the effect of sequencing for bus networks, too. Using the concept of sequencing, we identify an optimal sequence with a special property, whenever such a sequence exists. To do this, we identify and prove a number of useful and interesting properties exhibited by the combined effect of the overhead components. In case such an optimal sequence does not exist, we present a greedy strategy to identify a sequence that generates a suboptimal solution that suits most practical situations. Finally, for the case of the homogeneous system, we propose an integer approximation algorithm that generates integer load fractions. Further, the resulting processing time is shown to lie within a radius of the optimal solution that is not more than the sum of the computation and communication times.
Related Work
Applications of DLT can be found in [9], [20], [22], [30]. The mathematical modeling we employ was studied in the literature by different researchers in different contexts [1], [11], [25]. In [15], a heterogeneous linear network of processors was considered and a computational algorithm was developed to obtain the optimal load fractions by assuming that all the processors stop computing at the same time instant. In fact, this has been shown to be a necessary and sufficient condition for obtaining the optimal processing time in linear networks [28] by using the concept of processor equivalence. An analytic proof of this assumption in bus networks is presented in [31], [34]. However, it has been rigorously shown that in the case of heterogeneous single-level tree networks, this condition is true only in a restricted sense [7], [9]. A similar proof was also presented in [15] for linear networks. In the case of single-level tree networks, a closed-form expression for the processing time has been derived [6], [23] and an algorithm was proposed to obtain an optimal tree configuration for a special case. The concepts of optimal sequencing and optimal network arrangement were introduced in [6]. For homogeneous linear networks, a closed-form expression for the processing time was presented [27], and asymptotic solutions for tree, bus, and linear networks have been rigorously derived [2], [19].
To study the ultimate performance limits in the case of single-level tree networks, a multi-installment load distribution strategy was introduced and bounds on the performance limits have been derived [8]. Essentially, in this strategy, the processing load is distributed in more than one installment. Closed-form solutions were derived for homogeneous single-level tree networks. Subsequently, this multi-installment strategy was applied to linear networks, and closed-form solutions for the processing time for homogeneous networks were presented in [7], [9]. The study in [11] analyzes the load distribution problem on a variety of computer networks.
All these studies focus on the situation in which only one load is available for processing. This assumption was relaxed in [32] and an efficient algorithm was proposed for multiple jobs in bus networks. Very recently, the fault-tolerant aspects of the system have been studied on bus networks [3]. From a practical perspective, a more accurate analysis has been presented for bus networks through efficient modeling of links and processors as time-varying quantities rather than as constant parameters [33]. Scheduling of processing loads that consist of both arbitrarily divisible and indivisible components was also considered in the literature and a heuristic algorithm was proposed [4].
The studies so far assumed that all the processors in the network are available from the instant at which the load arrives at the originator. In [10], this assumption has been relaxed and an efficient algorithm has been designed to take into account the processor release times at the time of load origination. A recent study on arbitrary tree networks [1] presents a rigorous analytical treatment in deriving optimal sequences when both the load distribution and results collection phases are considered. For the first time in DLT, the objective of minimization using economic models was proposed in [14], [35]. Here, the objective is to minimize the monetary costs involved in the process of divisible load processing. Also, this study presents a good discussion on the trade-off relationships between the time and monetary cost parameters.
The organization of the paper is as follows: Section 2 presents the mathematical modeling and relevant definitions and introduces the required notations. Section 3 presents the closed-form solutions and the concept of equivalent processors. Here, we demonstrate the processing time behavior under the influence of the overheads. Section 4 presents the concept of sequencing and identifies an optimal sequence. Section 5 presents an integer approximation algorithm and derives an upper bound on its time performance rigorously. Section 6 presents a detailed discussion on the results obtained and suggests some future extensions to this research.
MATHEMATICAL MODELING, DEFINITIONS, AND SOME REMARKS
We consider a bus network architecture as shown in Fig. 1. The network may not have a dedicated control processor or a bus controller unit (BCU) [31], [34] to distribute the load among the processors. In this paper, we shall present a rigorous analysis of the time performance of the system for the case when the network has a dedicated BCU to distribute the entire load. It is worth mentioning at this stage that a bus network is equivalent to a single-level tree network or a star network when all the links have identical speeds [9]. Thus, all the research contributions in this paper also hold for this special class of single-level tree networks.
Load Distribution Strategy and Some Definitions
The load distribution strategy is described as follows. The divisible load is assumed to originate at the BCU. The BCU divides the load into $m$ load fractions, denoted as $\alpha_1, \ldots, \alpha_m$, and distributes them among all the $m$ processors in a particular sequence, say $p_1, \ldots, p_m$, one after the other. Upon receiving their respective load fractions, the processors start computing them. The problem is then to determine the optimal sizes of the load fractions assigned to the processors such that the total processing time is a minimum. We now introduce some useful notations, definitions, and terminology that will be used throughout the paper.
$\alpha_i$: Fraction of the load assigned to processor $p_i$.

We shall denote the product $w_i T_{cp}$ as $E_i$ and $z T_{cm}$ as $C$, respectively, throughout the paper. Thus, using the above notations, we see that the communication time of a fraction of the load $\alpha_i$ is given by $\alpha_i C + \theta_{cm}$ and the computation time of this load fraction by $p_i$ is given by $\alpha_i E_i + \theta_{cp}$. The load distribution process is described by means of a timing diagram, as shown in Fig. 2. In this paper, we shall show that the presence of such overheads, which are inherently present in any realistic system, will drastically affect the time performance and may lead to different design decisions. This is crucial whenever the optimality of the solution and related trade-off studies are important in the design. A detailed discussion is presented in Section 6. We shall now define the following:

It has been rigorously proven in the literature [9] that for optimal processing time, all the processors involved in the computation of the load must stop computing at the same time instant. Here, too, we shall use this optimality criterion to analyze the processing time performance of the system.
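As a hedged illustration of the cost model above, the following Python sketch computes the instant at which each processor finishes for a given (not necessarily optimal) set of load fractions. It assumes that the BCU sends the fractions back to back over the bus and that a processor starts computing as soon as its own fraction has arrived. The function name `finish_times` and the sample values are hypothetical and are not taken from the paper.

```python
from typing import List


def finish_times(alpha: List[float], E: List[float], C: float,
                 theta_cm: float, theta_cp: float) -> List[float]:
    """Finish instant of each processor when fractions are sent one after the other."""
    bus_free_at = 0.0  # instant at which the bus completes the current transfer
    finish = []
    for a_i, E_i in zip(alpha, E):
        # Communication of fraction alpha_i takes alpha_i * C + theta_cm.
        bus_free_at += a_i * C + theta_cm
        # Computation of fraction alpha_i on p_i takes alpha_i * E_i + theta_cp.
        finish.append(bus_free_at + a_i * E_i + theta_cp)
    return finish


# Hypothetical two-processor example with an equal (non-optimal) split.
times = finish_times([0.5, 0.5], [1.0, 1.0], C=0.2, theta_cm=0.01, theta_cp=0.02)
processing_time = max(times)  # the load is done when the last processor stops
```

With an equal split the second processor finishes later than the first, which is exactly the imbalance that the optimality criterion (all processors stop at the same instant) removes.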
CLOSED-FORM SOLUTIONS FOR THE PROCESSING TIME
In this section, we shall derive a closed-form expression for the optimal processing time by assuming that the sequence of load distribution is from $p_1$ to $p_m$ in that order. Throughout this section, this sequence of load distribution will be referred to as a fixed sequence.
Recursive Equations and Optimal Solution
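From Fig. 2, we obtain the following recursive equations

$$\alpha_i = \alpha_{i+1} f_{i+1} + \delta_i, \quad i = 1, \ldots, m-1,$$

where,
$E_{i+1} + C = E_i f_{i+1}$ and $\delta_i = \theta_{cm}/E_i$.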
Substituting (6) in (3), we obtain the individual load fractions. From Fig. 2, we obtain the expression for the processing time as
Using $i = 1$ in (3) and substituting in (9), we obtain
where the denominator term given by (8), $M_1$, and $N_1$ are as defined above. Thus, in the above analysis, we have obtained an optimal solution involving $m$ processors by solving the set of recursive equations, as shown above. Following the above steps, we can derive the optimal processing time for a system of $m$ processors with a BCU without overheads ($\theta_{cm} = \theta_{cp} = 0$). The optimal processing time in this case is given by
where $M_1$ and the denominator term are as defined above. It is worth mentioning at this juncture that, given an $m$-processor system with all the overheads included, an optimal solution need not exist when one attempts to utilize all $m$ processors. This behavior can be seen from Fig. 3a. Here, we observe that as we increase the number of processors, the processing time decreases. For different link speeds, the processing time varies in a similar pattern. Fig. 3b shows the influence of the quantity given by (7), which we will refer to as the overhead factor. From this figure, we see that the overhead factor also increases as $m$ increases, for a particular value of $C$, and that as $C$ increases, the slope of the overhead factor also increases significantly. Also, we observe that for the speed and overhead parameters chosen, the maximum number of processors, $m^*$, that can be utilized with the given sequence of load distribution varies for different link speeds. Beyond this $m^*$, an optimal solution ceases to exist. This is because the value of the overhead factor becomes greater than one, and hence, there will not be any gain in the time performance even if we attempt to utilize more processors beyond $m^*$. From Fig. 3a, it may be noted that for the system with slower link speeds, $m^*$ is smaller. This is because the larger communication delay affects the overhead factor more adversely, as can be seen from Fig. 3b, too. For instance, from Fig. 3a we observe that when the link speed parameter decreases from 0.2 to 0.1 (links become faster), $m^*$ increases by a significant amount and allows us to utilize four more processors (from 10 to 14) to gain a 41.5 percent decrease in processing time. Thus, as long as the value of the overhead factor is less than 1, we can utilize all the $m$ processors to process the entire load in a minimum amount of time with this fixed sequence.
Thus, we see that a necessary and sufficient condition for the existence of an optimal processing time using all the $m$ processors in a specific order is that the overhead factor given by (7) be less than one; we refer to this as Condition (12).

What we observe here is a restricted monotonic nature of the processing time behavior, as opposed to the case when no overheads were considered in the problem formulation [9]. Given a fixed sequence, whenever an optimal solution using a $k$-processor system $p_1, \ldots, p_k$ ceases to exist, we may utilize a maximal subset of $k'$, $k' < k$, processors in the same sequence, involving processors $p_1, \ldots, p_{k'}$, to obtain the optimal processing time. Of course, we again have to check Condition (12) for this set of $k'$ processors. Another observation that one can make from Fig. 3 concerns the rate at which the processing time decreases. Beyond $m = 7$ or $m = 8$, we see that the rate of decrease of the processing time is not significant, and hence one can utilize 7 or 8 processors without having to utilize all the processors up to $m^* = 14$. This in a way reduces the overhead processing by the system considerably, as can be seen from the increase in the values of the overhead factor for $m = 7$ and $m = 8$, respectively.
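The existence condition and the closed-form solution discussed above can be sketched in Python. This is a minimal sketch under stated assumptions, not a transcription of (1)-(12): it assumes that equating the finish times of adjacent processors in the timing diagram gives $\alpha_i = \alpha_{i+1} f_{i+1} + \delta_i$ with $f_{i+1} = (C + E_{i+1})/E_i$ and $\delta_i = \theta_{cm}/E_i$, writes each fraction as $\alpha_i = M_i \alpha_m + N_i$, and takes the overhead factor to be $\sum_i N_i$ and the denominator term to be $\sum_i M_i$, so that $\alpha_m = (1 - \text{overhead factor})/\text{denominator}$. The names `solve_fixed_sequence` and `largest_usable_prefix` and the sample parameter values are illustrative only.

```python
from typing import List, Optional, Tuple


def solve_fixed_sequence(E: List[float], C: float, theta_cm: float, theta_cp: float
                         ) -> Tuple[float, Optional[List[float]], Optional[float]]:
    """Return (overhead factor, load fractions, processing time) for processors
    served in the order given by E; fractions and time are None if no solution exists."""
    m = len(E)
    M = [1.0] * m  # alpha_i = M[i] * alpha_m + N[i]; the last entry is alpha_m itself
    N = [0.0] * m
    for i in range(m - 2, -1, -1):
        f_next = (C + E[i + 1]) / E[i]               # f_{i+1}
        M[i] = M[i + 1] * f_next
        N[i] = N[i + 1] * f_next + theta_cm / E[i]   # delta_i = theta_cm / E_i
    overhead_factor, denominator = sum(N), sum(M)
    if overhead_factor >= 1.0:
        return overhead_factor, None, None           # last fraction would not be positive
    alpha_m = (1.0 - overhead_factor) / denominator
    alpha = [M_i * alpha_m + N_i for M_i, N_i in zip(M, N)]
    T = alpha[0] * (C + E[0]) + theta_cm + theta_cp  # finish instant of the first processor
    return overhead_factor, alpha, T


def largest_usable_prefix(E: List[float], C: float, theta_cm: float) -> int:
    """Largest k such that the first k processors of the sequence admit a solution."""
    for k in range(len(E), 0, -1):
        if solve_fixed_sequence(E[:k], C, theta_cm, 0.0)[0] < 1.0:
            return k
    return 0


# Hypothetical three-processor system.
X, alpha, T = solve_fixed_sequence([1.0, 1.5, 2.0], C=0.2, theta_cm=0.05, theta_cp=0.03)
# Hypothetical system in which only the first two processors can be used.
k_max = largest_usable_prefix([0.1, 0.2, 0.3, 0.4], C=0.5, theta_cm=0.03)
```

The prefix search mirrors the remark above: when the full sequence violates the condition, a shorter prefix of the same sequence is tried until the overhead factor drops below one.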
Concept of Equivalent Processor
Now, we shall introduce the concept of an equivalent processor, which will be useful in proving a number of important results in the subsequent sections. This concept has been used in the theory of scheduling divisible loads in a variety of situations [1], [7], [28]. We shall first derive a closed-form expression for the equivalent processor from the timing diagram shown in Fig. 4. Here, the basic idea is to collapse a set of processors into a single equivalent processor so that the analysis of load distribution strategies is easier to perform.
The above lemma demonstrates the fact that whenever an $m$-processor optimal solution exists, the optimal solution for any $k$-processor system, $k < m$, also exists when the respective systems follow the same sequence (fixed sequence) of load distribution. Using the result of Lemma 1, we shall now demonstrate the monotonic behavior of the processing time for an $m$-processor fixed sequence.
The above theorem signifies the fact that whenever the value of the overhead factor is less than 1 for an $m$-processor system with a given sequence, the optimal processing time of the $m$-processor system is less than the optimal processing time of the $(m-1)$-processor system following the same sequence. Thus, it clearly demonstrates the monotonic behavior of the processing time.
Homogeneous System
As a special case, for a homogeneous system ($w_i = w$, and hence, $E_i = E$, $i = 1, \ldots, m$), the overhead factor given by (7) can be written as
and the value of the denominator term in (8) can be written as
where $f = (E + C)/E > 1$ and $\delta = \theta_{cm}/E$. It may be noted from (23) that the overhead factor varies in a linear fashion with respect to $\delta$. Substituting (23) and (24) in (10), we obtain
(25)
In the homogeneous case, we see that the necessary and sufficient condition for the existence of an m-processor optimal processing time given by (12) can be simplified using (23) to yield the following condition:
Figs. 5 and 6 show the behavior of the processing time with respect to the link and processor speed parameter variations. Fig. 5a shows the behavior of the processing time for different link speeds with respect to the number of processors utilized. Similar to the heterogeneous case, here too we note that beyond $m^* = 15, 11, 8$, respectively, the optimal processing time ceases to exist, as the corresponding value of the overhead factor exceeds 1, as shown in Fig. 5a. We observe that as $C$ increases, $m^*$ decreases.
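For the homogeneous case, the sums that define the overhead factor and the denominator term collapse to geometric series. The Python sketch below is a reconstruction under assumptions rather than a quotation of (23) and (24): it assumes the recursion $\alpha_i = \alpha_{i+1} f + \delta$, from which the overhead factor is $\delta (f^m - m f + m - 1)/(f - 1)^2$ (linear in $\delta$, as noted above) and the denominator term is $(f^m - 1)/(f - 1)$. It checks the closed form against direct summation and finds the largest usable number of processors. All function names and parameter values are hypothetical.

```python
def overhead_factor_sum(m: int, f: float, delta: float) -> float:
    """Direct evaluation: sum over i = 1..m-1 and k = i..m-1 of delta * f**(k - i)."""
    return sum(delta * f ** (k - i) for i in range(1, m) for k in range(i, m))


def overhead_factor_closed(m: int, f: float, delta: float) -> float:
    """Assumed closed form of the homogeneous overhead factor."""
    return delta * (f ** m - m * f + m - 1) / (f - 1) ** 2


def denominator_closed(m: int, f: float) -> float:
    """Assumed closed form of the denominator term: 1 + f + ... + f**(m-1)."""
    return (f ** m - 1) / (f - 1)


def max_processors(f: float, delta: float, m_max: int = 1000) -> int:
    """Largest m (up to m_max) whose overhead factor stays below one."""
    m = 1
    while m < m_max and overhead_factor_closed(m + 1, f, delta) < 1.0:
        m += 1
    return m


# Hypothetical homogeneous system: E = 1.0, C = 0.2, theta_cm = 0.01.
E, C, theta_cm = 1.0, 0.2, 0.01
f, delta = (E + C) / E, theta_cm / E
m_star = max_processors(f, delta)
```

Because the overhead factor grows roughly like $f^m$, a larger $C$ (a larger $f$) makes it reach one after fewer processors.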
PROCESSING TIME MINIMIZATION USING THE CONCEPT OF SEQUENCING
In this section, we shall first demonstrate through a numerical example that the time performance of the system can be altered by changing the sequence of load distribution. This behavior was never exhibited by this system when the modeling did not take into account the overheads [9], [31]. However, it is due to the presence of the overheads, which are nonzero quantities, that such a behavior is observed. Clearly, it would be interesting to determine which sequence would be optimal, if one exists among all possible load distribution sequences. An interesting point to note here is that, in the case of single-level tree networks with identical link speeds, the concept of sequencing has no effect on the time performance of the system, as proven in [7], [15]. With the realistic modeling considered in this paper, we will show that the above claim is no longer valid. We shall demonstrate all these observations in the following sections.

Fig. 5. Behavior of the processing time and the overhead factor for different link speeds (homogeneous system).

VEERAVALLI ET AL.: ON THE INFLUENCE OF START-UP COSTS IN SCHEDULING DIVISIBLE LOADS ON BUS NETWORKS 1295
Motivating Example
Example 1. Consider a bus network with $m = 5$ processors, with the following speed parameters: $C = 0.200$, $\theta_{cm} = 0.010$, $\theta_{cp} = 0.010$, $E_1 = 0.1$, $E_2 = 0.2$, $E_3 = 2.1$, $E_4 = 3.7$, $E_5 = 2.9$.

We first distribute the load in the order $p_1, p_2, p_3, p_4, p_5$, where the load is first assigned to $p_1$, then to $p_2$, and so on until $p_5$. Using the closed-form solution (10), we obtain the optimal processing time as $T(5) = 0.262$, where the optimal load distribution is given by $\alpha = (0.808, 0.177, 0.011, 0.003, 0.001)$. Alternatively, we may distribute the load to the processors in the order $p_5, p_4, p_3, p_2, p_1$.

That is, the load is first assigned to $p_5$, then to $p_4$, and so on until $p_1$. Again using the closed-form solution (10), we obtain the optimal finish time for this sequence of load distribution as $T'(5) = 0.288$, where the corresponding load distribution is given by $\alpha' = (0.086, 0.062, 0.095, 0.474, 0.283)$. Thus, we observe that the processing time depends on the sequence, or order, in which the load is distributed among the processors. Also, note that $T(5) < T'(5)$.
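The two processing times in Example 1 can be checked numerically. The sketch below assumes the timing model stated earlier (communication time $\alpha_i C + \theta_{cm}$, computation time $\alpha_i E_i + \theta_{cp}$, all processors finishing at the same instant) and solves it by forward substitution, writing $\alpha_i = a_i \alpha_1 - b_i$. The helper `optimal_schedule` and the coefficient names `a` and `b` are hypothetical.

```python
from typing import List, Tuple


def optimal_schedule(E: List[float], C: float, theta_cm: float,
                     theta_cp: float) -> Tuple[List[float], float]:
    """Load fractions and processing time for processors served in the order of E.

    Equal finish times of adjacent processors give
    alpha_{i+1} = (alpha_i * E_i - theta_cm) / (C + E_{i+1}).
    """
    a, b = [1.0], [0.0]  # alpha_i = a[i] * alpha_1 - b[i]
    for i in range(1, len(E)):
        denom = C + E[i]
        a.append(a[-1] * E[i - 1] / denom)
        b.append((b[-1] * E[i - 1] + theta_cm) / denom)
    alpha_1 = (1.0 + sum(b)) / sum(a)  # from the normalization sum(alpha) = 1
    alpha = [a_i * alpha_1 - b_i for a_i, b_i in zip(a, b)]
    T = alpha[0] * (C + E[0]) + theta_cm + theta_cp
    return alpha, T


# Parameters of Example 1, in the order p1..p5 and in the reverse order p5..p1.
E_forward = [0.1, 0.2, 2.1, 3.7, 2.9]
alpha_fwd, T_fwd = optimal_schedule(E_forward, C=0.2, theta_cm=0.01, theta_cp=0.01)
alpha_rev, T_rev = optimal_schedule(E_forward[::-1], C=0.2, theta_cm=0.01, theta_cp=0.01)
```

Under these assumptions the sketch gives processing times of about 0.262 and 0.288 for the two orders, in agreement with the values quoted in the example.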
Thus, we see that when the system has $m$ processors, $m!$ different load distribution sequences are possible.
However, from the results of the previous section, we note that not all $m$-processor sequences need to exist. This is due to the fact that Condition (12) need not hold for every sequence. For the purpose of notational convenience, we introduce the following. Note that this change of notation is made to include the influence of the processor sequence on the processing time and for ease of analysis. It may be noted that the expressions for the following quantities are as derived in Section 3.
1. Let $\sigma_j(k)$ indicate the $j$th $k$-processor sequence in some specified order of load distribution. We represent such a sequence as $\sigma_j(k) = (p_1, p_2, \ldots, p_k)$, where the first load fraction $\alpha_1$ is assigned to $p_1$, the second load fraction $\alpha_2$ is assigned to $p_2$, $\ldots$, and the last portion of the load is assigned to the $k$th processor. Thus, the notation $\sigma_j(k)$ means that we have $k$ processors, implying that $k!$ load distribution sequences are possible, and we denote a sequence by the index $j$. Note that in each sequence, without loss of generality, we denote the processor which receives the load first as $p_1$, the processor which receives the load second as $p_2$, and so on.
We shall refer to the above sequence $\sigma_j(m)$ as a fast sequence.
3. Let $T(k, \sigma_j(k))$ denote the optimal processing time for the sequence $\sigma_j(k)$, where $\sigma_j(k)$, a member of the set of all $k$-processor sequences, is as defined above.
4. The overhead factor with $k$ processors involved in processing is likewise indexed by the sequence $\sigma_j(k)$, a member of the set of all $k$-processor sequences.
The expression for $T(k, \sigma_j(k))$ is given by (10). The expression for the overhead factor of $\sigma_j(k)$ is given by (7), and the necessary and sufficient condition for the existence of an optimal solution with $k$ processors is given by (12) using $m = k$. Also, we denote $\delta_{i,j} = \theta_{cm}/E_j$ and $f_{i,j} = (C + E_j)/E_{j-1}$ in the $i$th $m$-processor sequence.
In the subsequent sections, we present some important properties of sequencing which will be used to identify an optimal sequence with a special property, if it exists in the set of all sequences.
Optimal Sequencing
In this section, we shall first prove some important intermediate results that are useful in identifying an optimal sequence. The proofs of the lemmas and the theorems can be found in the appendix.
denote the new sequence obtained by swapping $p_k$ and $p_{k+1}$.

The lemma signifies the fact that appending a new processor to an $m$-processor pool in a particular sequence does not change the overhead factor gap between two $m$-processor sequences that differ only in two adjacent, swapped processor positions. This is an important result which will be used in our proof of the optimal sequencing theorem later.
Lemma 3. Consider the sequences $\sigma_1(m)$ and $\sigma_2(m)$ in Lemma 2. Let $m \geq 2$ and $E_k < E_{k+1}$, i.e., the processor $p_k$ is faster than processor $p_{k+1}$. Then, the overhead factor of $\sigma_1(m)$ is strictly greater than the overhead factor of $\sigma_2(m)$.

The lemma signifies the fact that when a pair of adjacent processors arranged with the faster processor first ($E_k < E_{k+1}$) is swapped in a sequence, the resulting sequence has an overhead factor that is strictly less than the overhead factor of the original sequence in which the processors are not swapped.

The lemma shows that, if the fast sequence exists, then the overhead factor for the fast sequence is larger than the overhead factor for all other sequences in the set of all sequences. This important property of the fast sequence will be used in proving the optimal sequence theorem later.

This lemma will serve as an important property to prove the optimal sequence theorem later in this section.
This theorem shows that the optimal processing time using a sequence with a fast pair of processors $p_k$ and $p_{k+1}$ is less than the optimal processing time when the processors $p_k$ and $p_{k+1}$ in that sequence are swapped. This property is critical and will be used to prove the following optimal sequence theorem.
The result of this theorem is an important contribution to this domain of research. The importance of the theorem lies in eliciting the property of sequencing in the case of bus networks. It shows that, given an $m$-processor system, if all the processors can be utilized by following a sequence in which the processor speeds decrease, then such a sequence results in an optimal time performance. A brief discussion on this result is presented in Section 6. Since the presence of the overhead factor affects the processing time, it is possible, in general, that an $m$-processor fast sequence does not exist at all. In the following numerical example, we demonstrate this observation.
Example 2. Consider a system with $m = 5$ processors and the following speed parameters: $C = 0.4$, $\theta_{cp} = 0.005$, $\theta_{cm} = 0.01$, $E_1 = 0.1$, $E_2 = 0.3$, $E_3 = 0.4$, $E_4 = 0.7$, $E_5 = 1.0$. First, we choose the sequence $\sigma_1(5) = (p_1, p_2, p_3, p_4, p_5)$.

Note that $\sigma_1(5)$ is a fast sequence. However, the overhead factor of $\sigma_1(5)$ given by (7) is greater than one, and hence, we observe that a fast sequence with all the five processors does not exist. Now, consider the sequence $\sigma_2(5) = (p_3, p_5, p_4, p_1, p_2)$. It may be verified that the overhead factor of $\sigma_2(5)$ is less than 1, and using (10), we obtain the optimal processing time as $T(5, \sigma_2(5)) = 0.4575$. Instead, we may choose the fast sequence with only $m - 1$ processors, $\sigma_1(4)$, to distribute the load. Using (7), we observe that the overhead factor of $\sigma_1(4)$ is $0.925 < 1$, and obtain the corresponding optimal processing time $T(4, \sigma_1(4))$ from (10).

Therefore, we can draw the conclusion that $T(m, \sigma(m))$, where $\sigma(m)$ is not the fast sequence with $m$ processors, need not be less than $T(m-1, \sigma(m-1))$, where $\sigma(m-1)$ is the fast sequence among the $(m-1)$-processor sequences.
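The comparison made in Example 2 can be reproduced with a short Python sketch. It assumes the same recursion as the timing model ($\alpha_i = \alpha_{i+1} f_{i+1} + \delta_i$) and accumulates the overhead factor and the denominator term while walking the sequence backward; the function name `overhead_factor_and_time` and the helper `in_order` are hypothetical.

```python
from typing import List, Optional, Tuple


def overhead_factor_and_time(E: List[float], C: float, theta_cm: float,
                             theta_cp: float) -> Tuple[float, Optional[float]]:
    """Overhead factor and optimal processing time (None if no solution exists)."""
    M_i, N_i = 1.0, 0.0    # alpha_i = M_i * alpha_m + N_i, starting at i = m
    denominator, overhead = 1.0, 0.0
    for i in range(len(E) - 2, -1, -1):
        f_next = (C + E[i + 1]) / E[i]
        M_i, N_i = M_i * f_next, N_i * f_next + theta_cm / E[i]
        denominator += M_i
        overhead += N_i
    if overhead >= 1.0:
        return overhead, None
    alpha_1 = M_i * (1.0 - overhead) / denominator + N_i
    return overhead, alpha_1 * (C + E[0]) + theta_cm + theta_cp


# Parameters of Example 2.
C, theta_cm, theta_cp = 0.4, 0.01, 0.005
E_of = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.7, 5: 1.0}


def in_order(order: List[int]) -> List[float]:
    """E values listed in the order in which the processors receive load."""
    return [E_of[p] for p in order]


X_fast5, T_fast5 = overhead_factor_and_time(in_order([1, 2, 3, 4, 5]), C, theta_cm, theta_cp)
X_sigma2, T_sigma2 = overhead_factor_and_time(in_order([3, 5, 4, 1, 2]), C, theta_cm, theta_cp)
X_fast4, T_fast4 = overhead_factor_and_time(in_order([1, 2, 3, 4]), C, theta_cm, theta_cp)
```

Under these assumptions the five-processor fast sequence has an overhead factor above one, the four-processor fast sequence has an overhead factor of 0.925, and its processing time is smaller than the 0.4575 obtained with the non-fast five-processor sequence, which is the point the example makes.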
PERFORMANCE BOUNDS USING INTEGER APPROXIMATION TECHNIQUES
As mentioned in Section 1, from a practical perspective, a divisible load, in general, may not be truly arbitrarily divisible. Examples of this scenario are the computation of a large matrix-vector product on parallel and distributed systems [20], [21] and image processing applications [25]. A large matrix or vector may be partitioned for ease of computation. In this case, the arbitrarily divisible property no longer holds, as the load distributed among the processors is expressed as a number of rows or columns and not as a fraction of the rows or columns. Thus, if our model can be tuned to fit this requirement, then any realistic situation can be handled. For instance, in the above-mentioned applications, the load assigned to a processor will usually be a certain number of rows or columns, and hence, we can say that the assigned load is an integral multiple of some fundamental quantity. This fundamental quantity is referred to as the divisibility factor and is defined as the minimum possible granularity of any load fraction that can be assigned to a processor. Further, we assume that the total load contains $L$ units of load. In other words, the total load is $L$ times the divisibility factor, $L$ being a variable and the divisibility factor a constant. Without loss of generality, all the subsequent results assume a divisibility factor of 1. Further, we refer to the optimal solutions obtained in Section 3 as $O$. Also, we continue to use the terminology "fraction" even for integer-valued load assignments, for the sake of preserving the meaning of the "amount of load" assigned to individual processors. In this paper, for the case of homogeneous systems, we now propose an algorithm that generates integer load fractions using $O$ and show that this algorithm guarantees a solution that lies within acceptable limits of time performance when compared with the optimal solution $O$.
Algorithm
It may be recalled that the optimal solution is obtained by assuming that the load is arbitrarily divisible and that all the processors stop computing at the same time. The algorithm presented here restricts the division of the load to integer values. Hence, once the integer approximation is effected, the processors will in general not all stop computing at the same time instant. An integer-valued load distribution could be obtained by simply rounding each real value to the nearest integer. However, such a naive strategy has serious disadvantages: first, it may not guarantee convergence and, second, it may not guarantee a bound on the suboptimal solution that is generated. In the algorithm and in the subsequent analysis, we assume that (26) is satisfied, i.e., the quantity given by (23) for $m$ processors is less than $L$ (as the total load is now $L$ units). Note that in this case $\sum_{i=1}^{m} \alpha_i = L$ with $0 \le \alpha_i \le L$. As a starting point of our algorithm, we take the initial load distribution to be the optimal solution, with load fractions $\alpha_1, \ldots, \alpha_m$.

We first set $\alpha'_1 = \lceil \alpha_1 \rceil$, which is an integer. Let $\Delta_1 = \alpha'_1 - \alpha_1$ denote the difference in the values of the load fractions assigned to $p_1$. Obviously, $\Delta_1 < 1$. Next, we set $\alpha'_2 = \lceil \alpha_2 \rceil$ provided $\Delta_1 + \Delta_2 < 1$. If this condition fails to hold, then we set $\alpha'_2 = \lfloor \alpha_2 \rfloor$. This process is repeated for all the processors and results in a schedule $\alpha' = (\alpha'_1, \ldots, \alpha'_m)$ in which every $\alpha'_i$, $i = 1, \ldots, m$, is an integer.

The central idea of this algorithm lies in approximating the real values of the load fractions given by the optimal solution by integer values, using either the smallest integer greater than the current real value or the greatest integer smaller than the current real value of the load fraction. This process is adopted at every iteration, and the idea is either to "push" the excess load to, or to "accept" the excess load from, the adjacent processors. The algorithm is described below.
In the above algorithm, $sum_i$ represents the accumulated carry in the $i$th iteration, $sum_i = \sum_{j=1}^{i-1} \Delta_j + \Delta_i$, where $\Delta_i = \alpha'_i - \alpha_i$. Example 3 illustrates the above procedure.
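The rounding rule and the accumulated carry described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's listing: the function name `integer_approximation` and the use of exact `Fraction` arithmetic are assumptions made here, and the rule implemented is "round up if the accumulated carry stays below 1, otherwise round down", with the carry defined as the running sum of (integer value minus real value).

```python
from fractions import Fraction
from math import ceil, floor


def integer_approximation(loads):
    """Round real-valued loads to integers while tracking the accumulated carry.

    `loads` is assumed to hold non-negative real load fractions whose sum is
    an integer (the total load). Returns the integer loads and the final carry.
    """
    carry = Fraction(0)  # accumulated carry after each iteration
    rounded = []
    for value in loads:
        value = Fraction(value)
        up = ceil(value)
        # Round up only if the accumulated carry would stay strictly below 1;
        # otherwise round down, which pulls the carry back toward 0.
        chosen = up if carry + (up - value) < 1 else floor(value)
        carry += chosen - value
        rounded.append(chosen)
    return rounded, carry


if __name__ == "__main__":
    example = [Fraction(5, 2), Fraction(7, 3), Fraction(13, 6)]  # sums to 7
    print(integer_approximation(example))
```

Exact fractions are used so that the carry can be compared with 1 without floating-point error; with floating-point inputs the comparison at the boundary may behave differently.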
At this juncture, an interesting point to verify is whether the entire load gets processed at the completion of the algorithm. Below we show that the algorithm indeed guarantees this aspect.
Lemma 6. In the above algorithm, the accumulated carry $sum_i$ lies between $-1$ and $1$ for all $i = 1, \ldots, m$.
Proof. We prove this by induction. In the first iteration, the above assertion is true. As an induction hypothesis, let us suppose this is true at the $k$th iteration. This means $-1 < sum_k = \sum_{i=1}^{k} \Delta_i < 1$. The following are the two cases to be proven:
Thus, the assertion holds in all the iterations, proving the lemma. □

Thus, the lemma shows that the accumulated carry at any iteration lies between -1 and +1.

Since the first term on the LHS and the term on the RHS are integers, $x$ must also be an integer. From Lemma 6, we observe that $x = 0$, since $x = 0$ is the only integer that lies between -1 and +1, which contradicts our assumption, thus proving the lemma. □
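The counting argument behind this result can be summarized in symbols. The following is a sketch under stated assumptions: $\alpha_i$ denotes the real-valued load fractions, $\alpha'_i$ the integer-valued ones, $\Delta_i = \alpha'_i - \alpha_i$, $L$ the (integer) total load, and $x$ is identified here with the final accumulated carry, which may differ from the exact role $x$ plays in the original proof.

```latex
\sum_{i=1}^{m} \alpha'_i
  \;=\; \sum_{i=1}^{m} \alpha_i + \sum_{i=1}^{m} \Delta_i
  \;=\; L + x ,
\qquad x = \sum_{i=1}^{m} \Delta_i .
% L and every \alpha'_i are integers, so x = \sum_i \alpha'_i - L is an integer.
% If the carry satisfies -1 < x < 1, the only possibility is x = 0,
% and therefore \sum_{i=1}^{m} \alpha'_i = L: the entire load is assigned.
```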
The above lemma justifies the design of the algorithm by showing that, at the completion of the algorithm, the entire load is processed. Now, for a homogeneous system, we shall show that the processing time obtained by the above algorithm exceeds that of the optimal solution by no more than the sum of the communication and computation times of a load equal to the divisibility factor. It may be noted that in the case of a homogeneous system we have just one $m$-processor sequence and hence $\psi_i(m) = \psi(m)$ for all $i = 1, \ldots, m!$. Thus, $T(\alpha, m, \psi(m))$ can be represented simply as $T(\alpha, m)$. Further, we assume that (26) is satisfied for this $m$-processor sequence.

Using the integer approximation algorithm, we obtain the integer load distribution $\alpha' = (52, 26, 13, 6, 3)$. We observe that the results of Lemma 6 and Lemma 7 hold. From (27), for all $i = 1, 2, \ldots, 5$, we obtain the finish times of the processors as 104.021, 104.022, 104.023, 103.024, and 103.025, respectively. The processing time after integer approximation is given by $T(\alpha', 5) = 104.023$. We see that $T(\alpha', 5) < T^*(\alpha^*, 5) + C + E$, thus verifying Theorem 4.
DISCUSSIONS OF THE RESULTS
The linear mathematical model proposed in this paper considers the overhead components that penalize the time performance of the system. While these were neglected in the earlier literature [9] on the grounds that their very small values have minimal effect, under high network traffic conditions the delays may be significant. In fact, our derivation of the closed-form solutions for the processing time clearly reflects the combined effect of the overhead components. Typical values of the network and node delay parameters are on the order of 136-200 μsec, as mentioned in [9] and [25]. Now, if the load fractions assigned to the processors are considerably large, the combined effect of the actual transmission and the overheads may severely affect the performance. From Figs. 3, 5, and 6, we observe this behavior of the processing time and the corresponding overhead factor. Clearly, under severe network and node delay conditions, one has to compromise on the number of processors used in deciding the optimal finish time of the load, as discussed in Sections 3.1 and 3.3. Fig. 7 shows the behavior of the processing time when we vary the overhead parameters. In this figure, we have kept the computation overhead constant ($\theta_{cp} = 0.002$) and shown the impact of the communication overhead $\theta_{cm}$ varying from 0 to 0.05. Clearly, when compared to the case without overheads ($\theta_{cm} = \theta_{cp} = 0$), there is a significant variation in the processing time between the low and high overhead magnitudes. This is apparent from Fig. 7 when $\theta_{cm} = 0.001$ and $\theta_{cm} = 0.05$. However, if we vary $\theta_{cp}$ for a particular value of $\theta_{cm}$, all the curves will have identical $m^*$ values, as the expression in (12) is independent of $\theta_{cp}$.
The closed-form solution for the processing time presented in this paper clearly demonstrates the influence of the overhead factor on the time performance. Using the overhead components, Theorem 1 shows a restricted monotonic behavior of the processing time, as opposed to the observations reported in the literature [9]. As can be seen readily, this behavior leads directly to a condition on the existence of the optimal processing time with a given set of processors. Thus, for a fixed sequence of load distribution, there exists a maximal set of processors (see Fig. 3) that can yield optimal processing time. For very small values of $m$, this behavior may be analytically tractable; however, as $m$ increases, it becomes tedious to track.
An important contribution of this research is the minimization of the processing time using the concept of sequencing. In the literature, sequencing has been shown to have a considerable effect in the case of heterogeneous single-level tree networks [6]. Also, in [9], for bus networks, it was shown that the optimal processing time remains unaffected regardless of the order in which the load is distributed among the processors by the root processor. While this still holds when the overheads are negligible, here we demonstrate the effect of sequencing on the time performance in the presence of all the overhead components. A number of important properties exhibited by the overhead factor given by (7) were derived. Using these properties, we have shown that if the overhead factor for the sequence in which the load is distributed among the processors in the order of decreasing speeds (referred to as the fast sequence) is less than one, then this fast sequence yields an optimal processing time. In fact, in case the fast sequence fails to satisfy (12), it is possible to choose some other sequence that satisfies this condition and can use all the $m$ processors. However, this sequence need not guarantee an optimal processing time. The first part of Example 2 shows this aspect. The concept of sequencing in the case of bus networks thus provides flexibility in the number of processors utilized and may still provide an optimal solution. In other words, if the number of processors is restricted to, say, $k < m$, it is possible to obtain an optimal time performance by choosing the fast sequence with these $k$ processors. Now, in the case where an $m$-processor fast sequence fails to exist, we note that the search space for an optimal sequence is very large, especially when $m$ is large. Example 2 shows that the fast sequence using $m - 1$ processors yields a better processing time than some arbitrary $m$-processor sequence.
However, this does not guarantee that the fast sequence using $m - 1$ processors will yield an optimal processing time. This is a crucial observation, especially if the design requirements depend on the utilization of a certain number of processors. This behavior of the processing time prevents one from making naive decisions on the choice of the number of processors and the sequence in which the load is to be distributed, as these two quantities behave in a somewhat "push-pull" fashion. Under these circumstances, it may be possible to design a number of greedy algorithms to yield suboptimal solutions. However, unless such a greedy algorithm converges to a suboptimal solution in polynomial time, the search could be CPU-time intensive. Also, the suboptimal solution generated by these algorithms must lie within acceptable limits. We suggest the following greedy algorithm, which generates a suboptimal solution suited to most applications. The rationale behind this algorithm lies in making use of a maximal set of processors in the order of their decreasing speeds.
Greedy Algorithm
Let $\Psi^{fst}$ be the set of all fast sequences in $\Psi$. Note that the number of fast sequences in $\Psi$ is $m$, as in each $\Psi_k$, $k = 1, \ldots, m$, we have one fast sequence.
For ($k = m$; $k > 1$; $k = k - 1$) { Let $\psi^{fst}_k \in \Psi^{fst}$ and consider the overhead factor associated with $\psi^{fst}_k$ on $k$ processors. Then,
Clearly, the complexity of the algorithm is $O(m)$. Further, the algorithm considers only the fast sequences in every $\Psi_k$, $k = 1, \ldots, m$. When condition (12) is satisfied for a certain fast sequence in a certain dimension $r \le m$, the algorithm terminates, thus limiting our search to at most $m$ iterations. The concept of sequencing can also be used to make wise decisions on the choice of the network and the number of processors on which the load may be processed. Even if one uses a large network having very low average overhead delays, an inadvertent choice of a sequence, when a fast sequence fails to exist, may yield a worse time performance than choosing a small network with large average delays. This is one of the important trade-off relationships between the number of processors to be used, the combined influence of the overhead components, and the sequence of load distribution. Put differently, the penalty for not obtaining a fast sequence is that an inferior time performance may result. This is crucial, especially when the underlying network has large delays and overheads. Under these circumstances, the proposed greedy algorithm becomes a natural choice, as it tests at most $m$ possible fast sequences to yield a suboptimal solution.
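The search described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's listing: the names `greedy_fast_sequence`, `unit_times`, and `overhead_factor` are hypothetical, the fast sequence on $k$ processors is assumed to consist of the $k$ fastest processors in order of decreasing speed (increasing computation time per unit load), and the overhead factor of (7) is not reproduced here, so it is passed in as a caller-supplied function. The loop runs $k$ from $m$ down to 1, in line with the statement that at most $m$ fast sequences are tested.

```python
def greedy_fast_sequence(unit_times, overhead_factor):
    """Return the largest fast sequence whose overhead factor is below 1.

    unit_times[i] is the (assumed) computation time per unit load of
    processor i; a smaller value means a faster processor.
    overhead_factor(sequence) is a caller-supplied stand-in for the
    overhead factor of a sequence of processor indices.
    Returns a list of processor indices, or None if no fast sequence qualifies.
    """
    # Order processors by decreasing speed, i.e. increasing time per unit load.
    order = sorted(range(len(unit_times)), key=lambda i: unit_times[i])
    # Try the fast sequence on k processors for k = m, m - 1, ..., 1.
    for k in range(len(order), 0, -1):
        candidate = order[:k]
        if overhead_factor(candidate) < 1:
            return candidate
    return None


if __name__ == "__main__":
    # Toy overhead factor that grows with the number of processors
    # (purely illustrative; it is not the expression in (7)).
    toy = lambda seq: 0.3 * len(seq)
    print(greedy_fast_sequence([4.0, 1.0, 3.0, 2.0, 5.0], toy))
```

The loop evaluates the overhead factor at most $m$ times, so the cost is linear in $m$ apart from the initial sort and the cost of each evaluation.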
Even though divisible load analysis provides complete flexibility in tracing the time behavior of the system, in practice not all loads can exploit this kind of data parallelism. In fact, the load fractions will be integer valued and assume a size that is an integral multiple of some fundamental size, referred to as the granularity in the multiprocessor scheduling literature [5]. We tune our optimal load distribution to fit this requirement in our integer approximation algorithm. By doing this, one can apply the results of divisible load analysis to any data partitioning problem and obtain a solution whose processing time exceeds that of the optimal solution by no more than the sum of the computation and communication times. It may be noted that even though we have derived the upper bound on the suboptimal solution generated by the algorithm for a class of homogeneous networks, the algorithm as such can be applied to a heterogeneous set of processors and can still yield an acceptable performance.
CONCLUSIONS
The problem of distributing a divisible load on bus networks is presented rigorously in this paper using a mathematical model that accounts for all the overhead delays that penalize the processing time performance. We have derived a closed-form expression for the processing time, including these overheads, and the effect of these overheads is demonstrated. This was done by assuming that the sequence of load distribution follows a fixed order. A necessary and sufficient condition on the existence of an optimal processing time is derived. The analysis was then extended to the case of homogeneous networks, too. We then introduced the notion of sequencing in bus networks. As mentioned earlier, in the DLT literature, in the case of bus networks, it has been proven that the processing time remains independent of the sequence in which the load is distributed among the processors. However, with the inclusion of overheads, we have demonstrated the effect of sequencing for the case of bus networks, too. To identify an optimal sequence, a number of intermediate results eliciting important characteristics of the behavior of the processing time in the presence of nonzero overhead components were rigorously proven. It was then shown that the processing time is minimized when the load distribution sequence follows the order in which the processor speeds decrease, provided such a sequence exists.
Even though infinite divisibility of the processing load is assumed, in practice, all the data that can be partitioned and processed are in terms of a fundamental quantity called granularity. Our model accounts for this practical constraint by approximating the real valued load fractions generated by the optimal algorithm to integer valued load fractions. To do this, we propose an algorithm that generates integer load fractions for the case of homogeneous networks. With this algorithm, we show that the suboptimal solution generated by this algorithm lies within a radius of not more than the sum of the computation and communication times from the optimal solution. We can also apply this integer approximation algorithm for a heterogeneous network to obtain results that are well within acceptable solution limits. Extensions to this research seem plausible, as one can attempt to design a greedy algorithm that generates a suboptimal solution better than the one suggested in Section 6. With the problem formulation presented in this paper, it would be interesting to conduct a rigorous study on load distribution problems on other networks.
APPENDIX
We present the proofs for Lemmas 2, 3, 4, and 5, and Theorems 2 and 3, stated in Section 4.2.
Proof of Lemma 2. It may be noted that $\psi_1(m) = (\psi_1(m-1), p_m)$ is a concatenation of $\psi_1(m-1)$ and $p_m$. From (7), we obtain (42). Hence, from (33), (34), and (42), we obtain
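Proof of Lemma 3. We shall prove the lemma by induction. When $m = 2$, we have $\psi_1(2) = (p_1, p_2)$, $\psi_2(2) = (p_2, p_1)$, and $E_1 < E_2$. From (7), the overhead factors of $\psi_1(2)$ and $\psi_2(2)$ are $\theta_{cm}/E_1$ and $\theta_{cm}/E_2$, respectively. Since $E_1 < E_2$, the overhead factor of $\psi_1(2)$ is greater than that of $\psi_2(2)$. Hence the claim is true. Let the claim be true for $m = r$, which means that the overhead factor of $\psi_1(r)$ is greater than that of $\psi_2(r)$, where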
