INTRODUCTION
Scheduling problems that arise in optimally distributing jobs among a set of available processors, with the objective of minimizing the processing time, are an important area of research in computing and communication. The prime objective in this area of research is to design efficient scheduling algorithms that minimize the total processing time [1, 2]. Research on scheduling divisible loads in multiprocessor systems started in 1988 and has stimulated considerable interest among researchers and engineers. A divisible load can be divided into any number of fractions, and these fractions can be processed independently on the processors, as there are no precedence relationships. The problem of scheduling divisible loads in a linear network incorporating the associated communication delays was first introduced in [3], which is the first paper in the area of divisible load scheduling. That paper introduced the timing diagram representation of the load distribution process and the recursive load distribution equations. Its ideas were extended to scheduling divisible loads in tree networks and bus networks in [4, 5]. In these studies, the optimal load fractions are obtained by assuming that all the processors involved in the computation of the load stop computing at the same time instant. This assumption has been shown to be a necessary and sufficient condition for obtaining the optimal processing time in linear networks [6], using the concept of processor equivalence, and an analytic proof for bus networks is given in [7]. However, it has been rigorously shown [8] that this condition is true only in a restricted sense in the case of heterogeneous single-level tree networks. A closed-form expression for the processing time of a single-level tree network is presented in [9, 10], and using this closed-form expression, the optimal sequence and the optimal network arrangement are obtained in [9].
For the case of homogeneous linear and tree networks, closed-form expressions for the processing time are derived and an asymptotic performance analysis is carried out in [11, 12]. A practical application of divisible load scheduling to matrix-vector products of very large size, presented in [13], shows the usefulness of the analysis.
In this paper, scheduling divisible loads in a bus network architecture is considered. In practical data communication and computing situations, there can be overheads in communication ($\theta_{cm}$) and computation ($\theta_{cp}$). The communication overheads ($\theta_{cm}$) arise from protocol processing delays, unavailability of certain communication resources, queuing delays, etc. [14, 15]. Similarly, the computation overheads ($\theta_{cp}$) arise from delays in extracting the data, processor initialization, etc. [14, 15]. These overheads are almost constant quantities and appear as an additive component in the load distribution equations [14, 15]. They have been considered in the literature for some specific cases [15-17], such as query processing and image processing applications [15], and for different architectures [16, 17]. In a recent study [18], the effect of these overheads on the processing time is presented.
The Contribution of this Paper
With these overhead factors in communication and computation, we first derive a closed-form expression for the processing time, from which the processing time can be obtained directly. Using this closed-form expression, we obtain the optimal number of processors and the optimal sequence of load distribution. This paper is organized as follows. Section 2 presents the mathematical modeling and relevant definitions. In Section 3, we present the closed-form expression for the processing time and a comparison with the results obtained in an earlier study [18]. Section 4 presents the optimal sequence of load distribution, and Section 5 presents the conclusion. Since this paper presents an alternative approach to the problem dealt with in the earlier study [18], for convenience we follow the notation used there.
MATHEMATICAL MODELING AND DEFINITION
The bus network architecture considered in this paper is shown in Figure 1. This network has a dedicated bus controller unit (BCU) to distribute the entire load. The divisible load arrives at the BCU; the BCU divides the load into $m$ fractions $\alpha_1, \alpha_2, \ldots, \alpha_m$ and distributes these load fractions to the $m$ processors in the sequence $p_1, p_2, \ldots, p_m$, one after another. Each processor starts computing immediately after receiving its load fraction. The objective here is to find the optimal size of the load fractions $\alpha_1, \alpha_2, \ldots, \alpha_m$ such that the processing time is a minimum.
2. The finish time of processor $p_i$, denoted as $T_i(\alpha, m)$, is the time difference between the instant at which the $i$th processor stops computing and the time instant at which the BCU initiates the load distribution process.
3. The processing time, denoted as $T(\alpha, m)$, is the time at which the entire load is processed, i.e., $T(\alpha, m) = \max\{T_i(\alpha, m),\ i = 1, 2, \ldots, m\}$, where $T_i$ is the finish time of processor $p_i$.
4. The optimal processing time, denoted as $T^*(\alpha^*, m)$, is the minimum processing time to finish the entire load, i.e., $T^*(\alpha^*, m) = \min_{\alpha \in \Gamma}\{T(\alpha, m)\}$.
In the literature [8], it has been rigorously proved that, for the optimal processing time, all the processors involved in the computation of the processing load must stop computing at the same time instant. We use this optimality criterion in this paper as well.
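The definitions above can be illustrated with a minimal Python sketch. It assumes a timing model that this section does not spell out: the BCU sends fraction $\alpha_i$ over the bus in time $\alpha_i C + \theta_{cm}$, transmissions are strictly sequential, and processor $p_i$ then computes for $\alpha_i E_i + \theta_{cp}$. The function names and the numerical values are hypothetical.

```python
def finish_times(alpha, E, C, theta_cm, theta_cp):
    """Finish time T_i(alpha, m) of every processor, measured from the
    instant the BCU starts distributing (assumed timing model)."""
    times = []
    bus_busy_until = 0.0
    for a, e in zip(alpha, E):
        # The BCU sends the fractions one after another over the shared bus.
        bus_busy_until += a * C + theta_cm
        # The processor starts computing as soon as its fraction has arrived.
        times.append(bus_busy_until + a * e + theta_cp)
    return times


def processing_time(alpha, E, C, theta_cm, theta_cp):
    """T(alpha, m) = max over i of T_i(alpha, m)."""
    return max(finish_times(alpha, E, C, theta_cm, theta_cp))


# Equal split over two identical processors, no overheads (illustrative values).
print(finish_times([0.5, 0.5], [1.0, 1.0], 0.5, 0.0, 0.0))
```

In this illustrative instance the first processor finishes before the second, so the equal split leaves it idle at the end; shifting load toward it until both finish together is what the optimality criterion asks for.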
CLOSED-FORM EXPRESSION FOR THE PROCESSING TIME
We now derive a closed-form expression for the processing time, assuming that the load is distributed in the sequence $p_1, p_2, \ldots, p_m$; that is, the BCU sends the load fractions to processors $p_1$ through $p_m$ one after another. From the timing diagram shown in Figure 2, the recursive equations for load distribution are
$$\alpha_i E_i = \alpha_{i+1}(E_{i+1} + C) + \theta_{cm}, \quad i = 1, 2, \ldots, m-1. \tag{1}$$
Denoting $(E_{i+1} + C)/E_i = f_{i+1}$ and $\theta_{cm}/E_i = \beta_i$ for all $i = 1, 2, \ldots, m-1$, equation (1) can be rewritten as
$$\alpha_i = \alpha_{i+1} f_{i+1} + \beta_i, \quad i = 1, 2, \ldots, m-1. \tag{2}$$
There are thus $m-1$ linear equations in $m$ variables; together with the normalization equation $\sum_{i=1}^{m} \alpha_i = 1$, we have $m$ equations. In the earlier study [18], these equations are solved as follows to obtain the individual load fractions. Each $\alpha_i$ in equation (2) is expressed in terms of $\alpha_m$ as $\alpha_i = M_i \alpha_m + N_i$, where the coefficients $M_i$ and $N_i$ depend only on the $f_j$ and $\beta_j$, and $\alpha_m$ is then obtained from the normalization equation.
The processing time is $T(\alpha, m) = \alpha_1(E_1 + C) + \theta_{cp} + \theta_{cm}$. Substituting the value of $\alpha_1$ in this equation,
$$T(\alpha, m) = (M_1 \alpha_m + N_1)(E_1 + C) + \theta_{cp} + \theta_{cm},$$
where $\alpha_m$, $M_1$, and $N_1$ are defined as above.
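The solution procedure just described can be sketched in Python as follows. This is a minimal illustration, not the implementation of [18]: it assumes the recursion $\alpha_i = f_{i+1}\alpha_{i+1} + \beta_i$ with $f_{i+1} = (E_{i+1} + C)/E_i$ and $\beta_i = \theta_{cm}/E_i$, builds the coefficients of $\alpha_i = M_i\alpha_m + N_i$ by back-substitution, and uses exact rational arithmetic. The function names and the numerical values are hypothetical.

```python
from fractions import Fraction


def load_fractions(E, C, theta_cm):
    """Solve alpha_i = f_{i+1} * alpha_{i+1} + beta_i (i = 1..m-1) together
    with sum(alpha) = 1, writing each alpha_i as M_i * alpha_m + N_i."""
    E = [Fraction(e) for e in E]
    C, theta_cm = Fraction(C), Fraction(theta_cm)
    M, N = [Fraction(1)], [Fraction(0)]      # alpha_m = 1 * alpha_m + 0
    for i in range(len(E) - 2, -1, -1):      # back-substitute towards p_1
        f = (E[i + 1] + C) / E[i]
        beta = theta_cm / E[i]
        M.insert(0, f * M[0])
        N.insert(0, f * N[0] + beta)
    alpha_m = (1 - sum(N)) / sum(M)          # normalization equation
    return [Mi * alpha_m + Ni for Mi, Ni in zip(M, N)]


def processing_time(E, C, theta_cm, theta_cp):
    """T = alpha_1 * (E_1 + C) + theta_cp + theta_cm."""
    alpha_1 = load_fractions(E, C, theta_cm)[0]
    return (alpha_1 * (Fraction(E[0]) + Fraction(C))
            + Fraction(theta_cp) + Fraction(theta_cm))


# Two processors with illustrative parameters.
print(load_fractions([2, 3], 1, "1/2"))
print(processing_time([2, 3], 1, "1/2", "1/4"))
```

A negative entry in the returned list signals that the chosen processors cannot all be used in that order.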
It is shown in [18] that, with the inclusion of these overheads, it may not be necessary to use all $m$ processors in the system to obtain the optimal processing time: there exists a maximum number of processors $m^*$ that can be utilized with the given sequence of load distribution. The necessary and sufficient condition for the existence of an optimal processing time using all $m$ processors in a specific order is given by $X(m) < 1$.
Alternate Approach
In our (alternate) approach, the value of $\alpha_1$ is obtained as follows. Express all the $\alpha_i$ ($i = 1, 2, \ldots, m-1$) in terms of $\alpha_m$.
Obtain the value of $\alpha_m$ using the normalization equation. Using this value of $\alpha_m$, the value of $\alpha_1$ is obtained in closed form.
Since $\alpha_1$ is known, the other load fractions can be obtained from equation (2). Once $m$ is given, $\alpha_1$ can be obtained directly in our approach, and hence the processing time can also be obtained directly.
In the earlier study, when $X(m) < 1$, the processing time is obtained as follows. First, the value of $\alpha_m$ is obtained, and then the value of $\alpha_1$ is obtained using $\alpha_m$. In our approach, $\alpha_1$ is obtained directly.
While obtaining the value of $\alpha_1$, the necessary and sufficient condition for the existence of a solution, $X(m) < 1$, is not considered. It is mentioned in [18] that, for an $m$-processor system, there exists an $m^*$ (the optimal number of processors) beyond which an optimal solution ceases to exist. This is because, once this condition is not satisfied, some of the load fractions will be negative.
In our closed-form expression, this violation of the necessary and sufficient condition is reflected as an increase in the value of $\alpha_1$. We now show that all the results of the earlier study can be proved easily using the closed-form expression obtained in our approach. First, we show the existence of an optimal number of processors, as obtained in [18]. For this purpose, we write the value of $\alpha_1$ obtained in our approach as
$$\alpha_1(m) = \frac{M_1(m)}{Y(m)} + \frac{Z(m)}{Y(m)},$$
where $M_1(m)$ is the value of $M_1$ with $m$ processors. We know that $M_1(m)/Y(m)$ decreases with increasing $m$, and $Z(m)/Y(m)$ increases with increasing $m$. Hence, there is an optimal number of processors $m^*$ such that the value of $\alpha_1$ decreases up to $m^*$ and increases beyond $m^*$; the processing time behaves in the same way.
It is sufficient to study the behavior of $\alpha_1$ to study the behavior of the processing time. Hence, $\alpha_1(m^*)$ has the following properties:
$$\alpha_1(m^*) < \alpha_1(m^* - 1), \qquad \alpha_1(m^*) < \alpha_1(m^* + 1).$$
From the earlier study, we see that the necessary and sufficient condition for the existence of an optimal processing time with $m^*$ processors in a specific sequence is given by $X(m^*) < 1$. We will now show that the $m^*$ obtained in our approach also satisfies this condition. In the earlier study, it is shown that beyond this optimal number of processors $m^*$, an optimal solution ceases to exist, because $X(m^* + k) > 1$ for $k = 1, 2, \ldots$, so that some of the load fractions are negative. In our approach, this fact appears as an increase in the processing time.
This condition is the same as $X(m+1) > 1$. Hence, $\alpha_1(m) < \alpha_1(m+1)$ only when $X(m+1) > 1$. From the above two lemmas, we can see that our closed-form expression for $\alpha_1$ (and hence the processing time) has a minimum for an optimal number of processors $m^*$ such that $\alpha_1(m^*) < \alpha_1(m^* - 1)$ and $\alpha_1(m^*) < \alpha_1(m^* + 1)$.
The processing time decreases as processors are added up to $m^*$ and increases with additional processors beyond $m^*$. Note that the necessary and sufficient condition given in [18] for the existence of an optimal processing time is satisfied in our approach, so the load fractions assigned to the processors in our approach are the same as those assigned in the earlier approach. Hence, we can say that $m^*$ is the optimal number of processors only when $\alpha_1(m^*) < \alpha_1(m^* - 1)$ and $\alpha_1(m^*) < \alpha_1(m^* + 1)$.
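The search for $m^*$ described above can be sketched numerically. The Python fragment below is an illustration under stated assumptions, not the paper's closed form: it assumes the recursion $\alpha_i = f_{i+1}\alpha_{i+1} + \beta_i$ with $f_{i+1} = (E_{i+1} + C)/E_i$ and $\beta_i = \theta_{cm}/E_i$, evaluates $\alpha_1$ for the first $m$ processors of a given sequence, and picks the $m$ with the smallest $\alpha_1$. The function names and the parameter values are hypothetical.

```python
from fractions import Fraction


def first_and_last_fraction(E, C, theta_cm):
    """Return (alpha_1, alpha_m) for the sequence E[0], E[1], ..., E[m-1]."""
    E = [Fraction(e) for e in E]
    C, theta_cm = Fraction(C), Fraction(theta_cm)
    M, N = Fraction(1), Fraction(0)      # alpha_i = M * alpha_m + N, starting at i = m
    sum_M, sum_N = M, N
    for i in range(len(E) - 2, -1, -1):
        f = (E[i + 1] + C) / E[i]
        M, N = f * M, f * N + theta_cm / E[i]
        sum_M += M
        sum_N += N
    alpha_m = (1 - sum_N) / sum_M        # normalization equation
    return M * alpha_m + N, alpha_m


def optimal_processor_count(E, C, theta_cm):
    """Return (m*, [alpha_1(1), ..., alpha_1(len(E))]) for the given sequence."""
    values = [first_and_last_fraction(E[:m], C, theta_cm)[0]
              for m in range(1, len(E) + 1)]
    return values.index(min(values)) + 1, values


m_star, values = optimal_processor_count([1, 2, 3], 1, "1/2")
print(m_star, values)
# Sign of the last load fraction when one processor too many is used.
print(first_and_last_fraction([1, 2, 3], 1, "1/2")[1])
```

For these illustrative values, adding the third processor raises $\alpha_1$ and at the same time drives the last load fraction negative, which is the correspondence between the two approaches described in the text.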
Homogeneous System
As a special case, consider a homogeneous system with $w_i = w$, and hence $E_i = E$, for $i = 1, 2, \ldots, m$. We will show the condition on $\beta$ under which $m$ is the optimal number of processors. For this, it is sufficient to consider the value of $\alpha_1$:
$$\alpha_1(m-1) = \frac{f^{m-2} + \beta\,(1 + 2f + 3f^2 + \cdots + (m-2) f^{m-3})}{1 + f + f^2 + \cdots + f^{m-2}}, \tag{23}$$
$$\alpha_1(m+1) = \frac{f^{m} + \beta\,(1 + 2f + 3f^2 + \cdots + m f^{m-1})}{1 + f + f^2 + \cdots + f^{m}}.$$
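The homogeneous expressions can be evaluated directly. The sketch below assumes that $\alpha_1(m)$ follows the same pattern as the expressions above, that is, $\alpha_1(m) = \bigl(f^{m-1} + \beta \sum_{k=1}^{m-1} k f^{k-1}\bigr) / \sum_{j=0}^{m-1} f^{j}$ with $f = (E + C)/E$ and $\beta = \theta_{cm}/E$. The function names and the numerical values are illustrative assumptions.

```python
from fractions import Fraction


def alpha_1_homogeneous(m, E, C, theta_cm):
    """alpha_1(m) for m identical processors (assumed closed-form pattern)."""
    E, C, theta_cm = Fraction(E), Fraction(C), Fraction(theta_cm)
    f = (E + C) / E
    beta = theta_cm / E
    numerator = f ** (m - 1) + beta * sum(k * f ** (k - 1) for k in range(1, m))
    denominator = sum(f ** j for j in range(m))
    return numerator / denominator


def is_optimal_count(m, E, C, theta_cm):
    """True when alpha_1(m) is below both alpha_1(m - 1) and alpha_1(m + 1)."""
    if m < 2:
        return False
    here = alpha_1_homogeneous(m, E, C, theta_cm)
    return (here < alpha_1_homogeneous(m - 1, E, C, theta_cm)
            and here < alpha_1_homogeneous(m + 1, E, C, theta_cm))


for m in range(1, 7):
    print(m, float(alpha_1_homogeneous(m, 1, "0.4", "0.1")),
          is_optimal_count(m, 1, "0.4", "0.1"))
```

The two comparisons in `is_optimal_count` are exactly the two inequalities whose conditions on $\beta$ are sought in the text.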
First, we obtain the condition on $\beta$ for which $\alpha_1(m) < \alpha_1(m-1)$.
Next, we obtain the condition on $\beta$ for which $\alpha_1(m) < \alpha_1(m+1)$. We now present the numerical results obtained using the speed parameters given in [18]. In our approach also, the processing time is given by
$$T(\alpha, m) = \alpha_1(E_1 + C) + \theta_{cp} + \theta_{cm},$$
as given in [18]. It is therefore sufficient to consider the behaviour of $\alpha_1$ to study the behaviour of the processing time. We know that $\alpha_1(m)$ has two components: one that does not involve the overheads and one that is due to the overheads.
The numerical results are given in Table 1. From Table 1, we can see that for $C = 0.4$, the processing time decreases up to the optimal number of processors ($m^* = 6$) and then starts increasing. For the case $C = 0.2$, the processing time decreases up to the optimal number of processors ($m^* = 10$) and then increases. As expected, the optimal number of processors is the same as that obtained in [18]. Figure 3 shows the behaviour of the processing time with the number of processors for $C = 0.2$: the component of the processing time without the overhead, the component due to the overhead, and the total of the two components. Because of the numerical values of $E_i$ and $\theta_{cm}$, the increase in the overhead component with the number of processors is very small, and hence the decrease and increase in the processing time before and after the optimal number of processors are small. Figure 4 presents the processing time results for the homogeneous network with numerical values $E_i = 1$ for $i = 1, 2, \ldots, m$, $C = 0.4$, $\theta_{cm} = 0.1$, and $\theta_{cp} = 0.02$. In this figure, the increase in the overhead component is not small (unlike in the heterogeneous case), and hence the behaviour of the processing time before and after the optimal number of processors is clearer.
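The split of the processing time into an overhead-free component and an overhead component can be sketched for the homogeneous case. This is an illustration under assumptions: the homogeneous closed form $\alpha_1(m) = \bigl(f^{m-1} + \beta \sum_{k=1}^{m-1} k f^{k-1}\bigr) / \sum_{j=0}^{m-1} f^{j}$ is used, and the constant terms $\theta_{cp} + \theta_{cm}$ are counted in the overhead component, which is a choice made here rather than one confirmed by the figures. The function name and parameter values are hypothetical.

```python
from fractions import Fraction


def time_components(m, E, C, theta_cm, theta_cp):
    """Split T(m) for m identical processors into
    (overhead-free component, overhead component)."""
    E, C = Fraction(E), Fraction(C)
    theta_cm, theta_cp = Fraction(theta_cm), Fraction(theta_cp)
    f, beta = (E + C) / E, theta_cm / E
    geometric_sum = sum(f ** j for j in range(m))
    weighted_sum = sum(k * f ** (k - 1) for k in range(1, m))
    no_overhead = f ** (m - 1) / geometric_sum * (E + C)
    overhead = (beta * weighted_sum / geometric_sum * (E + C)
                + theta_cm + theta_cp)
    return no_overhead, overhead


for m in range(1, 9):
    base, extra = time_components(m, 1, "0.4", "0.1", "0.02")
    print(m, float(base), float(extra), float(base + extra))
```

The first component shrinks and the second grows as processors are added, so their sum has a minimum at some intermediate number of processors.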
It is important to note the following: we are not using this closed-form expression to obtain the load fractions beyond the optimal number of processors $m^*$. It is mentioned in [18] that, beyond this $m^*$, the optimal solution ceases to exist, because some of the load fractions will be negative. The fact that some of the load fractions are negative is reflected in our approach as an increase in the value of $\alpha_1$.
CONCEPT OF SEQUENCING
The advantage of our closed-form expression is that it can be used directly to obtain the optimal sequence of load distribution. For the sake of clarity, we first illustrate the optimal sequence for the case $m = 3$ and then generalize the result. We also assume that $X(3) < 1$. The value of $\alpha_1$, for a given sequence of load distribution, is
$$\alpha_1 = \frac{f_2 f_3 + \beta_1(1 + f_3) + \beta_2 f_2}{1 + f_3 + f_2 f_3}.$$
Rewriting this expression in terms of $E_i$ ($i = 1, 2, 3$) and $\theta_{cm}$,
$$\alpha_1 = \frac{(E_2 + C)(E_3 + C) + \theta_{cm}(2E_2 + E_3 + 2C)}{E_1 E_2 + E_1(E_3 + C) + (E_2 + C)(E_3 + C)}.$$
In the above expression, the sequence of load distribution is $(p_1, p_2, p_3)$, i.e., the BCU first sends the load fraction to processor $p_1$ (speed $E_1$), next to processor $p_2$ (speed $E_2$), and last to processor $p_3$ (speed $E_3$). Let the BCU change the sequence of load distribution to $(p_1, p_3, p_2)$, i.e., first send the load fraction to processor $p_1$ (speed $E_1$), next to processor $p_3$ (speed $E_3$), and last to processor $p_2$ (speed $E_2$). We denote the value of $\alpha_1$ for this sequence as $\alpha'_1$. Note that $\alpha'_1$ can be obtained by interchanging $E_2$ and $E_3$ in the earlier expression:
$$\alpha'_1 = \frac{(E_2 + C)(E_3 + C) + \theta_{cm}(2E_3 + E_2 + 2C)}{E_1 E_3 + E_1(E_2 + C) + (E_2 + C)(E_3 + C)}.$$
We have to find the condition for which $\alpha_1 < \alpha'_1$. The denominators of $\alpha_1$ and $\alpha'_1$ are the same, and the first terms in their numerators are also the same. Hence,
$$\alpha_1 - \alpha'_1 = \frac{\theta_{cm}(E_2 - E_3)}{D},$$
where $D$ is the denominator of $\alpha'_1$ or $\alpha_1$. From this, we can say that the processing time for the sequence $(p_1, p_2, p_3)$ is less than or equal to the processing time for the sequence $(p_1, p_3, p_2)$ only when $E_2$ is less than or equal to $E_3$.
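The three-processor interchange argument can be checked numerically. The Python sketch below assumes the recursion $\alpha_i = f_{i+1}\alpha_{i+1} + \beta_i$ with $f_{i+1} = (E_{i+1} + C)/E_i$ and $\beta_i = \theta_{cm}/E_i$, solves it for $m = 3$, and compares $\alpha_1 - \alpha'_1$ with $\theta_{cm}(E_2 - E_3)/D$, where $D = E_1 E_2 + E_1(E_3 + C) + (E_2 + C)(E_3 + C)$ is the common denominator implied by that recursion. The function name and the numerical values are hypothetical.

```python
from fractions import Fraction


def alpha_1_three(E1, E2, E3, C, theta_cm):
    """alpha_1 for the sequence (E1, E2, E3), from the recursion and sum(alpha) = 1."""
    E1, E2, E3 = Fraction(E1), Fraction(E2), Fraction(E3)
    C, theta_cm = Fraction(C), Fraction(theta_cm)
    f2, f3 = (E2 + C) / E1, (E3 + C) / E2
    b1, b2 = theta_cm / E1, theta_cm / E2
    alpha_3 = (1 - b1 - b2 * (1 + f2)) / (1 + f3 + f2 * f3)
    alpha_2 = f3 * alpha_3 + b2
    return f2 * alpha_2 + b1


E1, E2, E3, C, theta_cm = 1, 2, 3, 1, Fraction(1, 10)
original = alpha_1_three(E1, E2, E3, C, theta_cm)   # sequence (p1, p2, p3)
swapped = alpha_1_three(E1, E3, E2, C, theta_cm)    # sequence (p1, p3, p2)
D = E1 * E2 + E1 * (E3 + C) + (E2 + C) * (E3 + C)
print(original, swapped, original - swapped, theta_cm * (E2 - E3) / D)
```

Because $E_2 < E_3$ in this example, the difference is negative, so sending to the faster of the two processors first gives the smaller $\alpha_1$.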
Generalization
For $m$ processors, consider the BCU distributing the load fractions to the processors in the sequence $(p_1, p_2, \ldots, p_i, p_{i+1}, \ldots, p_m)$.
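When we interchange $E_i$ and $E_{i+1}$ in the above expression, only $f_{i+2}$, $f_{i+1}$, and $f_i$ change. The changed values are denoted as $g_{i+2}$, $g_{i+1}$, and $g_i$, defined as follows:
$$g_i = \frac{E_{i+1} + C}{E_{i-1}}, \qquad g_{i+1} = \frac{E_i + C}{E_{i+1}}, \qquad g_{i+2} = \frac{E_{i+2} + C}{E_i}.$$
LEMMA 3. The processing time for the sequence $(p_1, p_2, \ldots, p_i, p_{i+1}, \ldots, p_m)$ is less than or equal to the processing time for the sequence $(p_1, p_2, \ldots, p_{i+1}, p_i, \ldots, p_m)$ only when $E_i \le E_{i+1}$.
The concept of sequencing provides a method by which the minimum processing time can be achieved. However, we have not included the first processor in the concept of sequencing, i.e., in the interchange argument $i = 2, 3, \ldots, m-1$. We now prove the speed condition on the first processor. For this purpose, we consider a bus network with only two processors, $p_1$ (speed $E_1$) and $p_2$ (speed $E_2$).
CASE (i). SEQUENCE OF LOAD DISTRIBUTION $(p_1, p_2)$. Let $T(\alpha, 2)$ be the processing time for this sequence of load distribution. It is obtained as
$$T(\alpha, 2) = \frac{E_2 + C + \theta_{cm}}{E_1 + E_2 + C}\,(E_1 + C) + \theta_{cp} + \theta_{cm}.$$
CASE (ii). SEQUENCE OF LOAD DISTRIBUTION $(p_2, p_1)$. Let $T(\alpha', 2)$ be the processing time for this sequence. Interchanging $E_1$ and $E_2$ in the above expression gives
$$T(\alpha', 2) = \frac{E_1 + C + \theta_{cm}}{E_1 + E_2 + C}\,(E_2 + C) + \theta_{cp} + \theta_{cm}.$$
Hence,
$$T(\alpha, 2) - T(\alpha', 2) = \frac{(E_2 + C + \theta_{cm})(E_1 + C) - (E_1 + C + \theta_{cm})(E_2 + C)}{D},$$
where $D$ is the denominator of $T(\alpha', 2)$ (or $T(\alpha, 2)$).
This reduces to
$$T(\alpha, 2) - T(\alpha', 2) = \frac{\theta_{cm}}{E_1 + E_2 + C}\,(E_1 - E_2).$$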
Hence, $T(\alpha, 2) \le T(\alpha', 2)$ only when $E_1 \le E_2$. From this, we can say that the first processor should be the fastest. Note that, to find the speed condition on the first processor, we have to use the processing time expression, whereas for the speed condition on the other processors it is sufficient to consider the $\alpha_1$ expression. Though we have chosen only two processors to prove the condition on the speed of the first processor, for an $m$-processor system this can be proved in a similar fashion, as done for a single-level tree network in Lemma 7.3 of [8].
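The two-processor comparison can be verified with exact arithmetic. The sketch below assumes the two-processor solution of the recursion, $\alpha_1 = (E_{second} + C + \theta_{cm})/(E_{first} + E_{second} + C)$, and the processing time $T = \alpha_1(E_{first} + C) + \theta_{cp} + \theta_{cm}$. The function name and the numerical values are hypothetical.

```python
from fractions import Fraction


def two_processor_time(E_first, E_second, C, theta_cm, theta_cp):
    """Processing time when the BCU serves E_first before E_second."""
    E_first, E_second, C = Fraction(E_first), Fraction(E_second), Fraction(C)
    theta_cm, theta_cp = Fraction(theta_cm), Fraction(theta_cp)
    # From alpha_1 * E_first = alpha_2 * (E_second + C) + theta_cm
    # and alpha_1 + alpha_2 = 1.
    alpha_1 = (E_second + C + theta_cm) / (E_first + E_second + C)
    return alpha_1 * (E_first + C) + theta_cp + theta_cm


E1, E2, C = 2, 3, 1
theta_cm, theta_cp = Fraction(1, 2), Fraction(1, 4)
T = two_processor_time(E1, E2, C, theta_cm, theta_cp)          # sequence (p1, p2)
T_swapped = two_processor_time(E2, E1, C, theta_cm, theta_cp)  # sequence (p2, p1)
print(T, T_swapped, T - T_swapped, theta_cm * (E1 - E2) / (E1 + E2 + C))
```

With $E_1 < E_2$, serving $p_1$ first gives the smaller processing time, in line with the condition that the first processor should be the fastest.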
In the earlier study [18], the fast sequence is defined as the sequence $(p_1, p_2, \ldots, p_i, p_{i+1}, \ldots, p_m)$ such that $E_i < E_{i+1}$ for all $i = 1, 2, \ldots, m-1$. For an $m$-processor system, $m!$ different load distribution sequences are possible. It is possible in this analysis to have a nonoptimal sequence of load distribution that uses additional processors. For example, let the optimal number of processors for an $m$-processor system, using an optimal sequence, be $m^*$, and let the value of $\alpha_1$ for this be $\alpha_1(m^*)$.
Let the optimal number of processors for the same $m$-processor system using a nonoptimal sequence be $m^* + k$, and let the value of $\alpha_1$ for this be $\alpha'_1(m^* + k)$. Here, because the sequence is nonoptimal, we can use more processors. We have to prove that $\alpha_1(m^*) < \alpha'_1(m^* + k)$, i.e., that the processing time with the optimal sequence and the optimal number of processors is less than the processing time with a nonoptimal sequence and the corresponding optimal number of processors (for this nonoptimal sequence). By rearranging the nonoptimal sequence using the sequencing analysis, we can obtain the optimal sequence. Let the value of $\alpha_1$ obtained after rearrangement be $\alpha_1(m^* + k)$. Based on the sequencing analysis, we know that the value of $\alpha_1$ with an optimal sequence is less than the value of $\alpha_1$ with a nonoptimal sequence, i.e., $\alpha_1(m^* + k) < \alpha'_1(m^* + k)$. Now, $\alpha_1(m^*)$ and $\alpha_1(m^* + k)$ are both obtained using an optimal sequence. From Lemmas 1 and 2, we know that for any given sequence of load distribution $\alpha_1(m^*) < \alpha_1(m^* + k)$, and hence $\alpha_1(m^*) < \alpha'_1(m^* + k)$. Based on the above analysis, we can state the following lemma.
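The argument above can be illustrated by brute force on a small instance. The following Python sketch enumerates every ordered selection of processors, keeps those for which all load fractions are positive, and reports the one with the smallest processing time. It assumes the recursion $\alpha_i = f_{i+1}\alpha_{i+1} + \beta_i$ and $T = \alpha_1(E_1 + C) + \theta_{cp} + \theta_{cm}$; the function names and parameter values are hypothetical, and exhaustive search is used only as a check, not as the paper's method.

```python
from fractions import Fraction
from itertools import permutations


def load_fractions(E, C, theta_cm):
    """Load fractions for the sequence E, from the recursion and sum(alpha) = 1."""
    M, N = [Fraction(1)], [Fraction(0)]
    for i in range(len(E) - 2, -1, -1):
        f = (E[i + 1] + C) / E[i]
        M.insert(0, f * M[0])
        N.insert(0, f * N[0] + theta_cm / E[i])
    alpha_m = (1 - sum(N)) / sum(M)
    return [Mi * alpha_m + Ni for Mi, Ni in zip(M, N)]


def sequence_time(seq, C, theta_cm, theta_cp):
    """Processing time of a sequence, or None if some load fraction is not positive."""
    seq = [Fraction(e) for e in seq]
    C, theta_cm, theta_cp = Fraction(C), Fraction(theta_cm), Fraction(theta_cp)
    alpha = load_fractions(seq, C, theta_cm)
    if min(alpha) <= 0:
        return None
    return alpha[0] * (seq[0] + C) + theta_cp + theta_cm


def best_schedule(E, C, theta_cm, theta_cp):
    """Smallest processing time over all ordered selections of processors."""
    best = None
    for m in range(1, len(E) + 1):
        for seq in permutations(E, m):
            T = sequence_time(seq, C, theta_cm, theta_cp)
            if T is not None and (best is None or T < best[0]):
                best = (T, seq)
    return best


print(best_schedule([1, 2, 3], 1, "1/2", 0))
print(sequence_time((3, 2, 1), 1, "1/2", 0))
```

In this illustrative instance the slowest-first order can use all three processors, yet its processing time is larger than that of the fastest-first order restricted to two processors.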
