The effect of start-up delays in scheduling divisible loads on bus networks: An alternate approach  by Suresh, S. et al.
Computers and Mathematics with Applications 46 (2003) 1545-1557
www.elsevier.com/locate/camwa
The Effect of Start-Up Delays in 
Scheduling Divisible Loads on Bus 
Networks: An Alternate Approach 
S. SURESH, V. MANI* AND S. N. OMKAR 
Department of Aerospace Engineering, Indian Institute of Science 
Bangalore 560 012, India 
mani@aero.iisc.ernet.in 
(Received December 2001; accepted January 2003) 
Abstract-In this paper, scheduling of divisible loads in a bus network is considered. The objective
is to minimize the processing time by including the overhead component due to start-up time
that could degrade the performance of the system, in addition to the inherent communication and
computation delays. These overheads are considered to be constant additive factors to the
communication and computation components. A closed-form expression for optimal processing time is
derived. Using this closed-form expression, this paper analytically proves significant results regarding
the optimal sequence of load distribution and optimal number of processors. Numerical examples are
presented to illustrate the analysis. © 2003 Elsevier Ltd. All rights reserved.
Keywords-Divisible loads, Communication delay, Processing time, Optimal sequence, Bus networks.
NOMENCLATURE 
$\alpha_i$      load fraction assigned to processor $p_i$
$w_i$           the inverse of the computation speed of processor $p_i$
$T_{cp}$        time taken to process a unit load by the standard processor
$\theta_{cp}$   a constant additive computation overhead component that includes the
                sum of all delays associated with the computation process
$z$             the inverse of the communication speed of the link
$T_{cm}$        time taken to transmit a unit load by the communication link
$\theta_{cm}$   a constant additive communication overhead component that includes the
                sum of all delays associated with the communication process
*Author to whom all correspondence should be addressed.
doi: 10.1016/S0898-1221(03)00382-1
1. INTRODUCTION
The scheduling problems that arise in optimally distributing jobs among a set of available
processors, with the objective of minimizing the processing time, form an important area of research
in computing and communication. The prime objective in this area of research is to design efficient
scheduling algorithms that minimize the total processing time [1,2]. The domain of scheduling
divisible loads in multiprocessor systems was started in 1988 and has stimulated considerable
interest among researchers and engineers.
A divisible load can be divided into any number of fractions, and these fractions can be processed
independently on the processors, as there are no precedence relationships. The problem of scheduling
divisible loads in a linear network incorporating the associated communication delays was first introduced
in [3]. In fact, this is the first paper in the area of divisible load scheduling. In that paper, the
timing diagram representation of the load distribution process and the recursive load distribution
equations were introduced. The ideas from that paper were extended to scheduling divisible loads
in tree networks and bus networks in [4,5]. In these studies, the optimal load fractions are
obtained by assuming that all the processors involved in the computation of the load stop computing
at the same time instant. In fact, this assumption has been shown to be a necessary and sufficient
condition to obtain optimal processing time in linear networks [6], using the concept of processor
equivalence, and an analytic proof for bus networks is given in [7]. However, it has been rigorously shown
that this condition is true only in a restricted sense [8], in the case of heterogeneous single-level
tree networks. A closed-form expression for the processing time for a single-level tree network is
presented in [9,10], and using this closed-form expression, the optimal sequence and optimal network
arrangement are obtained in [9]. For the case of homogeneous linear and tree networks, closed-form
expressions for the processing time and an asymptotic performance analysis are carried out
in [11,12]. A practical application of divisible load scheduling to matrix vector
products of very large size, presented in [13], shows the usefulness of the analysis.
In this paper, scheduling divisible loads in a bus network architecture is considered. It is possible
in practical data communication and computing situations to have overheads in communication
($\theta_{cm}$) and computation ($\theta_{cp}$). The communication overheads ($\theta_{cm}$) occur due to protocol
processing delays, unavailability of certain communication resources, queuing delays, etc.
[14,15]. Similarly, the computation overheads ($\theta_{cp}$) arise due to delay in extracting the data,
processor initialization, etc. [14,15]. These overheads are almost constant quantities and form
an additive component in the load distribution equations [14,15]. These overheads were considered
in the literature by some researchers for some specific cases [15-17], such as query processing and
image processing applications [15], and for different architectures [16,17]. In a recent study [18],
the effect of these overheads on the processing time is presented.
1.1. The Contribution of this Paper 
With these overhead factors in communication and computation, we first derive a closed-form
expression for the processing time. With our closed-form expression, we can obtain the processing
time directly. Using this closed-form expression, we obtain the optimal number of processors and
the optimal sequence of load distribution.
This paper is organized as follows. Section 2 presents the mathematical modeling and relevant
definitions. In Section 3, we present the closed-form expression for the processing time and
a comparison with the results obtained in the earlier study [18]. Section 4 presents the optimal
sequence of load distribution, and Section 5 presents the conclusions. Since this paper presents
an alternative approach to the problem dealt with in an earlier study [18], for convenience, we
follow the same notation used in the earlier study [18].
2. MATHEMATICAL MODELING AND DEFINITION 
The bus network architecture considered in this paper is shown in Figure 1. This network has
a dedicated bus controller unit (BCU) to distribute the entire load. The divisible load arrives at
the BCU; the BCU divides the load into $m$ fractions $\alpha_1, \alpha_2, \ldots, \alpha_m$ and distributes these load
fractions to the $m$ processors in a sequence, $p_1, p_2, \ldots, p_m$, one after another. The processors start
computing the load fractions immediately after receiving the load fractions. The objective here
is to find the optimal size of these load fractions $\alpha_1, \alpha_2, \ldots, \alpha_m$, such that the processing time is
a minimum. As in the earlier study [18], here we denote $w_i T_{cp}$ as $E_i$ and $z T_{cm}$ as $C$, and hence,
the communication time for the load fraction $\alpha_i$ is $\alpha_i C + \theta_{cm}$ and the computation time for the
load fraction $\alpha_i$ is $\alpha_i E_i + \theta_{cp}$. The timing diagram for the load distribution process is shown in
Figure 2.
Figure 1. Distributed bus architecture with m processors and a BCU.
Figure 2. Timing diagram for processing the divisible load with m processors.
1. The load distribution, denoted as $\alpha$, is defined as an $m$-tuple $(\alpha_1, \alpha_2, \ldots, \alpha_m)$ such that
$0 < \alpha_i \le 1$ and $\sum_{i=1}^{m} \alpha_i = 1$. The equation $\sum_{i=1}^{m} \alpha_i = 1$ is the normalization equation, and the space of
all possible load distributions is denoted as $\Gamma$.
2. The finish time of processor $p_i$, denoted as $T_i(\alpha, m)$, is the time difference between the instant
at which the $i$th processor stops computing and the time instant at which the BCU initiates the
load distribution process.
3. The processing time, denoted as $T(\alpha, m)$, is the time at which the entire load is processed,
i.e., $T(\alpha, m) = \max\{T_i(\alpha, m),\ i = 1, 2, \ldots, m\}$, where $T_i$ is the finish time for processor $p_i$.
4. The optimal processing time, denoted as $T^*(\alpha^*, m)$, is the minimum processing time to finish
the entire load, i.e., $T^*(\alpha^*, m) = \min_{\alpha \in \Gamma} \{T(\alpha, m)\}$.
In the literature [8], it has been rigorously proved that for the optimal processing time, all the 
processors involved in the computation of the processing load must stop computing at the same 
time instant. In this paper also, we use this optimality criterion. 
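As an illustration of Definitions 2 and 3, the following Python sketch computes the finish times and the processing time for a given load distribution on the bus. The function names `finish_times` and `processing_time` and the numerical values are assumptions made for this sketch only; the timing model (sequential communication on the bus, computation starting on receipt) follows the description above.

```python
from typing import List, Sequence


def finish_times(alpha: Sequence[float], E: Sequence[float], C: float,
                 theta_cm: float, theta_cp: float) -> List[float]:
    """Finish time T_i of each processor for a given load distribution.

    The BCU sends the fractions one after another. Sending alpha_i takes
    alpha_i*C + theta_cm and computing it takes alpha_i*E_i + theta_cp.
    """
    times = []
    clock = 0.0  # instant at which the bus becomes free again
    for a, e in zip(alpha, E):
        clock += a * C + theta_cm               # communication of this fraction
        times.append(clock + a * e + theta_cp)  # computation starts on receipt
    return times


def processing_time(alpha: Sequence[float], E: Sequence[float], C: float,
                    theta_cm: float, theta_cp: float) -> float:
    """Processing time T(alpha, m) = max_i T_i(alpha, m)."""
    return max(finish_times(alpha, E, C, theta_cm, theta_cp))


if __name__ == "__main__":
    # Two processors with an equal (non-optimal) split; illustrative values.
    print(finish_times([0.5, 0.5], [0.3, 0.2], 0.2, 0.001, 0.02))
```

With an equal split the two finish times differ; the optimality criterion used in this paper corresponds to choosing the fractions so that all entries returned by `finish_times` coincide.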
3. CLOSED-FORM EXPRESSION FOR THE PROCESSING TIME 
Now we shall derive a closed-form expression for the processing time. This is derived by assuming that
the sequence of load distribution is $p_1, p_2, \ldots, p_m$, in that order. This means that the BCU
distributes the load from processor $p_1$ to processor $p_m$, one after another. From the timing
diagram shown in Figure 2, the recursive equations for load distribution are
$$\alpha_i E_i + \theta_{cp} = \alpha_{i+1}(E_{i+1} + C) + \theta_{cm} + \theta_{cp}, \qquad i = 1, 2, \ldots, m-1. \tag{1}$$
Denoting $(E_{i+1} + C)/E_i = f_{i+1}$ and $\theta_{cm}/E_i = \beta_i$, for all $i = 1, 2, \ldots, m-1$, equation (1) can
be rewritten as
$$\alpha_i = \alpha_{i+1} f_{i+1} + \beta_i, \qquad i = 1, 2, \ldots, m-1. \tag{2}$$
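Now, we see from the above that there are $m - 1$ linear equations with $m$ variables, and together
with the normalization equation, we have $m$ equations. In the earlier study [18], these equations
are solved as follows, to obtain the individual load fractions. Each of the $\alpha_i$ in equation (2) is
expressed in terms of $\alpha_m$ as
$$\alpha_i = \alpha_m M_i + N_i, \qquad i = 1, 2, \ldots, m-1, \tag{3}$$
where
$$M_i = \prod_{j=i+1}^{m} f_j, \qquad i = 1, 2, \ldots, m-1,$$
$$N_i = \sum_{p=i}^{m-1} \beta_p \Bigl( \prod_{j=i+1}^{p} f_j \Bigr), \qquad i = 1, 2, \ldots, m-1,$$
(an empty product is taken as 1), and $\alpha_m$ is obtained from the normalization equation as
$$\alpha_m = \frac{1 - X(m)}{1 + \sum_{i=1}^{m-1} \prod_{j=i+1}^{m} f_j}, \tag{4}$$
where
$$X(m) = \sum_{i=1}^{m-1} \sum_{p=i}^{m-1} \beta_p \Bigl( \prod_{j=i+1}^{p} f_j \Bigr).$$
The expression for the processing time is obtained as
$$T(\alpha, m) = \alpha_1 (E_1 + C) + \theta_{cm} + \theta_{cp}. \tag{5}$$
Substituting the value of $\alpha_1$ in the above equation,
$$T(\alpha, m) = (\alpha_m M_1 + N_1)(E_1 + C) + \theta_{cm} + \theta_{cp}, \tag{6}$$
where $\alpha_m$, $M_1$, and $N_1$ are defined as above.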
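The solution procedure of equations (2)-(6) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [18]: the function name `load_fractions` is hypothetical, the list `E` holds $E_1, \ldots, E_m$ in distribution order, and the denominator used for $\alpha_m$ is the one that follows from the normalization equation.

```python
from math import prod
from typing import List, Sequence


def load_fractions(E: Sequence[float], C: float, theta_cm: float) -> List[float]:
    """Solve equations (2)-(4) for alpha_1, ..., alpha_m (returned 0-based)."""
    m = len(E)
    # f[j] is f_j = (E_j + C) / E_{j-1}, j = 2..m; beta[i] is theta_cm / E_i, i = 1..m-1
    f = {i + 1: (E[i] + C) / E[i - 1] for i in range(1, m)}
    beta = {i: theta_cm / E[i - 1] for i in range(1, m)}
    # M_i and N_i for i = 1..m-1 (math.prod of an empty range is 1)
    M = [prod(f[j] for j in range(i + 1, m + 1)) for i in range(1, m)]
    N = [sum(beta[p] * prod(f[j] for j in range(i + 1, p + 1))
             for p in range(i, m)) for i in range(1, m)]
    X = sum(N)                                  # X(m) is the sum of the N_i
    alpha_m = (1.0 - X) / (1.0 + sum(M))        # equation (4)
    return [alpha_m * Mi + Ni for Mi, Ni in zip(M, N)] + [alpha_m]  # equation (3)


if __name__ == "__main__":
    alpha = load_fractions([0.3, 0.2, 0.1], C=0.2, theta_cm=0.001)
    T = alpha[0] * (0.3 + 0.2) + 0.001 + 0.02   # equation (5) with theta_cp = 0.02
    print(alpha, T)
```

The fractions are meaningful only while $X(m) < 1$; otherwise `alpha_m` becomes negative, which is the situation discussed next.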
It is shown in [18] that with the inclusion of all these overheads, for optimal processing time,
it may not be necessary to use all the $m$ processors in the system. It is shown that there
exists a maximum number of processors $m^*$ that can be utilized with the given sequence of load
distribution. The necessary and sufficient condition for the existence of optimal processing time
using all the $m$ processors in a specific order is given by
$$X(m) = \sum_{i=1}^{m-1} \sum_{p=i}^{m-1} \beta_p \Bigl( \prod_{j=i+1}^{p} f_j \Bigr) < 1. \tag{7}$$
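Condition (7) is easy to evaluate numerically. The sketch below computes $X(m)$ for the first $m$ processors of a fixed sequence and returns the largest $m$ for which the condition holds; since every term of $X(m)$ is positive, $X(m)$ grows with $m$. The function names are hypothetical, and the values in the usage example are the speed parameters quoted from [18] later in this paper.

```python
from math import prod
from typing import Sequence


def X(E: Sequence[float], C: float, theta_cm: float, m: int) -> float:
    """Left-hand side of condition (7) for the first m processors of E."""
    f = {j: (E[j - 1] + C) / E[j - 2] for j in range(2, m + 1)}   # f_2..f_m
    beta = {p: theta_cm / E[p - 1] for p in range(1, m)}          # beta_1..beta_{m-1}
    return sum(beta[p] * prod(f[j] for j in range(i + 1, p + 1))
               for i in range(1, m) for p in range(i, m))


def max_processors(E: Sequence[float], C: float, theta_cm: float) -> int:
    """Largest m with X(m) < 1 for the given (fixed) distribution sequence."""
    m = 1
    while m < len(E) and X(E, C, theta_cm, m + 1) < 1.0:
        m += 1
    return m


if __name__ == "__main__":
    E = [0.3, 0.2, 0.1, 0.4, 0.6, 0.7, 0.8, 0.5, 0.9, 1.1, 1.3, 1.0, 0.6, 0.5, 0.3]
    for C in (0.4, 0.2):
        print(C, max_processors(E, C, theta_cm=0.001))
```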
Alternate Approach 
In our (alternate) approach, the value of $\alpha_1$ is obtained as follows. Express all the $\alpha_i$
($i = 1, 2, \ldots, m-1$) in terms of $\alpha_m$. Obtain the value of $\alpha_m$ using the normalization equation. Using
this value of $\alpha_m$, the value of $\alpha_1$ is obtained as
$$\alpha_1 = \frac{M_1 + Z(m)}{Y(m)}. \tag{8}$$
Since $\alpha_1$ is known, the other load fractions can be obtained from equation (2). Hence, the processing
time is
$$T(\alpha, m) = \frac{M_1 + Z(m)}{Y(m)} (E_1 + C) + \theta_{cm} + \theta_{cp}, \tag{9}$$
where
$$Y(m) = \sum_{i=1}^{m} \prod_{j=i+1}^{m} f_j, \qquad Z(m) = \sum_{i=1}^{m-1} \beta_i \Bigl( \prod_{j=1}^{i} f_j \Bigr) \Bigl( \sum_{k=i+1}^{m} \prod_{j=k+1}^{m} f_j \Bigr),$$
and the value of $f_1 = 1$. The processing time obtained in our approach is the same as the
processing time obtained in the earlier approach for a given sequence of load distribution. We
see that, in the processing time expression, $M_1$, $Z(m)$, and $Y(m)$ are functions of $m$, the number
of processors. Once $m$ is given, in our approach, $\alpha_1$ can be directly obtained, and hence, the
processing time also can be directly obtained. In the earlier study, when $X(m) < 1$, the processing
time is obtained as follows. First, the value of $\alpha_m$ is obtained, and then the value of $\alpha_1$ is obtained
using $\alpha_m$. In our approach, $\alpha_1$ is directly obtained. While obtaining the value of $\alpha_1$, the necessary
and sufficient condition for existence of a solution, $X(m) < 1$, is not considered. It is mentioned
in [18] that, for an $m$-processor system, there exists an $m^*$ (optimal number of processors) beyond
which an optimal solution ceases to exist. This is so because once this condition is not satisfied,
some of the load fractions will be negative. In our closed-form expression, this violation of the
necessary and sufficient condition is reflected as an increase in the value of $\alpha_1$. We will now
show that, using the closed-form expression obtained in our approach, we can easily prove all the
results obtained in the earlier study.
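The closed form (8) can be evaluated directly, as in the following Python sketch. The explicit sums used for $Y(m)$ and $Z(m)$ are an assumption of this sketch: they are the forms obtained by substituting (4) into (3) for $i = 1$, namely $Y(m) = \sum_{i=1}^{m}\prod_{j=i+1}^{m} f_j$ and $Z(m) = \sum_{i=1}^{m-1}\beta_i \bigl(\prod_{j=1}^{i} f_j\bigr)\bigl(\sum_{k=i+1}^{m}\prod_{j=k+1}^{m} f_j\bigr)$ with $f_1 = 1$. The function names are hypothetical.

```python
from math import prod
from typing import Sequence


def alpha1_closed_form(E: Sequence[float], C: float, theta_cm: float) -> float:
    """alpha_1 = (M_1 + Z(m)) / Y(m) for the sequence p_1, ..., p_m."""
    m = len(E)
    f = {1: 1.0}  # convention f_1 = 1
    f.update({j: (E[j - 1] + C) / E[j - 2] for j in range(2, m + 1)})
    beta = {i: theta_cm / E[i - 1] for i in range(1, m)}

    def tail(k: int) -> float:
        """Product f_{k+1} * ... * f_m (equals 1 when k = m)."""
        return prod(f[j] for j in range(k + 1, m + 1))

    M1 = tail(1)
    Y = sum(tail(i) for i in range(1, m + 1))
    Z = sum(beta[i]
            * prod(f[j] for j in range(1, i + 1))
            * sum(tail(k) for k in range(i + 1, m + 1))
            for i in range(1, m))
    return (M1 + Z) / Y


def processing_time(E: Sequence[float], C: float,
                    theta_cm: float, theta_cp: float) -> float:
    """Processing time alpha_1 (E_1 + C) + theta_cm + theta_cp."""
    return alpha1_closed_form(E, C, theta_cm) * (E[0] + C) + theta_cm + theta_cp


if __name__ == "__main__":
    print(processing_time([0.3, 0.2, 0.1, 0.4], C=0.2, theta_cm=0.001, theta_cp=0.02))
```

No intermediate value of $\alpha_m$ is needed here, which is the point of the alternate approach: $\alpha_1$, and hence the processing time, is obtained in one step once $m$ is fixed.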
First, we show, in our approach, the existence of an optimal number of processors as obtained
in [18]. For this purpose, we will write the value of $\alpha_1$ obtained in our approach in the following
manner:
$$\alpha_1 = \frac{M_1(m) + Z(m)}{Y(m)}. \tag{10}$$
$M_1(m)$ is the value of $M_1$ with $m$ processors. We know that $M_1(m)/Y(m)$ decreases with
increasing $m$, and $Z(m)/Y(m)$ increases with increasing $m$. Hence, there is an optimal number
of processors $m^*$, such that up to the value of $m^*$, the value of $\alpha_1$ will be decreasing, and after
that $m^*$ the processing time increases in our approach. It is sufficient to prove the behavior of $\alpha_1$
to study the behavior of the processing time. Hence, $\alpha_1(m^*)$ has the following properties:
$$\alpha_1(m^*) < \alpha_1(m^* - 1), \qquad \alpha_1(m^*) < \alpha_1(m^* + 1). \tag{11}$$
From the earlier study, we see that the necessary and sufficient condition for the existence of
optimal processing time with $m^*$ processors in a specific sequence is given by $X(m^*) < 1$. We
will now show that the $m^*$ obtained in our approach also satisfies this condition.
LEMMA 1. Consider an $m$-processor system with a fixed sequence of load distribution $p_1, p_2,
\ldots, p_m$. Also consider an $(m-1)$-processor system comprising processors $p_1, p_2, \ldots, p_{m-1}$,
following the same sequence of load distribution as the above-mentioned $m$-processor system.
Let the value of $\alpha_1$ for these two systems be $\alpha_1(m)$ and $\alpha_1(m-1)$, respectively. In this situation,
$\alpha_1(m) < \alpha_1(m-1)$ only when $X(m) < 1$. Or in other words, the processing time for the
$m$-processor system is less than the processing time for the $(m-1)$-processor system only when
$X(m) < 1$.
PROOF. The values of $\alpha_1(m)$ and $\alpha_1(m-1)$ are as follows:
$$\alpha_1(m) = \frac{M_1(m) + Z(m)}{Y(m)}, \tag{12}$$
$$\alpha_1(m-1) = \frac{M_1(m-1) + Z(m-1)}{Y(m-1)}. \tag{13}$$
We have to obtain the condition under which $\alpha_1(m) - \alpha_1(m-1) < 0$. Or in other words,
$$\frac{1}{D} \bigl[ (M_1(m) + Z(m))\, Y(m-1) - (M_1(m-1) + Z(m-1))\, Y(m) \bigr] < 0, \tag{14}$$
where $D = Y(m) Y(m-1)$. The above expression reduces to
$$Z(m) Y(m-1) - Z(m-1) Y(m) < M_1(m-1) Y(m) - M_1(m) Y(m-1). \tag{15}$$
This can be further simplified as
$$\beta_1 + \beta_2 Y(2) + \cdots + \beta_{m-1} Y(m-1) < 1. \tag{16}$$
The above condition is the same as $X(m) < 1$. Hence, $\alpha_1(m) < \alpha_1(m-1)$ only when $X(m) < 1$.
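One way to fill in the step from (15) to (16) is sketched below. It assumes the forms $Y(m) = \sum_{i=1}^{m}\prod_{j=i+1}^{m} f_j$ and $Z(m) = \sum_{p=1}^{m-1}\beta_p G_p \bigl(\sum_{k=p+1}^{m}\prod_{j=k+1}^{m} f_j\bigr)$, which are what substituting (4) into (3) gives for (8); the shorthand $G_p = \prod_{j=2}^{p} f_j$ (with $G_1 = 1$) is introduced only for this sketch.

```latex
\begin{align*}
&\text{Recurrences in } m:\\
&\quad M_1(m) = f_m\,M_1(m-1), \qquad Y(m) = 1 + f_m\,Y(m-1), \qquad
  Z(m) = f_m\,Z(m-1) + \sum_{p=1}^{m-1}\beta_p G_p .\\[4pt]
&\text{Right-hand side of (15):}\\
&\quad M_1(m-1)\,Y(m) - M_1(m)\,Y(m-1) = M_1(m-1).\\[4pt]
&\text{Left-hand side of (15):}\\
&\quad Z(m)\,Y(m-1) - Z(m-1)\,Y(m)
   = Y(m-1)\sum_{p=1}^{m-1}\beta_p G_p - Z(m-1)
   = M_1(m-1)\sum_{p=1}^{m-1}\beta_p\,Y(p).\\[4pt]
&\text{Dividing by } M_1(m-1) > 0:\\
&\quad \beta_1 + \beta_2\,Y(2) + \cdots + \beta_{m-1}\,Y(m-1) < 1, \qquad Y(1) = 1 .
\end{align*}
```

Because $D > 0$ and $M_1(m-1) > 0$, the sign of $\alpha_1(m) - \alpha_1(m-1)$ is the sign of $X(m) - 1$, so the same computation also covers the comparison made in Lemma 2.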
In the earlier study, it is shown that beyond this optimal number of processors $m^*$, an optimal
solution ceases to exist. This is because $X(m^* + k) > 1$, for $k = 1, 2, \ldots$, which means that
some of the load fractions will be negative. In our approach, this fact is obtained as an
increase in the processing time.
LEMMA 2. Consider an $(m+1)$-processor system with a fixed sequence of load distribution
$p_1, p_2, \ldots, p_{m+1}$. Also consider an $m$-processor system comprising processors $p_1, p_2, \ldots, p_m$,
following the same sequence of load distribution as the above-mentioned $(m+1)$-processor system. Let the
value of $\alpha_1$ for these two systems be $\alpha_1(m+1)$ and $\alpha_1(m)$, respectively. In this situation,
$\alpha_1(m) < \alpha_1(m+1)$ only when $X(m+1) > 1$. Or in other words, the processing time for the
$(m+1)$-processor system is more than the processing time of the $m$-processor system only when
$X(m+1) > 1$.
PROOF. The values of $\alpha_1(m+1)$ and $\alpha_1(m)$ are as follows:
$$\alpha_1(m+1) = \frac{M_1(m+1) + Z(m+1)}{Y(m+1)}, \tag{17}$$
$$\alpha_1(m) = \frac{M_1(m) + Z(m)}{Y(m)}. \tag{18}$$
We have to obtain the condition under which $\alpha_1(m) - \alpha_1(m+1) < 0$. Or in other words,
$$\frac{1}{D} \bigl[ \{M_1(m) + Z(m)\}\, Y(m+1) - \{M_1(m+1) + Z(m+1)\}\, Y(m) \bigr] < 0, \tag{19}$$
where $D = Y(m) Y(m+1)$. The above expression reduces to
$$\{Z(m) Y(m+1) - Z(m+1) Y(m)\} < \{M_1(m+1) Y(m) - M_1(m) Y(m+1)\}. \tag{20}$$
This can be further simplified as
$$\beta_1 + \beta_2 Y(2) + \cdots + \beta_m Y(m) > 1. \tag{21}$$
This condition is the same as $X(m+1) > 1$. Hence, $\alpha_1(m) < \alpha_1(m+1)$ only when $X(m+1) > 1$.
From the above two lemmas, we can see that our closed-form expression for $\alpha_1$ (and hence, the
processing time) has a minimum for an optimal number of processors $m^*$ such that
$$\alpha_1(m^*) < \alpha_1(m^* - 1), \qquad \alpha_1(m^*) < \alpha_1(m^* + 1). \tag{22}$$
The processing time decreases with an increase in the number of processors up to $m^*$, and then the
processing time increases with additional processors. Note that the necessary and sufficient condition
given in [18] for the existence of an optimal processing time is satisfied in our approach. So
the load fractions assigned to the processors in our approach will be the same as the load fractions
assigned to the processors in the earlier approach. Hence, we can say that $m^*$ is the optimal
number of processors only when $\alpha_1(m^*) < \alpha_1(m^* - 1)$ and $\alpha_1(m^*) < \alpha_1(m^* + 1)$.
Homogeneous System 
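As a special case, for a homogeneous system, $w_i = w$, and hence, $E_i = E$, for $i = 1, 2, \ldots, m$,
so that $f_i = f$ and $\beta_i = \beta$. We will show the condition on $\beta$ under which $m$ is the optimal
number of processors. For this, it is sufficient to consider the value of $\alpha_1$:
$$\alpha_1(m-1) = \frac{f^{m-2} + \beta \bigl( 1 + 2f + 3f^2 + \cdots + (m-2) f^{m-3} \bigr)}{1 + f + f^2 + \cdots + f^{m-2}}, \tag{23}$$
$$\alpha_1(m) = \frac{f^{m-1} + \beta \bigl( 1 + 2f + 3f^2 + \cdots + (m-1) f^{m-2} \bigr)}{1 + f + f^2 + \cdots + f^{m-1}}, \tag{24}$$
$$\alpha_1(m+1) = \frac{f^{m} + \beta \bigl( 1 + 2f + 3f^2 + \cdots + m f^{m-1} \bigr)}{1 + f + f^2 + \cdots + f^{m}}. \tag{25}$$
First, we will obtain the condition on $\beta$ for which $\alpha_1(m) < \alpha_1(m-1)$. Or in other words,
$$\frac{1}{D} \Bigl[ \bigl\{ f^{m-1} + \beta ( 1 + 2f + 3f^2 + \cdots + (m-1) f^{m-2} ) \bigr\} ( 1 + f + f^2 + \cdots + f^{m-2} )
 - \bigl\{ f^{m-2} + \beta ( 1 + 2f + \cdots + (m-2) f^{m-3} ) \bigr\} ( 1 + f + f^2 + \cdots + f^{m-1} ) \Bigr] < 0, \tag{26}$$
where $D$ is the product of the denominators of $\alpha_1(m)$ and $\alpha_1(m-1)$. Following the same manner as
in Lemma 1, this reduces to
$$\beta \bigl[ (m-1) + (m-2) f + (m-3) f^2 + \cdots + f^{m-2} \bigr] < 1. \tag{27}$$
Now we will obtain the condition on $\beta$ for which $\alpha_1(m) < \alpha_1(m+1)$, i.e.,
$$\frac{1}{D} \Bigl[ \bigl\{ f^{m-1} + \beta ( 1 + 2f + 3f^2 + \cdots + (m-1) f^{m-2} ) \bigr\} ( 1 + f + f^2 + \cdots + f^{m} )
 - \bigl\{ f^{m} + \beta ( 1 + 2f + 3f^2 + \cdots + m f^{m-1} ) \bigr\} ( 1 + f + f^2 + \cdots + f^{m-1} ) \Bigr] < 0. \tag{28}$$
Here $D$ is the product of the denominators of $\alpha_1(m)$ and $\alpha_1(m+1)$. This expression reduces to
$$\beta \bigl( m + (m-1) f + \cdots + f^{m-1} \bigr) > 1. \tag{29}$$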
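From the above two conditions, (27) and (29), we can say that $m$ is the optimal number of processors only when
$$\beta < \frac{1}{(m-1) + (m-2) f + \cdots + f^{m-2}}, \tag{30}$$
$$\beta > \frac{1}{m + (m-1) f + \cdots + f^{m-1}}. \tag{31}$$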
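For the homogeneous case, conditions (27) and (29) give a direct way of locating the optimal number of processors, sketched below in Python. The function names and the cap `m_max` are hypothetical; the values $f = 1.4$ and $\beta = 0.1$ in the usage example correspond to $E = 1$, $C = 0.4$, and $\theta_{cm} = 0.1$, and are used here only as an illustration.

```python
def lhs_27(f: float, beta: float, m: int) -> float:
    """beta * [(m-1) + (m-2) f + ... + f^(m-2)], the left-hand side of (27)."""
    return beta * sum((m - 1 - k) * f ** k for k in range(m - 1))


def alpha1_homogeneous(f: float, beta: float, m: int) -> float:
    """Equation (24): alpha_1 for m identical processors."""
    num = f ** (m - 1) + beta * sum((k + 1) * f ** k for k in range(m - 1))
    den = sum(f ** k for k in range(m))
    return num / den


def optimal_m_homogeneous(f: float, beta: float, m_max: int = 1000) -> int:
    """Largest m satisfying (27); condition (29) for m is (27) failing for m + 1."""
    m = 1
    while m < m_max and lhs_27(f, beta, m + 1) < 1.0:
        m += 1
    return m


if __name__ == "__main__":
    m_star = optimal_m_homogeneous(1.4, 0.1)
    print(m_star, [round(alpha1_homogeneous(1.4, 0.1, m), 5) for m in range(1, 8)])
```

The printed list of $\alpha_1(m)$ values can be used to confirm that the minimum lies at the value of $m$ returned by `optimal_m_homogeneous`.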
We now present the numerical results obtained using the speed parameters given in [18]. In
our approach also, the processing time is given by
$$T(\alpha, m) = \alpha_1 (E_1 + C) + \theta_{cm} + \theta_{cp}, \tag{32}$$
as given in [18]. In our approach, it is sufficient to consider the behaviour of $\alpha_1$ to study the
behaviour of the processing time. We know that
$$\alpha_1(m) = \frac{M_1(m) + Z(m)}{Y(m)}. \tag{33}$$
It can be seen that $\alpha_1(m)$ has two components:
(i) $M_1(m)/Y(m)$-component of $\alpha_1(m)$ without overheads;
(ii) $Z(m)/Y(m)$-component of $\alpha_1(m)$ due to overheads.
We know that $M_1(m)/Y(m)$ decreases with increasing $m$ and $Z(m)/Y(m)$ increases with
increasing $m$. Similarly, the processing time also has two components. With the speed parameters
given in [18], the processing times obtained for $C = 0.4$ and $C = 0.2$ are given in Table 1. From
Table 1, we can see that for $C = 0.4$, the processing time decreases up to the optimal number of
processors ($m^* = 6$) and then starts increasing. For the case $C = 0.2$, the processing time
decreases up to the optimal number of processors ($m^* = 10$), and then increases. As expected,
the optimal number of processors is the same as obtained in [18]. In Figure 3, the behaviour of
processing time with the number of processors is shown for $C = 0.2$. In Figure 3, the component
of processing time without the overhead, the processing time component because of the overhead,
and the total of the two components are shown. Because of the numerical values of $E_i$ and $\theta_{cm}$, the
increase in the overhead components is very small with the increase in processors, and hence, the
decrease and increase in the processing time before and after the optimal number of processors is
small. Figure 4 presents the processing time results for the homogeneous network with numerical
values $E_i = 1$, for $i = 1, 2, \ldots, m$, $C = 0.4$, $\theta_{cm} = 0.1$, and $\theta_{cp} = 0.02$. In this figure, we can see
that the increase in overhead components is not small (as in the heterogeneous case), and hence,
the behaviour of the processing time before and after the optimal number of processors is clearer.

Table 1. Processing time with number of processors.
Number of Processors    Processing Time (C = 0.4)    Processing Time (C = 0.2)
 1                      0.7370000                    0.521000
 2                      0.4884439                    0.307428
 3                      0.4345488                    0.244888
 4                      0.4291981                    0.237428
 5                      0.4275042                    0.234157
 6                      0.426954                     0.232272
 7                      0.4269693                    0.231183
 8                                                   0.230253
 9                                                   0.230021

Speed parameters from [18].
$\theta_{cp}$  $\theta_{cm}$  $E_1$  $E_2$  $E_3$  $E_4$  $E_5$  $E_6$  $E_7$  $E_8$  $E_9$  $E_{10}$  $E_{11}$  $E_{12}$  $E_{13}$  $E_{14}$  $E_{15}$
0.02   0.001  0.3  0.2  0.1  0.4  0.6  0.7  0.8  0.5  0.9  1.1  1.3  1.0  0.6  0.5  0.3

[Figure 3: heterogeneous processors, $C = 0.2$, $\theta_{cp} = 0.02$; curves labelled $M_1(m)/Y(m)$ and $Z(m)/Y(m)$ plotted against the number of processors; $m^*$ marks the optimal number of processors.]
[Figure 4 panel: $\theta_{cm} = 0.1$, $\theta_{cp} = 0.02$; horizontal axis: number of processors.]
Figure 4. Variation of processing time with number of processors (homogeneous
case).
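The entries of Table 1 can be recomputed with a short Python sketch. It does not use the printed closed form; instead it writes $\alpha_i = u_i \alpha_m + v_i$, propagates $u_i$ and $v_i$ backwards with recursion (2), and normalizes, which yields the same $\alpha_1$. The function names are hypothetical, and the parameter values are those listed with Table 1.

```python
from typing import Sequence

E_ALL = [0.3, 0.2, 0.1, 0.4, 0.6, 0.7, 0.8, 0.5, 0.9, 1.1, 1.3, 1.0, 0.6, 0.5, 0.3]
THETA_CP, THETA_CM = 0.02, 0.001


def processing_time(E: Sequence[float], C: float,
                    theta_cm: float, theta_cp: float) -> float:
    """T = alpha_1 (E_1 + C) + theta_cm + theta_cp for the sequence given by E."""
    u, v = [1.0], [0.0]                      # alpha_m = 1 * alpha_m + 0
    for i in range(len(E) - 1, 0, -1):       # i = m-1, ..., 1 (1-based index)
        f_next = (E[i] + C) / E[i - 1]       # f_{i+1}
        beta_i = theta_cm / E[i - 1]
        u.append(u[-1] * f_next)             # alpha_i = alpha_{i+1} f_{i+1} + beta_i
        v.append(v[-1] * f_next + beta_i)
    alpha_m = (1.0 - sum(v)) / sum(u)        # normalization equation
    alpha_1 = u[-1] * alpha_m + v[-1]
    return alpha_1 * (E[0] + C) + theta_cm + theta_cp


def optimal_m(E: Sequence[float], C: float, theta_cm: float, theta_cp: float) -> int:
    """Number of processors (prefix of the sequence) with the smallest T."""
    return min(range(1, len(E) + 1),
               key=lambda m: processing_time(E[:m], C, theta_cm, theta_cp))


if __name__ == "__main__":
    for C in (0.4, 0.2):
        print("C =", C, "m* =", optimal_m(E_ALL, C, THETA_CM, THETA_CP))
        for m in range(1, len(E_ALL) + 1):
            print(m, round(processing_time(E_ALL[:m], C, THETA_CM, THETA_CP), 7))
```

Beyond $m^*$ the computed $\alpha_m$ is negative, so the values printed for $m > m^*$ illustrate the increase of the closed-form expression rather than a realizable load distribution.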
It is important to note here the following: we are not using this closed-form expression to
obtain the load fractions beyond the optimal number of processors $m^*$. It is mentioned in [18]
that, beyond this $m^*$, the optimal solution ceases to exist. This is because some of the load
fractions will be negative. The fact that some of the load fractions are negative is reflected in
our approach as an increase in the value of $\alpha_1$.
4. CONCEPT OF SEQUENCING 
The advantage of our closed-form expression is that it can be directly used to obtain the
optimal sequence of load distribution. For the sake of clarity, we will first illustrate the optimal
sequence for the case with $m = 3$ and then generalize the result. We also assume that $X(3) < 1$.
The value of $\alpha_1$, for a given sequence of load distribution, is
$$\alpha_1 = \frac{f_3 f_2 + \beta_1 (1 + f_3) + \beta_2 f_2}{1 + f_3 + f_3 f_2}. \tag{34}$$
We will rewrite the above $\alpha_1$ expression in terms of $E_i$ ($i = 1, 2, 3$) and $\theta_{cm}$ as
$$\alpha_1 = \frac{(E_3 + C)(E_2 + C) + \theta_{cm} (2E_2 + E_3 + 2C)}{E_1 E_2 + E_1 (E_3 + C) + (E_3 + C)(E_2 + C)}. \tag{35}$$
Note that, in the above expression, the sequence of load distribution is $(p_1, p_2, p_3)$, i.e., the BCU
first sends the load fraction to processor $p_1$ (speed $E_1$), next to processor $p_2$ (speed $E_2$), and last,
to processor $p_3$ (speed $E_3$). Let the BCU change the sequence of load distribution to $(p_1, p_3, p_2)$,
i.e., first send the load fraction to processor $p_1$ (speed $E_1$), next to processor $p_3$ (speed $E_3$), and
last, to processor $p_2$ (speed $E_2$). We will denote the value of $\alpha_1$ for this sequence as $\alpha_1'$. Note
that $\alpha_1'$ can be obtained by interchanging $E_2$ and $E_3$ in the earlier expression and is obtained as
$$\alpha_1' = \frac{(E_2 + C)(E_3 + C) + \theta_{cm} (2E_3 + E_2 + 2C)}{E_1 E_3 + E_1 (E_2 + C) + (E_2 + C)(E_3 + C)}. \tag{36}$$
We have to find the condition for which $\alpha_1 \le \alpha_1'$. The denominators of $\alpha_1$ and $\alpha_1'$ are the
same. Also, the first terms in the numerators of $\alpha_1$ and $\alpha_1'$ are the same. Hence,
$$\alpha_1 - \alpha_1' = \frac{\theta_{cm}}{D} \{ 2E_2 + E_3 + 2C - 2E_3 - E_2 - 2C \} = \frac{\theta_{cm} (E_2 - E_3)}{D}, \tag{37}$$
where $D$ is the denominator of $\alpha_1$ or $\alpha_1'$. From this, we can say that the processing time for the
sequence $(p_1, p_2, p_3)$ is less than or equal to the processing time for the sequence $(p_1, p_3, p_2)$ only
when $E_2$ is less than or equal to $E_3$.
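The three-processor comparison can be checked numerically with a few lines of Python. The function name `alpha1_three` and the numerical values are illustrative assumptions; the formula is the three-processor expression for $\alpha_1$ in terms of $E_1$, $E_2$, $E_3$, $C$, and $\theta_{cm}$ given above.

```python
def alpha1_three(E1: float, E2: float, E3: float, C: float, theta_cm: float) -> float:
    """alpha_1 for the distribution sequence (p1, p2, p3), equation (35)."""
    num = (E3 + C) * (E2 + C) + theta_cm * (2 * E2 + E3 + 2 * C)
    den = E1 * E2 + E1 * (E3 + C) + (E3 + C) * (E2 + C)
    return num / den


if __name__ == "__main__":
    E1, E2, E3, C, theta_cm = 1.0, 2.0, 3.0, 1.0, 0.1
    a = alpha1_three(E1, E2, E3, C, theta_cm)        # sequence (p1, p2, p3)
    a_swapped = alpha1_three(E1, E3, E2, C, theta_cm)  # sequence (p1, p3, p2)
    D = E1 * E2 + E1 * (E3 + C) + (E3 + C) * (E2 + C)
    # The difference should equal theta_cm (E2 - E3) / D, as in equation (37).
    print(a - a_swapped, theta_cm * (E2 - E3) / D)
```

Since the first processor is the same in both sequences, a smaller $\alpha_1$ directly means a smaller processing time.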
Generalization 
For $m$ processors, consider that the BCU distributes the load fractions to the processors in the
following sequence: $(p_1, p_2, p_3, \ldots, p_i, p_{i+1}, \ldots, p_m)$. Let $X(m) < 1$. The value of $\alpha_1$ for this sequence,
denoted by $\alpha_1(m)$, is
$$\alpha_1(m) = \frac{M_1(m) + Z(m)}{Y(m)}. \tag{38}$$
Consider another sequence of load distribution by the BCU to the processors as $(p_1, p_2, p_3, \ldots,
p_{i+1}, p_i, \ldots, p_m)$. Let $X(m) < 1$. The value of $\alpha_1$ for this load distribution, denoted as $\alpha_1'(m)$,
is
$$\alpha_1'(m) = \frac{M_1'(m) + Z'(m)}{Y'(m)}. \tag{39}$$
$\alpha_1'(m)$ can be obtained by interchanging $E_i$ and $E_{i+1}$ in $\alpha_1(m)$. Because of this interchange, only
$f_{i+2}$, $f_{i+1}$, $f_i$, $\beta_i$, $\beta_{i+1}$ will change. The other quantities will not change. Note that, because
of this interchange, $M_1(m)$ and $M_1'(m)$ are equal, and $Y(m)$ and $Y'(m)$ are also equal, i.e.,
$$M_1(m) = M_1'(m), \qquad Y(m) = Y'(m). \tag{40}$$
This fact can be verified from the optimal sequence Lemma 7.1 given in [8] for a single-level
tree network with $r = 1$. In Lemma 7.1, $r = 1$ implies that link speeds are the same, as in the
case of the bus network, and this is also shown to be true in Theorem 7.2 for a single-level tree
network given in [8].
Now we have to find the condition for which $\alpha_1(m) \le \alpha_1'(m)$. This is the same as finding
the condition for which $Z(m) \le Z'(m)$. We know that $Z(m)$ and $Z'(m)$ are functions of $\beta_1,
\beta_2, \ldots, \beta_{m-1}$. So we consider this term by term. The $\beta_1$ terms in $Z(m)$ are
$$\frac{\theta_{cm}}{E_1} \bigl[ 1 + f_m + f_m f_{m-1} + \cdots + f_m f_{m-1} \cdots f_{i+2} + f_m f_{m-1} \cdots f_{i+2} f_{i+1}
 + f_m f_{m-1} \cdots f_{i+2} f_{i+1} f_i + \cdots + f_m f_{m-1} \cdots f_4 f_3 \bigr]. \tag{41}$$
When we interchange $E_i$ and $E_{i+1}$ in the above expression, only $f_{i+2}$, $f_{i+1}$, $f_i$ will change. The
changed values are denoted as $g_{i+2}$, $g_{i+1}$, and $g_i$, defined as follows:
$$g_{i+2} = \frac{E_{i+2} + C}{E_i}, \qquad g_{i+1} = \frac{E_i + C}{E_{i+1}}, \qquad g_i = \frac{E_{i+1} + C}{E_{i-1}}. \tag{42}$$
The $\beta_1$ terms in $Z'(m)$ are obtained by replacing $f_{i+2}$ by $g_{i+2}$, $f_{i+1}$ by $g_{i+1}$, and $f_i$ by $g_i$ in the $\beta_1$ terms
of $Z(m)$, which gives
$$\frac{\theta_{cm}}{E_1} \bigl[ 1 + f_m + f_m f_{m-1} + \cdots + f_m f_{m-1} \cdots g_{i+2} + f_m f_{m-1} \cdots g_{i+2} g_{i+1}
 + f_m f_{m-1} \cdots g_{i+2} g_{i+1} g_i + \cdots + f_m f_{m-1} \cdots g_{i+2} g_{i+1} g_i \cdots f_3 \bigr]. \tag{43}$$
Note that $f_{i+2} f_{i+1} f_i = g_{i+2} g_{i+1} g_i$. Hence, from the $\beta_1$ terms in $Z(m)$ and $Z'(m)$, we get the
contribution of the $\beta_1$ terms in $Z(m) - Z'(m)$ as
$$\frac{\theta_{cm}}{E_1} \bigl[ f_m f_{m-1} \cdots f_{i+3} \bigl( f_{i+2} + f_{i+2} f_{i+1} - g_{i+2} - g_{i+2} g_{i+1} \bigr) \bigr]. \tag{44}$$
The value of the $\beta_1$ terms in $Z(m) - Z'(m)$ is zero since
$$f_{i+2} + f_{i+2} f_{i+1} - g_{i+2} - g_{i+2} g_{i+1} = 0. \tag{45}$$
In a similar way, it can be easily shown that all $\beta_j$ terms ($j = 1, 2, \ldots, m-1$, and $j \ne i$) vanish,
except the $\beta_i$ terms, in the expression $Z(m) - Z'(m)$. Or in other words, only the $\beta_i$ terms in $Z(m)$ and
$Z'(m)$ will have a nonzero value in $Z(m) - Z'(m)$. Hence, $Z(m) - Z'(m)$ is obtained as
$$Z(m) - Z'(m) = K (E_i - E_{i+1}), \tag{46}$$
where
$$K = \frac{E_m E_{m-1} \cdots E_{i+2}\, (E_{i-1} + C)(E_{i-2} + C) \cdots (E_2 + C)}{E_{m-1} E_{m-2} \cdots E_1}\, \theta_{cm}.$$
Hence, $\alpha_1(m) \le \alpha_1'(m)$ only when $E_i \le E_{i+1}$. Based on this generalization, we state the following
lemma.
LEMMA 3. The processing time for the sequence $(p_1, p_2, \ldots, p_i, p_{i+1}, \ldots, p_m)$ is less than or equal
to the processing time for the sequence $(p_1, p_2, \ldots, p_{i+1}, p_i, \ldots, p_m)$ only when $E_i \le E_{i+1}$.
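The interchange argument behind Lemma 3 can be checked numerically. The Python sketch below evaluates $Z(m)$ before and after swapping two adjacent interior processors and compares the difference with $K(E_i - E_{i+1})$. Two assumptions are made here and should be read as such: $Z(m)$ is taken as $\sum_{p=1}^{m-1}\beta_p\bigl(\prod_{j=2}^{p} f_j\bigr)\bigl(\sum_{k=p+1}^{m}\prod_{j=k+1}^{m} f_j\bigr)$, and $K$ is used in its reduced form $\theta_{cm} E_m (E_2 + C)\cdots(E_{i-1} + C)/(E_1 E_2 \cdots E_{i+1})$, for $2 \le i \le m-2$. All function names are hypothetical.

```python
from math import prod
from typing import List, Sequence


def Z(E: Sequence[float], C: float, theta_cm: float) -> float:
    """Overhead term Z(m) of the closed form, for the sequence given by E."""
    m = len(E)
    f = {j: (E[j - 1] + C) / E[j - 2] for j in range(2, m + 1)}
    total = 0.0
    for p in range(1, m):
        head = prod(f[j] for j in range(2, p + 1))
        tail_sum = sum(prod(f[j] for j in range(k + 1, m + 1))
                       for k in range(p + 1, m + 1))
        total += theta_cm / E[p - 1] * head * tail_sum
    return total


def K(E: Sequence[float], C: float, theta_cm: float, i: int) -> float:
    """Coefficient multiplying (E_i - E_{i+1}) when p_i and p_{i+1} are swapped."""
    m = len(E)
    num = theta_cm * E[m - 1] * prod(E[j - 1] + C for j in range(2, i))
    den = prod(E[j - 1] for j in range(1, i + 2))
    return num / den


def swapped(E: Sequence[float], i: int) -> List[float]:
    """Copy of E with E_i and E_{i+1} interchanged (1-based index i)."""
    out = list(E)
    out[i - 1], out[i] = out[i], out[i - 1]
    return out


if __name__ == "__main__":
    E, C, theta_cm = [0.3, 0.2, 0.1, 0.4, 0.6], 0.2, 0.001
    for i in (2, 3):
        lhs = Z(E, C, theta_cm) - Z(swapped(E, i), C, theta_cm)
        rhs = K(E, C, theta_cm, i) * (E[i - 1] - E[i])
        print(i, lhs, rhs)
```

Because $K > 0$ and $Y(m)$ is unchanged by the swap, the sign of $\alpha_1(m) - \alpha_1'(m)$ is the sign of $E_i - E_{i+1}$.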
The concept of sequencing proposes a method by which the minimum processing time can be
achieved. However, we have not included the first processor in the concept of sequencing, i.e., in
the interchange argument $i = 2, 3, \ldots, m-1$. Now, we will prove the speed condition on the first
processor. For this purpose, we consider a bus network with only two processors, $p_1$ (speed $E_1$)
and $p_2$ (speed $E_2$).
CASE (i). SEQUENCE OF LOAD DISTRIBUTION $(p_1, p_2)$. Let $T(\alpha, 2)$ be the processing time for
this sequence of load distribution, which is obtained as
$$T(\alpha, 2) = \frac{E_2 + C + \theta_{cm}}{E_1 + E_2 + C} (E_1 + C) + \theta_{cm} + \theta_{cp}. \tag{47}$$
CASE (ii). SEQUENCE OF LOAD DISTRIBUTION $(p_2, p_1)$. Let $T(\alpha', 2)$ be the processing time for
this sequence of load distribution, which is obtained as
$$T(\alpha', 2) = \frac{E_1 + C + \theta_{cm}}{E_1 + E_2 + C} (E_2 + C) + \theta_{cm} + \theta_{cp}. \tag{48}$$
The denominators of $T(\alpha', 2)$ and $T(\alpha, 2)$ are the same. We obtain $T(\alpha, 2) - T(\alpha', 2)$ as
$$T(\alpha, 2) - T(\alpha', 2) = \frac{1}{D} \{ (E_2 + C + \theta_{cm})(E_1 + C) - (E_1 + C + \theta_{cm})(E_2 + C) \}, \tag{49}$$
where $D$ is the denominator of $T(\alpha', 2)$ (or $T(\alpha, 2)$). This reduces to
$$T(\alpha, 2) - T(\alpha', 2) = \frac{\theta_{cm}}{E_1 + E_2 + C} (E_1 - E_2). \tag{50}$$
Hence, $T(\alpha, 2) \le T(\alpha', 2)$ only when $E_1 \le E_2$. From here, we can say that the first processor should
be the fastest. Note that, to find the speed condition of the first processor, we have to use the
processing time expression. For the speed condition of the other processors, it is sufficient to consider
the value of the $\alpha_1$ expression rather than the processing time expression. Though we have chosen
only two processors to prove the condition on the speed of the first processor, for an $m$-processor
system this can be easily proved in a similar fashion, as done for a single-level tree network in
Lemma 7.3 given in [8].
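The sequencing results can be explored by brute force for small systems. The Python sketch below evaluates the processing time for every ordering of a few processors and returns the best one. It is an illustration under assumed numerical values, not part of the analysis; the function names are hypothetical, and the comparison is meaningful only when all load fractions are positive for every ordering (that is, when $X(m) < 1$ holds for each of them).

```python
from itertools import permutations
from typing import Sequence, Tuple


def processing_time(E: Sequence[float], C: float,
                    theta_cm: float, theta_cp: float) -> float:
    """T = alpha_1 (E_1 + C) + theta_cm + theta_cp for the sequence given by E."""
    u, v = [1.0], [0.0]                      # alpha_i = u_i * alpha_m + v_i
    for i in range(len(E) - 1, 0, -1):       # i = m-1, ..., 1 (1-based index)
        f_next = (E[i] + C) / E[i - 1]       # f_{i+1}
        beta_i = theta_cm / E[i - 1]
        u.append(u[-1] * f_next)
        v.append(v[-1] * f_next + beta_i)
    alpha_m = (1.0 - sum(v)) / sum(u)        # normalization equation
    alpha_1 = u[-1] * alpha_m + v[-1]
    return alpha_1 * (E[0] + C) + theta_cm + theta_cp


def best_sequence(E: Sequence[float], C: float,
                  theta_cm: float, theta_cp: float) -> Tuple[float, ...]:
    """Ordering of the E values with the smallest processing time (exhaustive)."""
    return min(permutations(E),
               key=lambda seq: processing_time(seq, C, theta_cm, theta_cp))


if __name__ == "__main__":
    print(best_sequence([3.0, 1.0, 2.0], C=1.0, theta_cm=0.1, theta_cp=0.05))
```

For the values in the usage example, the best ordering is the one with $E_i$ in increasing order, which agrees with Lemma 3 and with the condition that the first processor should be the fastest. The exhaustive search grows as $m!$ and is intended only for checking small cases.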
In the earlier study [18], the fast sequence is defined as the sequence $(p_1, p_2, \ldots, p_i, p_{i+1}, \ldots, p_m)$
such that $E_i < E_{i+1}$ for all $i = 1, 2, \ldots, m-1$. For an $m$-processor system, $m!$ different load
distribution sequences are possible. It is possible in this analysis to have a nonoptimal sequence
of load distribution and use additional processors. For example, let the optimal number of
processors for an $m$-processor system, using an optimal sequence, be $m^*$, and the value of $\alpha_1$ for
this be $\alpha_1(m^*)$.
Let the optimal number of processors for the same $m$-processor system using a nonoptimal
sequence be $m^* + k$, and the value of $\alpha_1$ for this be $\alpha_1'(m^* + k)$. Here, because the sequence is
nonoptimal, we can use more processors. We have to prove that $\alpha_1(m^*) < \alpha_1'(m^* + k)$, i.e., we
have to prove that the processing time with the optimal sequence and optimal number of processors
is less than the processing time with a nonoptimal sequence and the corresponding optimal number
of processors (for this nonoptimal sequence). By rearranging the nonoptimal sequence, using
the sequencing analysis, we can obtain the optimal sequence. Let the value of $\alpha_1$ obtained
after rearrangement be $\alpha_1(m^* + k)$. Based on the sequencing analysis, we know that the value
of $\alpha_1$ with an optimal sequence is less than the value of $\alpha_1$ with a nonoptimal sequence, i.e.,
$\alpha_1(m^* + k) < \alpha_1'(m^* + k)$.
Now, we know that $\alpha_1(m^*)$ and $\alpha_1(m^* + k)$ are obtained using an optimal sequence. From
Lemmas 1 and 2, we know that for any given sequence of load distribution, $\alpha_1(m^*) < \alpha_1(m^* + k)$,
and hence, $\alpha_1(m^*) < \alpha_1'(m^* + k)$. Based on the above analysis, we can state the following lemma.
LEMMA 4. The optimal processing time is the processing time obtained using an optimal sequence
of load distribution with an optimal number of processors.
CONCLUSIONS 
The effect of start-up delays in scheduling divisible loads on a bus network is considered, and an
alternate approach is presented to obtain the processing time. In the earlier approach [18], first
the value of $\alpha_m$ is obtained (using the necessary and sufficient condition $X(m) < 1$), and then the
value of $\alpha_1$ and the processing time are obtained. In the approach presented in this paper, a direct
closed-form expression for the value of $\alpha_1$, and hence, the processing time, is presented. It is also
proved that the optimal number of processors obtained using this closed-form expression satisfies
the necessary and sufficient conditions presented in [18]. Using this closed-form expression, we
prove important results in sequencing. It is proved analytically in this paper that, for a bus
network sharing a divisible load with start-up delays, the optimal processing time is obtained
using the optimal sequence of load distribution with the optimal number of processors.
REFERENCES 
1. S.H. Bokhari, Assignment Problems in Parallel and Distributed Computing, Kluwer Academic, Boston, MA,
(1987).
2. M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness,
W.H. Freeman, New York, (1979).
3. Y.C. Cheng and T.G. Robertazzi, Distributed computation with communication delays, IEEE Trans.
Aerospace and Electronic Systems 24 (6), 700-712, (1988).
4. Y.C. Cheng and T.G. Robertazzi, Distributed computation for a tree network with communication delays,
IEEE Trans. Aerospace and Electronic Systems 26 (3), 511-516, (1990).
5. S. Bataineh and T.G. Robertazzi, Bus oriented load sharing for a network of sensor driven processors, IEEE
Trans. Systems, Man, and Cybernetics 21 (5), 1202-1205, (1991).
6. T.G. Robertazzi, Processor equivalent for a linear daisy chain of load sharing processors, IEEE Trans.
Aerospace and Electronic Systems 29 (4), 1216-1221, (1993).
7. J. Sohn and T.G. Robertazzi, Optimal divisible job load sharing on bus network, IEEE Trans. Aerospace
and Electronic Systems 32 (1), 34-40, (1996).
8. V. Bharadwaj, D. Ghose, V. Mani and T.G. Robertazzi, Scheduling Divisible Loads in Parallel and Distributed
Systems, IEEE Computer Society Press, Los Alamitos, CA, (1996).
9. V. Bharadwaj, D. Ghose and V. Mani, Optimal sequencing and arrangement in distributed single-level
networks with communication delays, IEEE Trans. Parallel and Distributed Systems 5 (9), 968-976, (1994).
10. H.J. Kim, G.-I. Jee and J.G. Lee, Optimal load distribution for tree network processors, IEEE Trans.
Aerospace and Electronic Systems 32 (2), 607-612, (1996).
11. V. Mani and D. Ghose, Distributed computation in linear networks: Closed-form solutions, IEEE Trans.
Aerospace and Electronic Systems 30 (2), 471-483, (1994).
12. D. Ghose and V. Mani, Distributed computation with communication delays: Asymptotic performance
analysis, J. Parallel and Distributed Computing 23 (3), 293-305, (1994).
13. D. Ghose and H.J. Kim, Load partitioning and trade-off study for large matrix vector computations in
multicast bus networks with communication delays, J. Parallel and Distributed Computing 55 (1), 32-59,
(1998).
14. D.P. Bertsekas and J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall,
Englewood Cliffs, NJ, (1989).
15. G.D. Barlas, Collection-aware optimum sequencing of operations and closed-form solutions for the distribution
of a divisible load on arbitrary processor trees, IEEE Trans. Parallel and Distributed Systems 9 (5), 429-441,
(1998).
16. J. Blazewicz and M. Drozdowski, Distributed processing of divisible loads with communication startup cost,
Discrete Applied Math. 76 (1-3), (1997).
17. M. Drozdowski, Selected Problems of Scheduling Tasks in Multiprocessor Computer Systems, No. 321,
Wydawnictwa Politechniki Poznańskiej, Poznań, Poland, (1997).
18. V. Bharadwaj, X. Li and C.C. Ko, On the influence of start-up costs in scheduling divisible loads on bus
networks, IEEE Trans. on Parallel and Distributed Systems 11 (12), 1288-1305, (2000).
