Scheduling of Divisible Loads on Heterogeneous Distributed Systems by Abhay Ghatpande et al.
Abhay Ghatpande1*, Hidenori Nakazato1 and Olivier Beaumont2
1GITI, Waseda University, Tokyo 169-0051
2LaBRI, France 33405
1. Introduction
Divisible loads are a special class of applications that have regular linear structure, and which
if given a large enough volume, can be partitioned into independently- and identically-
processable load fractions (parts). Examples of applications that satisfy this divisibility prop-
erty include image processing and rendering, signal processing, computation of Hough trans-
forms, tree and database search, Monte Carlo simulations, computational fluid dynamics, and
matrix computations.
The partitioning of a divisible load, the allocation (mapping) of the parts to appropriate
processors for execution, and the sequencing (ordering) of the transfer of the parts to and from
the processors are together known as Divisible Load Scheduling (DLS). Divisible Load Theory (DLT)
is the framework that studies the optimization of DLS (Bharadwaj et al., 1996). Beaumont,
Casanova, Legrand, Robert & Yang (2005) recently published a review of the work done to
date in DLT. An exhaustive listing of papers regarding DLT and DLS is available in
(Robertazzi, 2008).
1.1 Shortcomings of Traditional DLT
The basic principle of DLT to determine an optimal schedule for a master-slave system is the
AFS (All slaves Finish Simultaneously) policy (Barlas, 1998). The AFS policy implies that
after the nodes finish computing their individual load fractions, no results are returned to
the source. This is an unrealistic assumption for many applications, as the result collection
phase can contribute significantly to the total execution time. This is a serious shortcoming of
traditional DLT. Along with the AFS policy, the presence of idle time in the optimal schedule
has been overlooked in DLT work on result collection and heterogeneity. It is a very important
issue because it may sometimes be possible to improve a schedule by inserting idle time.
A few papers that have dealt with DLS on heterogeneous systems to date (Beaumont, Mar-
chal, Rehn & Robert, 2005; Beaumont et al., 2006; Beaumont, Marchal & Robert, 2005; Bharad-
waj et al., 1996; Comino & Narasimhan, 2002; Rosenberg, 2001) proved that the sequence of
allocation of data to the processors is important in heterogeneous networks. Without consid-
ering result collection, they proved that for optimum performance, (a) when processors have
equal computation capacity, the optimal schedule results when the fractions are allocated in
the order of decreasing communication link capacity, and (b) when communication capacity
*Corresponding author: abhay.ghatpande@ieee.org
is equal, the data should be allocated in the order of decreasing computation capacity. As far
as can be judged, no paper has given a satisfactory solution to the scheduling problem where
both the network bandwidth and computation capacities of the slaves are different, and the
result transfer to the master is explicitly considered.
Cheng & Robertazzi (1990) and Bharadwaj et al. (1996, Chap. 3) addressed the issue of result
collection with a simplistic constant result collection time, which is possible only for a limited
number of applications on homogeneous networks. All other papers that have addressed
result collection to date advocated FIFO (First In, First Out) and LIFO (Last In, First Out)
types of schedules. In FIFO, results are collected in the same order as that of load allocation, while
in LIFO, the order of result collection is reversed. Barlas (1998) addressed the result collection
phase for single-level and arbitrary tree networks, but the optimal sequences derived were
essentially LIFO or FIFO. Rosenberg (2001) too proposed the LIFO and FIFO sequences for
result collection. He concluded through simulations that FIFO is better when the network
is homogeneous with a large number of processors, while LIFO is advantageous when the
network is heterogeneous with a small number of processors.
For the first time, it was shown in (Beaumont, Marchal & Robert, 2005) that the LIFO and
FIFO orderings are not always optimal for a given set of processors. In (Beaumont, Marchal,
Rehn & Robert, 2005; Beaumont et al., 2006), it was proved that all processors from a given
set of processors may not be used in the optimal solution. For the unidirectional single-port
communication model (see Section 2), (Beaumont, Marchal, Rehn & Robert, 2005; Beaumont
et al., 2006; Beaumont, Marchal & Robert, 2005) proved several interesting features in optimal
schedules.
1.2 Chapter Organisation
Section 2 explains the choices made to represent the communication and computation speeds,
the model used for size of result data, the assumptions and reasons regarding continuous
delivery of data, the unidirectional one-port communication model, and the decision to use
linear models of computation and communication time. Sections 2.3 and 3 provide a detailed
derivation of the DLSRCHETS problem definition. After first laying the theoretical basis, the
DLSRCHETS problem is defined in terms of a linear program. Section 4 lays the foundation of
the two-slave system that forms the basis for the SPORT algorithm. Section 5 introduces the
SPORT algorithm as a solution to the DLSRCHETS problem. Given a set of processors sorted
in the order of decreasing communication speed, the complexity of SPORT is O(m). Section 6
summarizes the chapter and outlines ideas for future work.
2. The System Model
The execution of a divisible job on each slave comprises three distinct phases in the
following order: the allocation phase, where data is sent to the slave from the master, the
computation phase, where the data is processed, and the result collection phase, where the
slave sends the result data back to the source. The computation phase begins only after the
entire load fraction allocated to that slave is received from the source. Similarly, the result
collection phase begins only after the entire load fraction has been processed, and is ready
for transmission back to the master. This is known as the non-preemptive, atomic, or block based
model, and each phase forms a block on the time line as shown in Fig. 1.
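[Figure 1: timing diagram. Horizontal axis: Time; rows: processors p1 to p4; legend: Allocation, Computation, Collection; T marks the total processing time.]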
Fig. 1. A general schedule for DLSRCHETS. Processors can do only one thing at a time —
either compute or communicate. There are three phases for each processor — allocation, com-
putation, and result collection, in that order. However, phases of different processors may be
interleaved. Each phase is atomic, i.e., continues to its end without interruption. Communi-
cation phases (either allocation or collection) cannot overlap as shown by the dashed lines.
Computation phases are independent of each other.
2.1 Communication and Computation Model
The non-preemptive communication and computation phases necessitate that the slaves are
continuously and exclusively available during the course of execution of the divisible load.
The master and slaves can do only one thing at a time: either communicate or compute
(the no-overlap model), and if communicating, then either send data or receive data (the
unidirectional one-port model).
A heterogeneous master-slave (sometimes called a star or single-level tree) system H = (P, L)
is as shown in Fig. 2, where P = {p0, . . . , pm} is the set of m + 1 processors, and L =
{l1, . . . , lm} is the set of m network links that connect the master scheduler (source) p0 at the
center of the star (root of the tree), to the slave processors p1, . . . , pm at the points of the star
(leaves of the tree). E = {E1, . . . , Em} is the set of unit computation times of the slave
processors, and C = {C1, . . . , Cm} is the set of unit communication times of the network links, i.e.,
pk takes Ek time units to process a unit load transmitted to it from p0 in Ck time units over the
link lk. It follows that Ek, Ck > 0, k ∈ {1, . . . , m}. The values in E and C are assumed to be
deterministic and available at the master.
The master holds a divisible load (job) J that is to be distributed and processed on H. Based
on the unit communication and computation time values of the slaves, the master p0 splits J
into parts (fractions) α1, . . . , αm and sends them to the respective slave processors p1, . . . , pm
for computation. Each such set of m fractions is known as a load distribution α = {α1, . . . , αm}.
The source does not retain any part of the load for computation. Since the job J is assumed
to be arbitrarily divisible, $\alpha_k \in \mathbb{R}^+_0$, i.e., $\alpha_k \ge 0$, $k \in \{1, \dots, m\}$. The unit communication and
computation times are conditional upon the job J under consideration. So ideally, the values
should be indexed as $C^J_k$ and $E^J_k$, to indicate that the values are valid only for the job J. This
index is omitted as the context is clear to be the job J.
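[Figure 2: star network in which the master p0 (E0) is connected to each slave pk (unit computation time Ek) through the link lk (unit communication time Ck), for k = 1, . . . , m.]
Fig. 2. The heterogeneous master-slave system H. The processors have different computation
speeds and network bandwidths.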
2.2 Result Data Model
For the divisible loads under consideration, the computation phase usually involves simple
linear transformations on the input data, and the volume of returned results can be considered
to be proportional to the amount of load received in the allocation phase. If the allocated load
fraction is αk, then the returned result is equal to δαk, 0 ≤ δ ≤ 1. The constant δ is application
specific, and is the same for all processors for a particular load J . This is the accepted model
for returned results in literature to date (Adler et al., 2003; Barlas, 1998; Beaumont, Marchal,
Rehn & Robert, 2005; Beaumont et al., 2006; Beaumont, Marchal & Robert, 2005; Bharadwaj
et al., 1996; Comino & Narasimhan, 2002; Rosenberg, 2001; Yu & Robertazzi, 2003).
2.3 Communication and Computation Time
The time taken for communication and computation is assumed to be a linearly increasing
function of the size of load fraction. For a load fraction αk, αkCk is the transmission time from
p0 to pk, αkEk is the time it takes pk to perform the requisite processing on αk, and δαkCk is the
time it takes pk to finally transmit the results back to p0. Though a linear model is considered
for computation and communication times for the sake of simplicity, all results can be easily
extended to other models.
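The linear time model above can be illustrated with a minimal sketch. The function name `phase_durations` and the numeric values are assumptions made for illustration only; they do not come from the chapter.

```python
def phase_durations(alpha_k, C_k, E_k, delta):
    """Return (allocation, computation, collection) times for one slave p_k.

    Assumes the linear model of this section: every phase time is
    proportional to the load fraction alpha_k.
    """
    allocation = alpha_k * C_k           # master -> slave transfer time
    computation = alpha_k * E_k          # processing time on the slave
    collection = delta * alpha_k * C_k   # slave -> master result transfer time
    return allocation, computation, collection


# Hypothetical values: load fraction 4, C_k = 2, E_k = 3, delta = 0.5.
durations = phase_durations(alpha_k=4.0, C_k=2.0, E_k=3.0, delta=0.5)
print(durations)
```

Because the result size is δαk and it travels over the same link lk, the collection time is always δ times the allocation time of the same processor.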
In the DLSRCHETS problem, the master has to partition the load J into fractions α1, . . . , αm,
and manage the allocation of these fractions to, and collection of the results from the
processors p1, . . . , pm in the minimum possible time. Let $\mathcal{T} = \{1, \dots, m\}$ be the set of tasks
corresponding to the m fractions that are allocated to, and $\mathcal{R} = \{1, \dots, m\}$ be the set of results that
are collected from the processors p1, . . . , pm respectively.
Though the load fractions (tasks) can be processed independently of each other on the
respective processors, the single-port communication model implicitly induces a precedence order on
the distribution of the tasks and collection of the results. Let $\prec_a$ and $\prec_c$ be total orders on the
sets $\mathcal{T}$ and $\mathcal{R}$ respectively, such that $\prec_a$ represents the sequence (order) in which processors
are allocated tasks, and $\prec_c$ is the sequence in which results are collected from the processors at
the master. Then, $i \prec_a j$ implies that task i precedes task j (or equivalently task j succeeds task i)
in the allocation sequence $\prec_a$, and $i \prec_c j$ signifies that result i precedes result j in the collection
sequence $\prec_c$. If $\{k \in \mathcal{T} : i \prec_a k \prec_a j\} = \emptyset$, then task i is the immediate predecessor of task j in
$\prec_a$, and is denoted as $i \lessdot_a j$. Similarly, if $\{k \in \mathcal{R} : i \prec_c k \prec_c j\} = \emptyset$, then result j is the
immediate successor of result i in $\prec_c$, and is denoted as $i \lessdot_c j$. Define
$B^i_{\prec_a} := \{j \in \mathcal{T} : j \prec_a i\} \cup \{i\}$ and $F^i_{\prec_a} := \{j \in \mathcal{T} : i \prec_a j\} \cup \{i\}$,
i.e., $B^i_{\prec_a}$ is the set of task i and the tasks before i (predecessors of i)
in $\prec_a$, while $F^i_{\prec_a}$ is the set of task i and the followers (successors) of task i in $\prec_a$.
$B^i_{\prec_c}$ and $F^i_{\prec_c}$ are defined accordingly for $\prec_c$. The minimal element of $\prec_a$ is defined as
$\prec^+_a := \exists!\, i \in \mathcal{T} : B^i_{\prec_a} = \{i\}$ and the maximal element of $\prec_a$ is defined as
$\prec^-_a := \exists!\, i \in \mathcal{T} : F^i_{\prec_a} = \{i\}$, i.e., $\prec^+_a$ and $\prec^-_a$ are
the first and last tasks allocated in $\prec_a$. $\prec^+_c$ and $\prec^-_c$ are similarly defined as the first and last
results returned in $\prec_c$.

[Figure 3: timing diagram for processors p1, p2, p3 along the Time axis. Each processor pk has an allocation phase of length αkCk, a computation phase of length αkEk, and a collection phase of length δαkCk; T marks the total processing time.]
Fig. 3. A possible schedule with m = 3. The three phases of each processor are atomic and
satisfy the constraints (1) to (9).
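The predecessor and follower sets defined above can be made concrete with a small sketch. It assumes that a total order is stored as a Python list of task indices in sequence; the helper names `predecessor_set` and `follower_set` and the sample order are hypothetical.

```python
def predecessor_set(order, i):
    """B^i: task i together with every task that precedes it in `order`."""
    return set(order[: order.index(i) + 1])


def follower_set(order, i):
    """F^i: task i together with every task that follows it in `order`."""
    return set(order[order.index(i):])


# Hypothetical allocation sequence: task 2 first, then task 1, then task 3.
alloc_order = [2, 1, 3]
B1 = predecessor_set(alloc_order, 1)   # task 1 and the tasks sent before it
F1 = follower_set(alloc_order, 1)      # task 1 and the tasks sent after it
first, last = alloc_order[0], alloc_order[-1]
print(B1, F1, first, last)
```

With this representation, the minimal element of the order is the only task whose predecessor set contains just itself, and the maximal element is the only task whose follower set contains just itself.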
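For a given load J, the objective is to minimize the total processing time T, which is defined
as the time taken from the point when the master first initiates the allocation of tasks, to the
point when the master completes reception of all the results. The schedule S of DLSRCHETS
for a given load distribution α, is a pair (t, r), where $t : \mathcal{T} \to \mathbb{R}^+_0$ is the task allocation start
time function, and $r : \mathcal{R} \to \mathbb{R}^+_0$ is the result collection start time function. In a feasible schedule,
the start times in t and r must satisfy the following constraints:
\[ t_j - t_i \ge \alpha_i C_i \quad \forall\, i \in \{1, \dots, m\},\ i \lessdot_a j \tag{1} \]
\[ t_i \ge \sum_{j \in B^i_{\prec_a} \setminus \{i\}} \alpha_j C_j \quad \forall\, i \in \{1, \dots, m\} \tag{2} \]
\[ r_j - r_i \ge \delta \alpha_i C_i \quad \forall\, i \in \{1, \dots, m\},\ i \lessdot_c j \tag{3} \]
\[ T - r_i \ge \sum_{j \in F^i_{\prec_c}} \delta \alpha_j C_j \quad \forall\, i \in \{1, \dots, m\} \tag{4} \]
\[ r_i - t_i \ge \alpha_i C_i + \alpha_i E_i \quad \forall\, i \in \{1, \dots, m\} \tag{5} \]
\[ t_i \ne r_j \quad \forall\, i, j \in \{1, \dots, m\} \tag{6} \]
\[ r_j - t_i \ge \alpha_i C_i \quad \forall\, j \in \{1, \dots, m\},\ \forall\, t_i < r_j \tag{7} \]
\[ t_i - r_j \ge \delta \alpha_j C_j \quad \forall\, i \in \{1, \dots, m\},\ \forall\, r_j < t_i \tag{8} \]
\[ t_i, r_j \ge 0 \quad \forall\, i, j \in \{1, \dots, m\} \tag{9} \]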
The precedence constraints of ≺a are enforced by (1) and (2), while inequalities (3) and (4)
impose the precedence constraints of ≺c and define the processing time T. The fact that the
result collection cannot begin before the execution of the entire load fraction is complete is
shown by (5). Constraints (6), (7), and (8) impose the single-port model so that no allocation
and collection phase can overlap. The non-negativity of the start times is ensured by (9).
Figure 3 shows the timing diagram for a feasible schedule with m = 3. The time spent in
communication with the master p0 is shown above the horizontal axes, and time spent in
computation by the individual processors below the horizontal axes. Since p0 does not retain
any part of the load for itself, there is no p0 axis.
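The feasibility conditions (1) to (9) can be checked mechanically. The following is a sketch only: processors are 0-indexed, orders are lists, constraint (6) is read as requiring that an allocation and a collection never start at the same instant, and the function name `is_feasible` and the two-slave instance are hypothetical.

```python
def is_feasible(alpha, C, E, delta, alloc_order, coll_order, t, r, T):
    """Check constraints (1) to (9); processors are 0-indexed here."""
    m = len(alpha)
    send = [alpha[k] * C[k] for k in range(m)]          # allocation durations
    back = [delta * alpha[k] * C[k] for k in range(m)]  # collection durations

    # (9): start times are non-negative
    if min(t) < 0 or min(r) < 0:
        return False
    # (1) and (3): each phase ends before its immediate successor starts
    for i, j in zip(alloc_order, alloc_order[1:]):
        if t[j] - t[i] < send[i]:
            return False
    for i, j in zip(coll_order, coll_order[1:]):
        if r[j] - r[i] < back[i]:
            return False
    # (2): task i starts after all of its predecessors have been sent
    for pos, i in enumerate(alloc_order):
        if t[i] < sum(send[j] for j in alloc_order[:pos]):
            return False
    # (4): result i and all of its successors are received by time T
    for pos, i in enumerate(coll_order):
        if T - r[i] < sum(back[j] for j in coll_order[pos:]):
            return False
    # (5): collection starts only after allocation and computation finish
    for k in range(m):
        if r[k] - t[k] < send[k] + alpha[k] * E[k]:
            return False
    # (6) to (8): an allocation and a collection never share the port
    for i in range(m):
        for j in range(m):
            if t[i] == r[j]:
                return False
            if t[i] < r[j] and r[j] - t[i] < send[i]:
                return False
            if r[j] < t[i] and t[i] - r[j] < back[j]:
                return False
    return True


# Hypothetical two-slave instance with FIFO collection.
alpha, C, E, delta = [0.5, 0.5], [2.0, 4.0], [4.0, 2.0], 0.5
order = [0, 1]
ok = is_feasible(alpha, C, E, delta, order, order,
                 t=[0.0, 1.0], r=[3.0, 4.0], T=5.0)
too_early = is_feasible(alpha, C, E, delta, order, order,
                        t=[0.0, 1.0], r=[2.5, 4.0], T=5.0)
print(ok, too_early)
```

In the second call the first result is requested before the first processor has finished computing, so constraint (5) fails.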
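[Figure 4: timing diagram. The collection phase of pi (length δαiCi, starting at ri) is immediately followed by the allocation phase of pj (length αjCj, starting at tj).]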
Fig. 4. Interleaved result collection. There exists at least one pair of ri and tj that immediately
follow each other.
Condition 1 (Allocation Precedence Condition). The master should first allocate the entire
load to the processors before receiving any results from the processors.
Lemma 1 (Allocation Precedence Lemma). There exists an optimal schedule for DLSRCHETS that
satisfies the allocation precedence condition. (There may exist other optimal schedules that do not satisfy
the allocation precedence condition.)
Proof. Consider a feasible schedule with processing time T, that satisfies (1) to (9) for a load
distribution α, and an arbitrary order of allocation and collection ≺a and ≺c, such that some
results are collected before the load is completely allocated.
Then, there exists at least one pair (i, j) with i ≺a j, such that the result collection starting at ri
is followed by a task allocation at tj, without any other intermediate communication phase as
shown in Fig. 4.
Suppose that all load fractions in α, and all other start times in t and r are maintained the
same, and only the order of collection of result i and allocation of task j is exchanged, such
that the new allocation start time of task j is t′j = ri, and the new collection start time of result
i is r′i = ri + αjCj.
Since the above exchange does not alter the order of allocation of different tasks, the prece-
dence constraints of ≺a defined by (1) and (2) still hold. Similarly, the precedence constraints
of ≺c, imposed by (3) and (4) also hold after the exchange. The constraints (6), (7), and (8) are
valid after the exchange because the single-port model is not violated by the exchange.
Only the conditions expressed by (5) require verification. Before the exchange, the conditions
$r_i - t_i \ge \alpha_i C_i + \alpha_i E_i$ and $r_j - t_j \ge \alpha_j C_j + \alpha_j E_j$ are satisfied. After the exchange, the
constraints (5) are still valid because $r'_i - t_i = r_i + \alpha_j C_j - t_i > r_i - t_i$, and
$r_j - t'_j = r_j - r_i > r_j - t_j$.
From the above observations, it is clear that after the reordering, all conditions for feasibility
are still satisfied. Moreover, the orders ≺a and ≺c are unchanged, and no additional process-
ing time is required for the reordering.
If a similar reordering is carried out for all such pairs (i, j), then the allocation precedence
condition is satisfied with no addition in total processing time T.
Now if there is an optimal schedule for DLSRCHETS that does not satisfy the allocation prece-
dence condition, then a reordering can be performed as mentioned above so that the schedule
satisfies the allocation precedence condition without an increase in the total processing time.
That is, there always exists an optimal schedule that satisfies the allocation precedence condi-
tion, and only such schedules need be considered in the search for the optimal schedule.
Two other basic lemmas are stated before the DLSRCHETS problem is defined.
Lemma 2. There exists an optimal schedule for DLSRCHETS that has no idle time between any two
consecutive allocation phases and any two consecutive result collection phases. (There may exist other
optimal schedules that do not satisfy this condition.)
Proof. Assume that a feasible schedule that obeys (1) to (9), and in addition also satisfies the
allocation precedence condition, has idle time between the consecutive communication phases
(see Fig. 3). Let the processing time be T, the load distribution be α, and (≺a,≺c) be the orders
of allocation and collection.
According to the assumptions in the system model, all processors are available continuously
and exclusively during the entire execution process, and the master can only communicate
with one processor at a time. For any $i \lessdot_a j$, when processor $p_i$ completes the reception of
its allocated task at time $t_i + \alpha_i C_i$, processor $p_j$ is already available and can start receiving
data immediately at $t_j = t_i + \alpha_i C_i$. Because the schedule satisfies the allocation precedence
condition, load is first distributed to all the processors sequentially before result collection
begins. Thus the start time of each task $i \in \mathcal{T}$ can be brought forward so that
$t_i = t_{\prec^+_a} + \sum_{j \in B^i_{\prec_a} \setminus \{i\}} \alpha_j C_j$,
and the inequalities (1) and (2) are reduced to equalities without exceeding T.
Following a similar logic to the one above, the result collection of each result $i \in \mathcal{R}$ can be
delayed to the extent necessary to make the result collection start time
$r_i = T - \sum_{j \in F^i_{\prec_c}} \delta \alpha_j C_j$,
with inequalities (3) and (4) reduced to equalities and no extra time added to T.
Since any feasible schedule can be reordered in this manner to eliminate the idle time between
communication phases, it follows that an optimal schedule to DLSRCHETS also has no idle
time between any two consecutive allocation and result collection phases.
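The construction used in the proof of Lemma 2 can be sketched as follows: allocations are packed back to back from time zero, and collections are packed back to back so that the last one ends at T. The sketch assumes 0-indexed processors, list-valued orders, a first allocation starting at time 0, and a hypothetical function name `compact_start_times`. It does not check constraint (5).

```python
def compact_start_times(alpha, C, delta, alloc_order, coll_order, T):
    """Start times with no idle time between consecutive communications."""
    m = len(alpha)
    send = [alpha[k] * C[k] for k in range(m)]          # allocation durations
    back = [delta * alpha[k] * C[k] for k in range(m)]  # collection durations

    t = [0.0] * m
    clock = 0.0
    for i in alloc_order:            # t_i = sum of earlier allocation times
        t[i] = clock
        clock += send[i]

    r = [0.0] * m
    clock = T
    for i in reversed(coll_order):   # r_i = T - time of result i and later ones
        clock -= back[i]
        r[i] = clock
    return t, r


# Hypothetical two-slave instance with FIFO collection and T = 5.
t, r = compact_start_times(alpha=[0.5, 0.5], C=[2.0, 4.0], delta=0.5,
                           alloc_order=[0, 1], coll_order=[0, 1], T=5.0)
print(t, r)
```

Delaying each collection as far as T allows can only increase the gap between the end of computation and the start of collection, which is why the proof can treat the two directions independently.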
Lemma 3. There exists an optimal schedule for DLSRCHETS that has no idle time between the allo-
cation and computation phases of each processor. (There may exist other optimal schedules that do not
satisfy this condition.)
Proof. Following an argument similar to the one used in Lemma 2, since all processors are
always available, they can begin computing immediately upon receiving their load fractions
in the allocation phase without affecting the schedule.
Any processor $p_i$ begins computing its allocated task at time
$t_{\prec^+_a} + \sum_{j \in B^i_{\prec_a}} \alpha_j C_j$ without crossing the time interval T.
Since any feasible schedule can be reordered in this manner, an optimal
schedule to DLSRCHETS too has no idle time between the allocation and computation phases
of each processor.
Theorem 1 (Feasible Schedule Theorem). There exists an optimal schedule for DLSRCHETS that
satisfies Lemmas 1 to 3.
Proof. If there exists an optimal schedule that does not satisfy any or all of the Lemmas 1 to 3,
it can always be reordered as explained in the respective proofs to satisfy the same.
From Theorem 1, it follows that only those schedules that satisfy Lemmas 1 to 3 need be
considered in the search for the optimal solution to DLSRCHETS. A possible timing diagram
for such a schedule is shown in Fig. 5.
From the preceding discussion, it can be concluded that the start times t and r in the optimal
schedule for DLSRCHETS can be determined from the sequences ≺a and ≺c, and the load
distribution α that minimize the processing time T. Hence instead of finding t and r as in
traditional scheduling practice, the DLSRCHETS problem is formulated as a linear programming
problem, to find ≺a, ≺c, and α that minimize T. Once the optimal values of these variables
are known, it is straightforward to find the optimal schedule.

[Figure 5: timing diagram for processors p1, p2, p3 along the Time axis, with allocation phases αkCk, computation phases αkEk, and collection phases δαkCk. x1, x2, x3 mark the idle time of each processor, y marks the interval between the end of load distribution and the start of result collection, and T marks the total processing time.]
Fig. 5. A schedule for m = 3 that satisfies the Feasible Schedule Theorem. Result collection
begins only after the entire load is distributed. Each allocation and result collection phase
follows its predecessor without delay. The computation phase of each processor follows its
allocation phase without delay. Idle time may be present in each processor between the end
of its computation phase and the start of the result collection phase.
The constraints (1) to (9) and the allocation precedence condition are combined into a unified
form, and for each processor $p_i$, constraints on T are written in terms of $B^i_{\prec_a}$ and $F^i_{\prec_c}$. The
DLSRCHETS problem is defined in terms of a linear program as follows.
Definition 1 (Divisible Load Scheduling with Result Collection on HETerogeneous Systems).
Given a heterogeneous network H = (P, L), a divisible load J, unit communication
and computation times C, E, find the sequence pair $(\prec^*_a, \prec^*_c)$, and load distribution
$\alpha^* = \{\alpha^*_1, \dots, \alpha^*_m\}$ that
Minimize T
Subject To:
\[ \sum_{j \in B^k_{\prec_a}} \alpha_j C_j + \alpha_k E_k + \sum_{j \in F^k_{\prec_c}} \delta \alpha_j C_j \le T \quad k = 1, \dots, m \tag{10} \]
\[ \sum_{j=1}^{m} \alpha_j C_j + \sum_{j=1}^{m} \delta \alpha_j C_j \le T \tag{11} \]
\[ \sum_{j=1}^{m} \alpha_j = J \tag{12} \]
\[ T \ge 0, \quad \alpha_k \ge 0 \quad k = 1, \dots, m \tag{13} \]
In the above formulation, for a sequence pair (≺a, ≺c), and a load distribution α, the LHS
(Left Hand Side) of constraint (10) indicates the total time spent in transmission of tasks to
all the processors that must receive load before the processor pk can begin processing its
allocated task, the computation time on the processor pk itself, and the time for transmission
back to the master of results of processor pk, and all its subsequent result transfers. For the
no-overlap model to be satisfied, the processing time T should be greater than or equal to
this time for all the m processors. The single-port communication model is enforced by (11)
since its LHS represents the lower bound on the time for distribution and collection under this
model. The fact that the entire load is distributed amongst the processors is imposed by (12).
This is the normalization equation. The non-negativity of the decision variables is ensured by
constraint (13).
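For a fixed sequence pair and a fixed load distribution, constraints (10) and (11) give the smallest admissible T directly: it is the largest of their left-hand sides. The sketch below evaluates this bound; it does not solve the linear program over α. Processors are 0-indexed, and the function name `min_processing_time` and the instance are hypothetical.

```python
def min_processing_time(alpha, C, E, delta, alloc_order, coll_order):
    """Smallest T satisfying (10) and (11) for a fixed alpha and order pair."""
    m = len(alpha)
    send = [alpha[k] * C[k] for k in range(m)]          # allocation durations
    back = [delta * alpha[k] * C[k] for k in range(m)]  # collection durations

    bounds = []
    for k in range(m):
        a_pos = alloc_order.index(k)
        c_pos = coll_order.index(k)
        lhs_10 = (sum(send[j] for j in alloc_order[:a_pos + 1])  # over B^k
                  + alpha[k] * E[k]                               # computation
                  + sum(back[j] for j in coll_order[c_pos:]))     # over F^k
        bounds.append(lhs_10)

    lhs_11 = sum(send) + sum(back)   # total time the single port is busy
    return max(bounds + [lhs_11])


# Hypothetical two-slave instance with J = 1 split equally.
alpha, C, E, delta = [0.5, 0.5], [2.0, 4.0], [4.0, 2.0], 0.5
T_fifo = min_processing_time(alpha, C, E, delta, [0, 1], [0, 1])
T_lifo = min_processing_time(alpha, C, E, delta, [0, 1], [1, 0])
print(T_fifo, T_lifo)
```

For this particular instance and this particular split of the load, the FIFO collection order gives the smaller value; a different α or different speeds could reverse the comparison.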
3. Analysis of Optimal Solution
Processors that are allocated load are called participating processors or participants.
Theorem 2 (Idle Time Theorem). There exists an optimal solution to the DLSRCHETS problem,
in which irrespective of whether load is allocated to all available processors, at the most one of the
participating processors has idle time, and the idle time exists only when the result collection begins
immediately after the completion of load distribution.
Proof. For a pair (≺a, ≺c), the DLSRCHETS problem defined by (10) to (13) always has a
feasible solution. This is because, for any load distribution α that satisfies (12), T can be made
arbitrarily large to satisfy the inequalities (10) and (11). It implies that the polyhedron formed
by the constraints of the DLSRCHETS problem is non-empty, i.e.,
$P := \{x \in \mathbb{R}^{m+1} : Ax \le b,\ x \ge 0\} \ne \emptyset$.
According to the theory of linear programming, the optimal solution to DLSRCHETS is
obtained at some vertex of this polyhedron (Dantzig, 1963; Vanderbei, 2001). As the
DLSRCHETS problem has m + 1 decision variables and 2m + 3 constraints, in a non-degenerate
optimal solution, at the optimal vertex, m + 1 constraints out of these must be tight, i.e.,
satisfied with equality. In a degenerate optimal solution, more than m + 1 constraints are tight.
It is clear that in an optimal solution, the normalization constraint (12) will always be tight,
and T will always be greater than zero. This means that m constraints out of the remaining
2m+ 1 constraints will be tight in a non-degenerate optimal solution. There are two possible
ways to proceed with the analysis at this point depending on the allocated load fractions in
the optimal solution.
1. ∀ k ∈ {1, . . . , m} : αk > 0.
In this case, all the load fractions are assumed to be always greater than zero, i.e. num-
ber of participants is m. Since all decision variables are positive, there can be no degen-
eracy (Vanderbei, 2001, Chapter 3).
It leaves only m + 1 constraints (10) and (11), out of which m will be tight in the optimal
solution. Hence, in the optimal solution, either,
(a) the m constraints (10) are tight, and the (11) constraint is not, or
(b) the (11) constraint is tight and one of the (10) constraints is not.
If any constraint from (10) and (11) is not tight in the optimal solution, it implies a
shortfall in the LHS as compared to the optimal processing time. In constraints (10) this
shortfall represents idle time in a processor, while in (11) it represents the intervening
time interval between completion of load distribution from the master and the start of
result transfer to the master.
Thus, if the option (a) above is true, then none of the processors have any idle time
in the optimal solution. If the option (b) is true, then one of the processors has idle
time, and since this happens only when constraint (11) is tight, it means that idle time
in a processor exists only when result transfer to the master begins immediately after
load allocation is completed. This is similar to the analysis in Beaumont,
Marchal, Rehn & Robert (2005); Beaumont et al. (2006).
2. ∃ k ∈ {1, . . . , m} : αk = 0.
In this case, some of the processors can be allocated zero load in the optimal solution.
The analysis has two parts — one for non-degenerate and the other for degenerate op-
timal solutions.
Non-degenerate Optimal Solution
If there are p (p ≤ m) participants in the optimal solution, then m − p constraints of (13)
are necessarily tight. This means that out of the m + 1 constraints (10) and (11), only p
constraints will be tight in the optimal solution. Hence, in an optimal solution, either,
(a) p of the (10) constraints are tight, m − p of the (10) constraints are not tight, and
the (11) constraint is not tight, or
(b) the (11) constraint is tight, p− 1 of the (10) constraints are tight, and m− p + 1 of
the (10) constraints are not tight.
In the optimal solution, if the option (a) is true, then m − p processors have idle time,
while if the option (b) is true, then m− p + 1 processors have idle time.
Since m− p processors are not allocated load, it is obvious that they are idle throughout
in either of the above two options. The additional processor with idle time if the op-
tion (b) is true has to be one of the participating processors. This means that idle time
in a participating processor exists only when the result collection begins immediately
upon completion of load allocation.
Degenerate Optimal Solution
Similar to the non-degenerate case, if there are p (p ≤ m) participants in the optimal
solution, then m− p constraints of (13) are necessarily tight. Since the optimal solution
is degenerate, more than p constraints out of the m + 1 constraints (10) and (11) will be
tight.
This means that in the optimal solution, irrespective of whether the (11) constraint is
tight, at least p of the (10) constraints are tight, and fewer than m − p of the (10) constraints
are not tight. Since m − p processors are necessarily idle, some of the (10) constraints
corresponding to the processors allocated zero load are tight in the degenerate solution.
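Since $\forall\, k \in \{1, \dots, m\}$, $B^k_{\prec_a}, F^k_{\prec_c} \subseteq \{1, \dots, m\}$, it implies that,
\[ \sum_{j \in B^k_{\prec_a}} \alpha_j C_j \le \sum_{j=1}^{m} \alpha_j C_j \quad k \in \{1, \dots, m\} \]
and
\[ \sum_{j \in F^k_{\prec_c}} \delta \alpha_j C_j \le \sum_{j=1}^{m} \delta \alpha_j C_j \quad k \in \{1, \dots, m\} \]
It follows that,
\[ \sum_{j \in B^k_{\prec_a}} \alpha_j C_j + \sum_{j \in F^k_{\prec_c}} \delta \alpha_j C_j \le \sum_{j=1}^{m} \alpha_j C_j + \sum_{j=1}^{m} \delta \alpha_j C_j \quad k \in \{1, \dots, m\} \tag{14} \]
If (11) is not tight, then the RHS (Right Hand Side) of (14) is strictly less than T. That is,
\[ \sum_{j \in B^k_{\prec_a}} \alpha_j C_j + \sum_{j \in F^k_{\prec_c}} \delta \alpha_j C_j < T \quad k \in \{1, \dots, m\} \tag{15} \]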
If $\exists\, k \in \{1, \dots, m\} : \alpha_k = 0$, then $\alpha_k E_k = 0$, and from (15), it immediately follows that
the corresponding constraint from (10) can never be tight.
Thus, a constraint corresponding to a processor pk allocated zero load is tight in the
optimal solution only if
\[ \sum_{j \in B^k_{\prec_a}} \alpha_j C_j + \sum_{j \in F^k_{\prec_c}} \delta \alpha_j C_j - T = 0 \tag{16} \]
or equivalently if (14) is satisfied with an equality, and the RHS of (14) is equal to T, i.e.,
the (11) constraint is tight.
It is now clear that a degenerate optimal solution exists only when the (11) constraint is
tight, and the condition (16) is satisfied. To find when the condition is satisfied, consider
the case where for some pair (≺a, ≺c), one or more of the processors allocated zero
load follow each other at the end of the allocation sequence and the start of the result
collection sequence in the optimal solution.
For example, if $\alpha_i, \alpha_j, \alpha_k = 0$, and one or more of the following occur (the list is not
exhaustive):
• $\prec^-_a = i$ and $\prec^+_c = i$
• $i \lessdot_a j$, $\prec^-_a = j$ and $\prec^+_c = i$
• $i \lessdot_a j$, $\prec^-_a = j$, $\prec^+_c = k$ and $k \lessdot_c i$
Constraint (14) is satisfied with an equality only if such tail-end zero-load processors exist.
Finally, if constraint (11) is tight in the optimal solution, then it follows that the
constraints corresponding to these processors are tight.
The linear program obtained after eliminating the redundant constraints correspond-
ing to the tail-end zero-load processors has a non-degenerate optimal solution. This
is because, the feasible region defined by the constraints of the non-degenerate prob-
lem does not change after addition of the redundant constraints. Hence only a single
participant processor has idle time in the degenerate optimal solution.
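The slack interpretation used throughout this proof can be computed directly: the shortfall of each constraint (10) is the idle time of the corresponding processor, and the shortfall of (11) is the interval between the end of load distribution and the start of result collection. The sketch below uses exact fractions, 0-indexed processors, and a hypothetical two-slave FIFO instance whose load distribution was chosen so that both constraints (10) are tight; the function name `idle_times` is also an assumption.

```python
from fractions import Fraction


def idle_times(alpha, C, E, delta, alloc_order, coll_order, T):
    """Return (x, y): slack of each constraint (10) and slack of (11)."""
    m = len(alpha)
    send = [alpha[k] * C[k] for k in range(m)]          # allocation durations
    back = [delta * alpha[k] * C[k] for k in range(m)]  # collection durations

    x = []
    for k in range(m):
        a_pos = alloc_order.index(k)
        c_pos = coll_order.index(k)
        busy = (sum(send[j] for j in alloc_order[:a_pos + 1])
                + alpha[k] * E[k]
                + sum(back[j] for j in coll_order[c_pos:]))
        x.append(T - busy)            # idle time of processor k
    y = T - (sum(send) + sum(back))   # gap between distribution and collection
    return x, y


# Hypothetical instance: C = (2, 4), E = (4, 2), delta = 1/2, J = 1.
alpha = [Fraction(6, 11), Fraction(5, 11)]
x, y = idle_times(alpha, C=[2, 4], E=[4, 2], delta=Fraction(1, 2),
                  alloc_order=[0, 1], coll_order=[0, 1], T=Fraction(52, 11))
print(x, y)
```

In this instance neither processor is idle and the gap y is strictly positive, which corresponds to option (a) of the first case of the proof.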
From the preceding discussion on the optimal solution to the linear program for a pair
$(\prec_a, \prec_c)$, it follows that in the optimal solution to the DLSRCHETS problem,
$(\prec_a^{*}, \prec_c^{*}, \alpha^{*})$, at most one participating processor can have idle time.
The idle time occurs only when the result collection from processor $\prec_c^{+}$ starts
immediately after completion of load allocation to processor $\prec_a^{-}$.
There are m! possible permutations each of ≺a and ≺c, and the linear program has to be
evaluated (m!)² times to determine the globally optimum solution
$(\prec_a^{*}, \prec_c^{*}, \alpha^{*})$ for DLSRCHETS.
Since the solution to the linear program is completely determined by the values of δ, C and E,
along with the pair (≺a, ≺c), it is not possible to predict which of the processors or how many
processors will be allocated zero load.
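The size of this search space can be illustrated with a short Python sketch. It only enumerates the (m!)² sequence pairs; solving the linear program (10) to (13) for each pair is deliberately left out. The function name sequence_pairs is an illustrative assumption and does not appear in the original work.

```python
from itertools import permutations, product
from math import factorial


def sequence_pairs(m):
    """Return an iterator over all (allocation, collection) sequence pairs."""
    indices = range(1, m + 1)
    # Materialise the m! orderings once so both factors can reuse them.
    orders = list(permutations(indices))
    return product(orders, orders)


m = 3
pairs = list(sequence_pairs(m))
# One linear program would have to be solved for each of these pairs.
assert len(pairs) == factorial(m) ** 2
```

Even for m = 5 this already amounts to 14,400 linear programs, which is why the brute-force search is only practical for very small systems.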
4. Analysis of Two-Slave System
For a sequence pair (σa, σc) and load distribution α = {α1, . . . , αm}, a slave processor pi may
have idle time xi because it may have to wait for another processor to release the
communication medium for result transfer (ref. Fig. 5). In the optimal solution to DLSRCHETS,
xi = 0 for all i ∈ {1, . . . , m} if and only if y > 0, and there exists a unique xi > 0 if and only
if y = 0, where y is the intervening time interval between the end of the allocation phase of
processor σa[m] and the start of result collection from processor σc[1]. For the FIFO schedule in
particular, processor σa[m] can always be selected to have idle time when y = 0, i.e., in the
FIFO schedule, $x_{\sigma_a[m]} > 0$ if and only if y = 0. In the LIFO schedule, since y > 0 always,
no processor has idle time, i.e., xi = 0 for all i ∈ {1, . . . , m} (Beaumont, Marchal, Rehn &
Robert, 2005; Beaumont et al., 2006; Beaumont, Marchal & Robert, 2005).

Fig. 6. The heterogeneous two-slave system. The two processors p1 and p2 are replaced by an
equivalent virtual processor p1:2 on the right. The two network links l1 and l2 are replaced by
an equivalent virtual link l1:2. As far as the master p0 is concerned, there is no difference in
the time it takes for the equivalent processor to execute a task.

Let the allocation sequence be represented by σa, and the collection sequence by σc, both of
which are permutations of the index set K = {1, . . . ,m} of slave processors in the heteroge-
neous system H. For a pair (σa, σc), the solution to the linear program defined by (10) to (13)
is completely determined by the values of δ, E , C, and it is not possible to predict which pro-
cessor is the one that has idle time in the optimal solution. In fact, it is possible that not all
processors are allocated load in the optimal solution, in which case some processors are idle
throughout.
The heterogeneous system H = (P ,L) with m = 2 is shown in Fig. 6. It is defined by P =
{p0, p1, p2} and L = {l1, l2}. The unit computation and communication times are defined by
the sets E = {E1, E2}, and C = {C1,C2}. Without loss of generality, it is assumed that the total
load to be processed available at the master is J = 1. Also it is assumed that C1 ≤ C2. No
assumptions are possible regarding the relationship between E1 and E2, or C1 + E1 + δC1 and
C2 + E2 + δC2.
An important parameter, ρk, known as the network parameter, is introduced, which indicates for
a slave pk how fast (or slow) its computation parameter Ek is with respect to the
communication parameter Ck of its network link:
\[ \rho_k = \frac{E_k}{C_k}, \qquad k = 1, \dots, m \qquad (17) \]
The master p0 distributes the load J between the two slave processors p1 and p2 so as to
minimize the processing time T. Depending on the values of δ, E and C, there are three possi-
bilities:
1. Entire load is distributed to p1 only.
The total processing time is given by
T1 = C1 + E1 + δC1 = C1(1+ δ + ρ1) (18)
2. Entire load is distributed to p2 only.
The total processing time in this case is
T2 = C2 + E2 + δC2 = C2(1+ δ + ρ2) (19)
3. Load is distributed to both p1 and p2.
It can be proved that as long as C1 ≤ C2, only the schedules in Figs. 7, 8, and 9 can
be optimal for a two-slave system. These schedules are the FIFO schedule, the LIFO
schedule, and the FIFO schedule with idle time in p2.
These schedules are referred to as Schedule f , Schedule l, and Schedule g respectively.
Superscripts f , l, and g are used to distinguish the three schedules. The equations for
load fractions, processing times, and the conditions for optimality of Schedules f , l,
and g are not derived on account of space constraints. The interested reader is directed
to (Ghatpande, Nakazato, Beaumont & Watanabe, 2008) for details.
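The two single-processor cases can be evaluated directly from (17) to (19). The following Python sketch does so for a unit load; the function names and the parameter values are hypothetical and chosen only for illustration.

```python
def network_parameter(E_k, C_k):
    """Network parameter rho_k = E_k / C_k, as in (17)."""
    return E_k / C_k


def single_slave_time(C_k, E_k, delta):
    """Processing time when the whole unit load goes to one slave, (18)/(19)."""
    rho_k = network_parameter(E_k, C_k)
    return C_k * (1.0 + delta + rho_k)


# Hypothetical system parameters (not taken from the chapter).
delta = 0.5
T1 = single_slave_time(C_k=1.0, E_k=2.0, delta=delta)  # 1.0 * (1 + 0.5 + 2.0)
T2 = single_slave_time(C_k=2.0, E_k=3.0, delta=delta)  # 2.0 * (1 + 0.5 + 1.5)
```

Both values serve as baselines: by Lemma 4 in Sect. 4.1, distributing the load to both processors should not be worse than either of them.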
4.1 Optimal Schedule in Two-Slave System
A few lemmas and theorems to determine the optimal schedule for a two-slave system are
now stated without proof. Please refer to Ghatpande, Nakazato, Beaumont & Watanabe (2008)
for the proofs.
Lemma 4. It is always advantageous to distribute the load to both the processors, rather than execute
it on the individual processors (for the system model under consideration).
Lemma 5 (Idle Indicator Lemma). ρ1ρ2 ≤ δ is a necessary and sufficient condition to indicate the
presence of idle time in the FIFO schedule (i.e. Schedule g).
The simplicity of the condition to detect the presence of idle time in the FIFO schedule is both
pleasing and surprising, and has been derived for the first time ever. Further confirmation of
this condition is obtained in Sect. 4.2.
Theorem 3 (Optimal Schedule Theorem). The optimal schedule for a two-slave system can be found
as follows:
1. If δC2 > C1(1 + δ + ρ1), then Schedule l is optimal.
2. Else if δC2 ≤ C1(1 + δ + ρ1), ρ1ρ2 ≤ δ, and $C_2 \le C_1 \left( 1 + \frac{(1+\rho_1)\rho_2}{\delta(1+\delta+\rho_2)} \right)$, then Schedule g is optimal.
3. Else if δC2 ≤ C1(1 + δ + ρ1), ρ1ρ2 ≤ δ, and $C_2 > C_1 \left( 1 + \frac{(1+\rho_1)\rho_2}{\delta(1+\delta+\rho_2)} \right)$, then Schedule l is optimal.
4. Else if δC2 ≤ C1(1 + δ + ρ1), ρ1ρ2 > δ, and $T^f \le \frac{C_1 C_2}{C_2 - C_1}$, then Schedule f is optimal.
5. Else if δC2 ≤ C1(1 + δ + ρ1), ρ1ρ2 > δ, and $T^f > \frac{C_1 C_2}{C_2 - C_1}$, then Schedule l is optimal.
The optimal solution to DLSRCHETS, $(\sigma_a^{*}, \sigma_c^{*}, \alpha^{*})$, for a system with two
slave processors is a function of the system parameters and the application under consideration.
Consequently, no particular sequence of allocation and collection can be defined a priori as the
optimal sequence. The optimal solution can only be determined once all the parameters become known.
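The case analysis of Theorem 3 maps directly onto a chain of comparisons. The Python sketch below is one possible reading of it, under the following assumptions: δ > 0, C1 ≤ C2, and T^f is computed from the equivalent-processor parameters of procedure schedule_fifo given in Sect. 5. The function name two_slave_schedule is hypothetical.

```python
def two_slave_schedule(C1, E1, C2, E2, delta):
    """Return 'f', 'l' or 'g' according to the conditions of Theorem 3.

    Assumes delta > 0 and C1 <= C2 (illustrative sketch only).
    """
    rho1, rho2 = E1 / C1, E2 / C2

    # Case 1: the second link is so slow that LIFO is optimal.
    if delta * C2 > C1 * (1.0 + delta + rho1):
        return "l"

    # Cases 2 and 3: idle time is possible in the FIFO schedule.
    if rho1 * rho2 <= delta:
        bound = C1 * (1.0 + (1.0 + rho1) * rho2 / (delta * (1.0 + delta + rho2)))
        return "g" if C2 <= bound else "l"

    # Cases 4 and 5: FIFO without idle time versus LIFO.
    r1, r2 = delta + rho1, 1.0 + rho2
    denom = C1 * r1 + C2 * r2
    C_eq = C1 * C2 * (r1 + r2) / denom
    E_eq = C1 * C2 * (rho1 * rho2 - delta) / denom
    T_f = (1.0 + delta) * C_eq + E_eq
    # T_f <= C1*C2/(C2 - C1), rewritten to avoid dividing by zero when C1 == C2.
    return "f" if T_f * (C2 - C1) <= C1 * C2 else "l"
```

The last comparison is multiplied through by (C2 − C1), which is non-negative under the stated assumption, so that two identical links do not cause a division by zero.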
Fig. 7. Equivalent processor in Schedule f (panels: Original Schedule f and Equivalent
Schedule f). The total communication time remains the same as
the original two processors. The equivalent computation time is equal to the interval between
the end of allocation to p2 and the start of result collection from p1.
4.2 The Concept of Equivalent Processor
To extend the above result to the general case with m slave processors, the concept of an
equivalent processor is introduced. Consider the system in Fig. 6. The processors p1 and p2 are
replaced by a single equivalent processor p1:2 with computation parameter E1:2, connected to
the root by an equivalent link l1:2 with communication parameter C1:2. The resulting system
is called the equivalent system and the resulting schedule is known as the equivalent schedule.
The values of the parameters for the three equivalent schedules are defined below.
If the initial load distribution is α = {α1, α2}, and the processing time is T, then the equivalent
system satisfies the following properties:
• The load processed by p1:2 is α1:2 = α1 + α2 = 1.
• The processing time is unchanged and equal to T.
• The time spent in load distribution and result collection is unchanged, i.e., for all three
schedules,
– α1:2C1:2 = α1C1 + α2C2, and
– δα1:2C1:2 = δα1C1 + δα2C2.
• The time spent in load computation is equal to the intervening time interval between
the end of allocation phase and the start of result collection phase, i.e.,
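– For Schedule f, $\alpha_{1:2} E^{f}_{1:2} = \alpha_1 E_1 - \alpha_2 C_2 = \alpha_2 E_2 - \delta \alpha_1 C_1$.
– For Schedule l, $\alpha_{1:2} E^{l}_{1:2} = \alpha_2 E_2 = \alpha_1 E_1 - \alpha_2 C_2 - \delta \alpha_2 C_2$.
– For Schedule g, $\alpha_{1:2} E^{g}_{1:2} = 0$.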
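These properties can be checked numerically. The Python sketch below assumes the closed-form load fractions and equivalent parameters given by procedures schedule_fifo, schedule_lifo, and schedule_idle in Sect. 5; the function name equivalent_processor and the parameter values are illustrative assumptions.

```python
def equivalent_processor(C1, E1, C2, E2, delta, schedule):
    """Return (alpha1, alpha2, C_eq, E_eq) for Schedule 'f', 'l' or 'g'."""
    rho1, rho2 = E1 / C1, E2 / C2
    if schedule == "f":
        r1, r2, e_num = delta + rho1, 1.0 + rho2, rho1 * rho2 - delta
    elif schedule == "l":
        r1, r2, e_num = rho1, 1.0 + delta + rho2, rho1 * rho2
    elif schedule == "g":
        r1, r2, e_num = rho1, 1.0, 0.0
    else:
        raise ValueError("schedule must be 'f', 'l' or 'g'")
    denom = C1 * r1 + C2 * r2
    alpha1 = C2 * r2 / denom
    alpha2 = C1 * r1 / denom
    C_eq = C1 * C2 * (r1 + r2) / denom
    E_eq = C1 * C2 * e_num / denom
    return alpha1, alpha2, C_eq, E_eq


# Hypothetical parameters for which Schedule f has no idle time (rho1*rho2 > delta).
C1, E1, C2, E2, delta = 1.0, 2.0, 1.2, 2.4, 0.5
a1, a2, C_eq, E_eq = equivalent_processor(C1, E1, C2, E2, delta, "f")

assert abs(a1 + a2 - 1.0) < 1e-12                    # whole load is processed
assert abs(C_eq - (a1 * C1 + a2 * C2)) < 1e-12       # communication time unchanged
assert abs(E_eq - (a1 * E1 - a2 * C2)) < 1e-12       # first form of the interval
assert abs(E_eq - (a2 * E2 - delta * a1 * C1)) < 1e-12  # second form of the interval
```

For these values the equivalent communication parameter also lies between C1 and C2, in line with the theorem stated in Sect. 4.3.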
4.3 The Equivalent Processor Theorem
This leads to the following theorem: (refer to (Ghatpande, Nakazato, Beaumont & Watanabe,
2008) for proof.)
Fig. 8. Equivalent processor in Schedule l (panels: Original Schedule l and Equivalent
Schedule l). The total communication time remains the same
as the original two processors. The equivalent computation time is equal to the computation
time of p2.
Theorem 4 (Equivalent Processor Theorem). In a heterogeneous system H with m = 2, the two
slave processors p1 and p2 can be replaced without affecting the processing time T, by a single (virtual)
equivalent processor p1:2 with equivalent parameters C1:2 and E1:2, such that C1 ≤ C1:2 ≤ C2 and
E1:2 ≤ E1, E2.
The equivalent processor enables replacement of two processors by a single processor whose
communication parameter lies between the communication parameters of the original two
links. Because of this property, if the processors are arranged so
that C1 ≤ C2 ≤ . . . ≤ Cm, and two processors are combined at a time sequentially starting
from the fastest two, then the resultant equivalent processor does not disturb the order of the
sequence.
The equivalent processor for Schedule f provides additional confirmation of the condition
for the presence of idle time in a FIFO schedule. It is known that idle time can exist in a
FIFO schedule only when the intervening time interval y = 0. According to the definition of
the equivalent processor, this interval corresponds to the equivalent computation capacity
$E^{f}_{1:2}$. This value becomes zero only when ρ1ρ2 − δ = 0. Thus, if ρ1ρ2 < δ, a FIFO schedule
without idle time would require a negative interval, so idle time must exist
in the FIFO schedule.
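The sign argument can be made explicit with a short derivation. This is a sketch that takes the Schedule f load fractions from procedure schedule_fifo in Sect. 5 as given, with $r^f_1 = \delta + \rho_1$, $r^f_2 = 1 + \rho_2$, and $\alpha_{1:2} = 1$:

```latex
\begin{align*}
E^{f}_{1:2} &= \alpha_1 E_1 - \alpha_2 C_2
  = \frac{C_2 r^f_2 \cdot C_1 \rho_1 - C_1 r^f_1 \cdot C_2}{C_1 r^f_1 + C_2 r^f_2} \\
 &= \frac{C_1 C_2 \left[ \rho_1 (1 + \rho_2) - (\delta + \rho_1) \right]}{C_1 r^f_1 + C_2 r^f_2}
  = \frac{C_1 C_2 \, (\rho_1 \rho_2 - \delta)}{C_1 r^f_1 + C_2 r^f_2}.
\end{align*}
```

Since the denominator is positive, the sign of $E^{f}_{1:2}$ is the sign of $\rho_1 \rho_2 - \delta$.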
5. The SPORT Algorithm
Algorithm 1 (SPORT).
arrange p1, . . . , pm such that C1 ≤ C2 ≤ . . . ≤ Cm
σa ← 1, σc ← 1, α1 ← 1
for k := 2 to m do
    C1 ← C1:k−1, E1 ← E1:k−1, C2 ← Ck, E2 ← Ek
    /* ρ1 = E1/C1 and ρ2 = E2/C2 for the current pair, as in (17) */
    if δC2 > C1(1 + δ + ρ1) then
        /* Tl < Tf, Tg: use Schedule l */
        call schedule_lifo
    else
        /* Need to check other conditions */
        if ρ1ρ2 ≤ δ then
            /* Possibility of idle time */
            if C2 ≤ C1(1 + (1 + ρ1)ρ2 / (δ(1 + δ + ρ2))) then
                /* Tg < Tl: use Schedule g */
                call schedule_idle
                break for
            else
                /* Tl < Tg: use Schedule l */
                call schedule_lifo
            end if
        else
            /* No idle time present */
            if Tf ≤ C1C2 / (C2 − C1) then
                /* Tf < Tl: use Schedule f */
                call schedule_fifo
            else
                /* Tl < Tf: use Schedule l */
                call schedule_lifo
            end if
        end if
    end if
end for
n ← numberOfProcessorsUsed
/* Update load fractions from stored values */
αk ← αk · ∏_{j=2}^{n} α1:j    if k = 1
αk ← αk · ∏_{j=k}^{n} α1:j    if k = 2, . . . , n
T ← C1:n + E1:n + δ C1:n

Fig. 9. Equivalent processor in Schedule g (panels: Original Schedule g, with idle time x2, and
Equivalent Schedule g). The total communication time remains the same
as the original two processors. The equivalent computation time is equal to zero as the result
collection begins immediately after the allocation phase ends.
The procedures in the algorithm are given below:
procedure schedule_idle
    α1:k−1 ← C2 / (C1ρ1 + C2)
    αk ← C1ρ1 / (C1ρ1 + C2)
    /* Update sequences for FIFO */
    σa ← {σa, k}
    σc ← {σc, k}
    /* Compute equivalent processor parameters */
    C1:k ← C1C2(1 + ρ1) / (C1ρ1 + C2)
    E1:k ← 0
    numberOfProcessorsUsed ← k
    return
procedure schedule_lifo
    r^l_1 ← ρ1
    r^l_2 ← 1 + δ + ρ2
    α1:k−1 ← C2 r^l_2 / (C1 r^l_1 + C2 r^l_2)
    αk ← C1 r^l_1 / (C1 r^l_1 + C2 r^l_2)
    /* Update sequences for LIFO */
    σa ← {σa, k}
    σc ← {k, σc}
    /* Compute equivalent processor parameters */
    C1:k ← C1C2(r^l_1 + r^l_2) / (C1 r^l_1 + C2 r^l_2)
    E1:k ← C1C2ρ1ρ2 / (C1 r^l_1 + C2 r^l_2)
    numberOfProcessorsUsed ← k
    return
procedure schedule_fifo
    r^f_1 ← δ + ρ1
    r^f_2 ← 1 + ρ2
    α1:k−1 ← C2 r^f_2 / (C1 r^f_1 + C2 r^f_2)
    αk ← C1 r^f_1 / (C1 r^f_1 + C2 r^f_2)
    /* Update sequences for FIFO */
    σa ← {σa, k}
    σc ← {σc, k}
    /* Compute equivalent processor parameters */
    C1:k ← C1C2(r^f_1 + r^f_2) / (C1 r^f_1 + C2 r^f_2)
    E1:k ← C1C2(ρ1ρ2 − δ) / (C1 r^f_1 + C2 r^f_2)
    numberOfProcessorsUsed ← k
    return

Fig. 10. The building of SPORT solution. At each step only two processors are involved
(the state space remains constant). The optimal schedule for two processors can be easily
computed in constant time using simple if-then-else statements in Theorem 3.
5.1 Algorithm Explanation
At the start, the processors are arranged so that C1 ≤ C2 ≤ . . . ≤ Cm, and two processors
with the fastest communication links are selected. The optimal schedule and load distribution
for the two processors are found according to Theorem 3. If Schedule f or l is found optimal,
then the two processors are replaced by their equivalent processor. In either case, since C1 ≤
C1:2 ≤ C2, the ordering of the processors does not change. In the subsequent iteration, the
equivalent processor and the processor with the next fastest communication link are selected
and the steps are repeated until either all processors are used up, or Schedule g is found to be
optimal. If Schedule g is found to be optimal in any iteration, then the algorithm exits after
finding the load distribution for that iteration.
The computation of the allocation and collection sequences is straightforward. The allocation
sequence σa is maintained in the order of decreasing communication link bandwidth of the
processors. Irrespective of the schedule found optimal in iteration k, k is always appended to
σa. The collection sequence σc is constructed as follows:
• If Schedule f or g is found optimal in iteration k, k is appended to σc.
• If Schedule l is found optimal in iteration k, k is prepended to σc.

Fig. 11. Calculating the load fractions in SPORT. α′1 is the initial value of α1. It is multiplied by
the product term in (20) to get the final value of α1 = α1:n · α1:n−1 · · · α1:2 · α′1. This is equivalent
to traversing the binary tree from the root to the leaf nodes and taking the product of all nodes
(values) encountered. This calculation can be implemented in O(m) time by starting with αm
and storing the intermediate values.

The calculation of load distribution to the processors occurs simultaneously with the search
for the optimal schedule. As shown in Fig. 11, the algorithm creates a one-sided binary tree of
load fractions. If the number of processors participating in the computation is n, 2 ≤ n ≤ m,
the root node of the binary tree is α1:n and the leaf nodes represent the final load fractions
allocated to the processors. The value of the root node need not be calculated as it is equal to
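one. The individual load fractions, αk, are initially assigned value α′k (say), and then updated
at the end as:
\[ \alpha_k = \begin{cases} \alpha'_k \cdot \prod_{j=2}^{n} \alpha_{1:j} & \text{if } k = 1 \\ \alpha'_k \cdot \prod_{j=k}^{n} \alpha_{1:j} & \text{if } k = 2, \dots, n \end{cases} \qquad (20) \]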
This is equivalent to traversing the binary tree from the root to each leaf node and taking the
product of the nodes encountered (see Fig. 11). This calculation can be easily implemented in
O(m) time by starting with the computation of αn, and storing the values of the product terms
(i.e. ∏ α1:j) for each processor and then using that value for the next processor.
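The O(m) update can be sketched in Python as a single backward pass that carries the running product. The sketch assumes that iteration k (k = 2, . . . , n) has stored a pair consisting of the share kept by the equivalent processor p1:k−1 and the share α′k given to pk, and that the root value α1:n equals one; the function name and the data layout are illustrative assumptions.

```python
def final_load_fractions(steps):
    """Turn per-iteration shares into final load fractions in O(n) time.

    steps[i] = (share of p_{1:k-1}, share of p_k) stored in iteration k = i + 2.
    Returns [alpha_1, ..., alpha_n].
    """
    n = len(steps) + 1
    alpha = [0.0] * n
    running = 1.0  # product of the equivalent-processor shares seen so far
    for idx in range(len(steps) - 1, -1, -1):
        prev_share, new_share = steps[idx]
        alpha[idx + 1] = new_share * running  # leaf for processor k = idx + 2
        running *= prev_share                 # descend one level in the tree
    alpha[0] = running                        # leaf for processor 1
    return alpha


# Hypothetical shares for n = 3: iteration 2 splits 0.6/0.4, iteration 3 splits 0.5/0.5.
fractions = final_load_fractions([(0.6, 0.4), (0.5, 0.5)])
```

Each stored value is visited exactly once, and the fractions sum to one whenever each stored pair does.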
Once the sequences (σa, σc) and load distribution α are found, calculating the processing time
is straightforward. The processing time is simply the sum of the (equivalent) parameters of
the equivalent processor p1:n, i.e., T = C1:n + E1:n + δ C1:n.
In SPORT, defining the allocation sequence by sorting the values of Ck requires O(m log m)
time, while finding the collection sequence and load distribution requires O(m) time in the
worst case. Thus, if sorted values of Ck are given, then the overall complexity of the algorithm
is polynomial in m and is equal to O(m).
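The complete procedure can be put together as in the following Python sketch. It is one possible reading of Algorithm 1 for a unit load, not the authors' implementation. It assumes δ > 0 and positive Ck and Ek, numbers the slaves 1, . . . , m after sorting by Ck, and folds the three procedures into one loop; the function name sport is an illustrative assumption.

```python
def sport(C, E, delta):
    """Return (sigma_a, sigma_c, alpha, T) for unit load (illustrative sketch)."""
    order = sorted(range(len(C)), key=lambda i: C[i])
    C = [C[i] for i in order]
    E = [E[i] for i in order]

    sigma_a, sigma_c = [1], [1]
    C1, E1 = C[0], E[0]   # parameters of the equivalent processor p_{1:k-1}
    steps = []            # (share of p_{1:k-1}, share of p_k) per iteration

    for k in range(2, len(C) + 1):
        C2, E2 = C[k - 1], E[k - 1]
        rho1, rho2 = E1 / C1, E2 / C2

        # Select the schedule using the conditions of Theorem 3.
        if delta * C2 > C1 * (1.0 + delta + rho1):
            choice = "l"
        elif rho1 * rho2 <= delta:
            bound = C1 * (1.0 + (1.0 + rho1) * rho2
                          / (delta * (1.0 + delta + rho2)))
            choice = "g" if C2 <= bound else "l"
        else:
            rf1, rf2 = delta + rho1, 1.0 + rho2
            df = C1 * rf1 + C2 * rf2
            T_f = ((1.0 + delta) * C1 * C2 * (rf1 + rf2)
                   + C1 * C2 * (rho1 * rho2 - delta)) / df
            # T_f <= C1*C2/(C2 - C1), written without the division.
            choice = "f" if T_f * (C2 - C1) <= C1 * C2 else "l"

        # Ratios and equivalent computation numerator for the chosen schedule.
        if choice == "f":
            r1, r2, e_num = delta + rho1, 1.0 + rho2, rho1 * rho2 - delta
        elif choice == "l":
            r1, r2, e_num = rho1, 1.0 + delta + rho2, rho1 * rho2
        else:
            r1, r2, e_num = rho1, 1.0, 0.0

        denom = C1 * r1 + C2 * r2
        steps.append((C2 * r2 / denom, C1 * r1 / denom))
        sigma_a.append(k)
        if choice == "l":
            sigma_c.insert(0, k)   # LIFO: collect from p_k first
        else:
            sigma_c.append(k)      # FIFO (with or without idle time)

        # Replace the pair by its equivalent processor.
        C1, E1 = C1 * C2 * (r1 + r2) / denom, C1 * C2 * e_num / denom
        if choice == "g":
            break                  # remaining processors get no load

    # Final load fractions, as in (20), via one backward pass.
    alpha = [0.0] * (len(steps) + 1)
    running = 1.0
    for idx in range(len(steps) - 1, -1, -1):
        prev_share, new_share = steps[idx]
        alpha[idx + 1] = new_share * running
        running *= prev_share
    alpha[0] = running

    T = C1 + E1 + delta * C1
    return sigma_a, sigma_c, alpha, T
```

The returned sequences refer to the sorted numbering of the slaves. For example, sport([1.0, 10.0], [1.0, 1.0], 0.5) selects Schedule l, so the second slave appears first in the collection sequence.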
5.2 Simulations and Analysis
The performance of SPORT was compared to four algorithms, viz. OPT, FIFOC, LIFOC, and
ITERLP. The globally optimal schedule OPT is obtained after evaluation of the linear
program for all possible (m!)² permutations of (σa, σc). In FIFOC, processors are allocated load
and results are collected in the order of decreasing communication link bandwidth of the
processors. In LIFOC, load allocation is in the order of decreasing communication link bandwidth
of the processors, while result collection is in the reverse order, i.e., increasing communication
link bandwidth of the processors. ITERLP (Ghatpande, Beaumont, Nakazato & Watanabe, 2008) is
a near-optimal algorithm for DLSRCHETS. To explore the effects of system parameter values
on the performance of the algorithms, several sets of simulations were carried out:
Set 1 Homogeneous network and homogeneous processors
Set 2 Homogeneous network and heterogeneous processors
Set 3 Heterogeneous network and homogeneous processors
Set 4 Heterogeneous network and heterogeneous processors

Table 1. Minimum statistics for SPORT simulations. In sets 1 and 2, the minimum errors in
LIFOC are 2 orders of magnitude higher than SPORT, ITERLP, and FIFOC. In sets 3 and 4,
FIFOC error is 2 to 3 orders of magnitude higher than the other three algorithms.
Set  m   δ = 0.2                                   δ = 0.5
         SPORT     ITERLP    LIFOC     FIFOC       SPORT     ITERLP    LIFOC     FIFOC
1    4   5.73e-03  4.32e-03  8.08e-01  5.76e-03    2.20e-02  1.06e-02  1.07e+00  2.21e-02
1    5   7.89e-04  6.90e-04  7.21e-01  7.89e-04    5.40e-03  4.21e-03  9.63e-01  5.30e-03
2    4   1.01e-02  5.78e-03  8.41e-01  1.01e-02    2.37e-02  1.43e-02  1.15e+00  2.40e-02
2    5   3.34e-03  2.10e-03  7.93e-01  3.34e-03    1.06e-02  8.92e-03  1.10e+00  1.07e-02
3    4   2.03e-01  1.80e-03  1.05e-01  1.61e+00    1.12e-01  5.13e-03  9.59e-02  4.43e+00
3    5   3.96e-01  1.90e-01  8.90e-02  1.75e+00    5.34e-02  9.32e-02  5.13e-02  4.74e+00
4    4   4.95e-06  1.97e-16  4.92e-06  1.05e+00    3.09e-02  2.77e-15  3.09e-02  3.23e+00
4    5   1.08e-02  5.81e-04  2.75e-06  1.15e+00    5.84e-02  2.18e-03  5.84e-02  3.74e+00

Table 2. Maximum statistics for SPORT simulations. In sets 1 and 2, the maximum errors in
LIFOC are 2 orders of magnitude higher than SPORT, ITERLP, and FIFOC. In sets 3 and 4,
FIFOC error is 2 to 3 orders of magnitude higher than the other three algorithms.
Set  m   δ = 0.2                                   δ = 0.5
         SPORT     ITERLP    LIFOC     FIFOC       SPORT     ITERLP    LIFOC     FIFOC
1    4   5.34e-02  3.09e-02  3.11e+00  5.61e-02    1.84e-01  7.57e-02  4.20e+00  2.02e-01
1    5   8.24e-02  4.87e-02  3.00e+00  8.79e-02    2.26e-01  1.19e-01  3.91e+00  2.30e-01
2    4   3.03e-02  1.69e-02  1.83e+00  3.06e-02    9.35e-02  4.93e-02  3.10e+00  1.10e-01
2    5   3.66e-02  2.61e-02  2.24e+00  3.68e-02    1.15e-01  8.34e-02  2.75e+00  1.26e-01
3    4   4.01e-01  3.42e-01  4.66e-01  2.02e+00    4.03e-01  2.22e-01  4.03e-01  5.44e+00
3    5   5.31e-01  3.86e-01  4.84e-01  2.30e+00    5.45e-01  3.80e-01  4.16e-01  6.05e+00
4    4   1.32e+00  6.50e-01  8.84e-01  4.47e+00    8.02e-01  7.11e-01  4.00e-01  1.12e+01
4    5   1.56e+00  7.66e-01  4.34e-01  4.85e+00    9.35e-01  8.97e-01  4.24e-01  1.15e+01

The error values with respect to the optimal are calculated. Over 500,000 simulation runs
are carried out. Further details can be obtained in (Ghatpande, Beaumont, Nakazato &
Watanabe, 2008; Ghatpande, Nakazato, Beaumont & Watanabe, 2008). The minimum and maximum
mean error values of each algorithm are tabulated in Tables 1 and 2. It can be observed that in
sets 1 and 2, the minimum and maximum errors in LIFOC are 2 orders of magnitude higher
than SPORT, ITERLP, and FIFOC. On the other hand, in sets 3 and 4, FIFOC error is 2 to 3
orders of magnitude higher than the other three algorithms.

Fig. 12. Comparison of wall-clock time for SPORT, LIFOC, and FIFOC (x-axis: Number of
Processors, 50 to 500; y-axis: Required Computation Time in Seconds, logarithmic scale from
0.0001 to 100). SPORT is two orders
of magnitude faster than LIFOC and almost four orders of magnitude faster than FIFOC. This
figure appears in (Ghatpande, Nakazato, Beaumont & Watanabe, 2008).

There is a significant downside to LIFOC because it uses all available processors: the
time required to compute the optimal solution (wall-clock time) is almost two orders
of magnitude greater than that of SPORT, as seen in Fig. 12. These values were obtained
by averaging the wall-clock time to compute a solution over 1000 runs. The results show
that though both SPORT and LIFOC are O(m) algorithms given a set of processors sorted
by decreasing communication bandwidth, SPORT is clearly the better performing algorithm,
with the best cost-performance ratio for large values of m. The values for FIFOC are almost
four orders of magnitude larger than SPORT. The extensive simulations show that:
• If network links are homogeneous, then LIFOC performance is affected for both homo-
geneous and heterogeneous computation speeds.
• If network links are heterogeneous, then FIFOC performance is affected for both homo-
geneous and heterogeneous computation speeds.
• SPORT performance is also affected to a certain degree by the heterogeneity in network
links and computation speeds, but since SPORT does not use a single predefined se-
quence of allocation and collection, it is able to better adapt to the changing system
conditions.
• ITERLP performance is somewhat better than SPORT, but is computationally expen-
sive. SPORT generates similar schedules at a fraction of the cost.
6. Conclusion
In this chapter, the DLSRCHETS problem for the scheduling of divisible loads on heteroge-
neous master-slave systems and considering the result collection phase was formulated and
analysed. A new polynomial-time algorithm, SPORT, was proposed and tested. Future work
can proceed in the following main directions:
Theoretical Analysis The complexity of DLSRCHETS is still an open issue. It makes for an
interesting research topic. Is it at all possible that DLSRCHETS can be solved in poly-
nomial time? Does imposition of some additional constraints make it tractable? What
are those conditions?
Extending the System Model This area has a large number of possibilities for future work.
Scheduling purists may consider the system model used in this chapter to be quite
simplistic. As future work, the conditions (constraints on values of Ek and Ck) that
minimize the error need to be found. An interesting area would be the investigation of
the effect of affine cost models, processor deadlines and release times. Another impor-
tant area would be to extend the results to multi-installment delivery and multi-level
processor trees.
Modification of DLSRCHETS DLSRCHETS may be modified to account for dynamism and
uncertainty in the system parameters, non-clairvoyance, non-omniscience
of the master, node (slave) turnover (failure), slave sharing, multiple jobs on one master,
multiple masters, multiple jobs on several masters, decentralization of scheduling
decision (P2P model), QoS requirements, and buffer, bandwidth, and computation constraints
on slaves.
Application Development All the testing in this work has been carried out using simula-
tions. It will be interesting to see how the algorithms perform in practice. New and
different applications apart from the number of possible scientific applications men-
tioned in the introduction, need to be developed that use the results in this work. This
may require development of new libraries and middleware to support the computation
models considered.
7. References
Adler, M., Gong, Y. & Rosenberg, A. L. (2003). Optimal sharing of bags of tasks in heteroge-
neous clusters, SPAA ’03: Proceedings of the fifteenth annual ACM symposium on Parallel
algorithms and architectures, ACM, New York, NY, USA, pp. 1–10.
Barlas, G. D. (1998). Collection-aware optimum sequencing of operations and closed-form
solutions for the distribution of a divisible load on arbitrary processor trees, 9(5): 429–
441.
Beaumont, O., Casanova, H., Legrand, A., Robert, Y. & Yang, Y. (2005). Scheduling divisible
loads on star and tree networks: Results and open problems, 16(3): 207–218.
Beaumont, O., Marchal, L., Rehn, V. & Robert, Y. (2005). FIFO scheduling of divisible loads
with return messages under the one-port model, Research Report 2005-52, LIP, ENS
Lyon, France.
Beaumont, O., Marchal, L., Rehn, V. & Robert, Y. (2006). FIFO scheduling of divisible loads
with return messages under the one port model, Proc. Heterogeneous Computing Work-
shop HCW’06.
Beaumont, O., Marchal, L. & Robert, Y. (2005). Scheduling divisible loads with return mes-
sages on heterogeneous master-worker platforms, Research Report 2005-21, LIP, ENS
Lyon, France.
Bharadwaj, V., Ghose, D., Mani, V. & Robertazzi, T. G. (1996). Scheduling Divisible Loads in
Parallel and Distributed Systems, IEEE Computer Society Press, Los Alamitos, CA.
Cheng, Y.-C. & Robertazzi, T. G. (1990). Distributed computation for a tree network with
communication delays, 26(3): 511–516.
Comino, N. & Narasimhan, V. L. (2002). A novel data distribution technique for host-client
type parallel applications, 13(2): 97–110.
Dantzig, G. B. (1963). Linear Programming and Extensions, Princeton Univ. Press, Princeton, NJ.
Ghatpande, A., Beaumont, O., Nakazato, H. & Watanabe, H. (2008). Divisible load scheduling
with result collection on heterogeneous systems, Proc. Heterogeneous Computing Workshop
(HCW 2008) held in the IEEE Intl. Parallel and Distributed Processing Symposium
(IPDPS 2008), Miami, FL.
Ghatpande, A., Nakazato, H., Beaumont, O. & Watanabe, H. (2008). SPORT: An algorithm
for divisible load scheduling with result collection on heterogeneous systems, IEICE
Transactions on Communications E91-B(8).
Robertazzi, T. (2008). Divisible (partitionable) load scheduling research.
URL: http://www.ece.sunysb.edu/ tom/dlt.html#THEORY
Rosenberg, A. (2001). Sharing partitionable workload in heterogeneous NOWs: Greedier is
not better, IEEE International Conference on Cluster Computing, Newport Beach, CA,
pp. 124–131.
Vanderbei, R. J. (2001). Linear Programming: Foundations and Extensions, Vol. 37 of International
Series in Operations Research & Management, 2nd edn, Kluwer Academic Publishers.
URL: http://www.princeton.edu/ rvdb/LPbook/online.html
Yu, D. & Robertazzi, T. G. (2003). Divisible load scheduling for grid computing, Proc. Inter-
national Conference on Parallel and Distributed Computing Systems (PDCS 2003), Vol. 1,
Los Angeles, CA, USA.
Parallel and Distributed Computing
Edited by Alberto Ros
ISBN 978-953-307-057-5
Hard cover, 290 pages
Publisher InTech
Published online 01, January, 2010
Published in print edition January, 2010
How to reference
In order to correctly reference this scholarly work, feel free to copy and paste the following:
Abhay Ghatpande, Hidenori Nakazato and Olivier Beaumont (2010). Scheduling of Divisible Loads on
Heterogeneous Distributed Systems, Parallel and Distributed Computing, Alberto Ros (Ed.), ISBN: 978-953-
307-057-5, InTech, Available from: http://www.intechopen.com/books/parallel-and-distributed-
computing/scheduling-of-divisible-loads-on-heterogeneous-distributed-systems
© 2010 The Author(s). Licensee IntechOpen. This chapter is distributed
under the terms of the Creative Commons Attribution-NonCommercial-
ShareAlike-3.0 License, which permits use, distribution and reproduction for
non-commercial purposes, provided the original is properly cited and
derivative works building on this content are distributed under the same
license.
