A Realistic Model and an Efficient Heuristic
for Scheduling with Heterogeneous Processors
Olivier Beaumont, Vincent Boudet and Yves Robert
LIP, UMR CNRS–ENS Lyon–INRIA 5668
École Normale Supérieure de Lyon
69364 Lyon Cedex 07, France
e-mail: Firstname.Lastname@ens-lyon.fr
Abstract
Scheduling computational tasks on processors is a
key issue for high-performance computing. Although
a large number of scheduling heuristics have been pre-
sented in the literature, most of them target only ho-
mogeneous resources. Moreover, these heuristics of-
ten rely on a model where the number of processors
is bounded but where the communication capabilities of
the target architecture are not restricted. In this pa-
per, we deal with a more realistic model for heteroge-
neous networks of workstations, where each processor
can send and/or receive at most one message at any
given time-step. First, we state a complexity result that
shows that the model is at least as diﬃcult as the stan-
dard one. Then, we show how to modify classical list
scheduling techniques to cope with the new model. Next
we introduce a new scheduling heuristic which incor-
porates load-balancing criteria into the decision pro-
cess of scheduling and mapping ready tasks. Exper-
imental results conducted using six classical testbeds
(LAPLACE, LU, STENCIL, FORK-JOIN, DOOLIT-
TLE, and LDMt) show very promising results.
1 Introduction
The eﬃcient scheduling of application tasks is crit-
ical to achieving high performance in parallel and dis-
tributed systems. The objective of scheduling is to ﬁnd
a mapping of the tasks onto the processors, and to order
the execution of the tasks so that: (i) task precedence
constraints are satisﬁed; and (ii) a minimum schedule
length is provided.
Task graph scheduling is usually studied using the
so-called macro-dataﬂow model, which is widely used
in the scheduling literature: see the survey papers [17,
1, 4, 8] and the references therein. This model was
introduced for homogeneous processors, and has been
(straightforwardly) extended for heterogeneous computing
resources. In short, there is a limited number
of computing resources, or processors, to execute the
tasks. Communication delays are taken into account
as follows: let task T be a predecessor of task T′ in
the task graph. If both tasks are assigned to the same
processor, no communication overhead is paid, and the
execution of T′ can start right at the end of the execution
of T. On the contrary, if T and T′ are assigned to two
different processors Pi and Pj, a communication delay
is paid. More precisely, if Pi finishes the execution of
T at time-step t, then Pj cannot start the execution
of T′ before time-step t + comm(T,T′,Pi,Pj), where
comm(T,T′,Pi,Pj) is the communication delay (which
depends upon both tasks T and T′ and both processors
Pi and Pj). Because memory accesses are typically one
order of magnitude cheaper than inter-processor
communications, it makes good sense to neglect them when
T and T′ are assigned to the same processor.
However, the major ﬂaw of the macro-dataﬂow
model is that communication resources are not lim-
ited. First, a processor can send (or receive) any num-
ber of messages in parallel, hence an unlimited num-
ber of communication ports is assumed (this explains
the name macro-dataﬂow for the model). Second, the
number of messages that can simultaneously circulate
between processors is not bounded, hence an unlimited
number of communications can simultaneously occur
on a given link. In other words, the communication network
is assumed to be contention-free, which of course
is not realistic as soon as the number of processors exceeds
a handful.
Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'02)
1530-2075/02 $17.00 © 2002 IEEE
We strongly believe that the macro-dataflow task
graph scheduling model should be modified to take
communication resources into account. Recent papers
[13, 14, 23, 22] made a similar statement and introduced
variants of the model (see the discussion in
Section 2). In this paper, we suggest using the bi-directional
one-port architectural model, where each
processor can communicate (send and/or receive) with
at most one other processor at a given time-step. In
other words, a given processor can simultaneously send
a message, receive another message, and perform some
(independent) computation. The next section (Sec-
tion 2) is devoted to a brief discussion of all these
scheduling models: (i) the macro-dataﬂow model ex-
tended to deal with heterogeneous resources, (ii) the
variants suggested in the literature referenced above,
and (iii) the bi-directional one-port model.
The new one-port model turns out to be computa-
tionally even more diﬃcult than the macro-dataﬂow
model: in Section 3, we prove that scheduling a sim-
ple fork graph with an unlimited number of homo-
geneous processors is NP-hard. Note that this prob-
lem has polynomial complexity in the macro-dataﬂow
model [11]: we have to resort to fork-join graphs to get
NP-completeness in the macro-dataﬂow model [4].
An impressive list of scheduling heuristics has been
proposed in the literature for the macro-dataﬂow model
with a limited number of homogeneous processors (see
the tutorial [1] and the references therein). More re-
cently, several heuristics have been introduced to deal
with diﬀerent-speed processors [16, 18, 26, 21, 3]. Un-
fortunately, all these heuristics assume no restriction
on the communication resources, which renders them
somewhat unrealistic to model real-life applications.
Section 4 is devoted to the design and analysis of a
new heuristic targeted to scheduling task graphs with
a limited number of diﬀerent-speed processors, under
the bi-directional one-port communication model.
In Section 5, we report simulation results from
comparisons conducted using six classical testbeds:
LAPLACE, LU, STENCIL, FORK-JOIN, DOOLIT-
TLE, and LDMt. We obtain very favorable results. Fi-
nally, some concluding remarks are given in Section 6.
2 Models
2.1 The macro-dataﬂow model
In this section, we briefly recall the macro-dataflow
model, which is widely used in the scheduling literature [1].
This model was introduced for homogeneous
processors but has been extended to deal with heterogeneous
computing resources. For each task scheduling
algorithm, the input is composed of two entities:
(i) a directed, vertex-weighted, edge-weighted acyclic
graph G = (V,E,w,c) that models the application
to be scheduled; (ii) a set of computing resources P =
(P,t,link) that models the target computing resources
(processors and communication network).
V = {vi : i = 1,··· ,N} is a set of N nodes (or
tasks). Each task v ∈ V has a nonnegative compu-
tation cost w(v) which is deﬁned as the amount of
computation cycles needed to process it. P = {Pi :
i = 1,··· ,p} is a set of p processors. Each processor
Pi has a cycle-time ti, which is deﬁned as the inverse
of its (relative) speed. For instance, if processor P1 is
twice as fast as processor P2, then t2 = 2t1. The
number of time-steps required to execute a task v on
processor Pi is the product w(v)×ti of the task compu-
tation cost by the processor cycle-time. If all processors
are identical, then we let ti = 1 for 1 ≤ i ≤ p. For each
task vi, σ(vi) is the time-step at which its execution be-
gins. We let alloc(vi) be the number of the processor
which vi is assigned to. Any processor can compute and
communicate simultaneously, but can execute at most
one task at each time-step. Tasks are non-preemptive:
once started on a given processor, their execution must
continue until completion.
Each edge ei,j ∈ E corresponds to a precedence con-
straint from task vi to task vj and is labeled with a
communication volume data(i,j), which is the num-
ber of data items to be transferred from vi to vj after
the execution of vi. For each edge eij : vi → vj, if
vi is executed on processor Pq and vj on processor Pr
(in other words if alloc(vi) = q and alloc(vj) = r),
we have the scheduling constraint σ(vi) + w(vi) × tq +
comm(i,j,q,r) ≤ σ(vj) which states that the execution
of vj on Pr cannot start before the end of the execution
of vi on Pq, i.e. σ(vi)+w(vi)×tq, plus some communi-
cation overhead comm(i,j,q,r) that is detailed below
(and chosen to be zero whenever q = r).
The communication matrix link models the time
needed to transfer a single data item from one processor
to another. As stated above, we assume that the
main diagonal of the 2D matrix link is composed of zero
entries. The communication overhead is
comm(i,j,q,r) = data(i,j) × link(q,r), i.e.
the product of the message length by the per-item
transfer time of the communication link.
The objective function is to minimize
the makespan, or scheduling length, i.e.
$\max_{v \in V} \left( \sigma(v) + w(v) \times t_{\mathrm{alloc}(v)} \right)$. This scheduling
problem is NP-complete in the macro-dataflow model,
even for simple fork-join graphs with an infinite
number of same-speed processors (ti = 1 for all i)
and a fully homogeneous communication network
(link(i,j) = 1 for all i ≠ j): see [4].
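The scheduling rules of this section can be checked mechanically. The following is a minimal sketch, assuming that a schedule is given as two dictionaries `sigma` and `alloc` and that processors are numbered from 0; the function name `check_schedule` and the data layout are illustrative choices, not part of the model. It verifies the precedence constraints (with communication delays) and the one-task-at-a-time rule, then returns the makespan.

```python
def check_schedule(w, data, t, link, sigma, alloc):
    """Return the makespan if the schedule is valid, else raise ValueError.

    w[v]         : computation cost of task v
    data[(i, j)] : communication volume on edge i -> j
    t[q]         : cycle-time of processor q
    link[q][r]   : time to move one data item from q to r (0 when q == r)
    sigma[v]     : start time-step of task v
    alloc[v]     : processor executing task v
    """
    finish = {v: sigma[v] + w[v] * t[alloc[v]] for v in w}
    # Precedence constraints, including the communication delay.
    for (i, j), volume in data.items():
        comm = volume * link[alloc[i]][alloc[j]]
        if finish[i] + comm > sigma[j]:
            raise ValueError(f"edge {i}->{j} violated")
    # A processor executes at most one task at a time (non-preemptive).
    by_proc = {}
    for v in w:
        by_proc.setdefault(alloc[v], []).append((sigma[v], finish[v]))
    for q, intervals in by_proc.items():
        intervals.sort()
        for (_, f1), (s2, _) in zip(intervals, intervals[1:]):
            if s2 < f1:
                raise ValueError(f"overlap on processor {q}")
    return max(finish.values())


# Hypothetical instance: a fork with six unit children on five processors.
w = {i: 1 for i in range(7)}
data = {(0, i): 1 for i in range(1, 7)}
t = [1] * 5
link = [[0 if q == r else 1 for r in range(5)] for q in range(5)]
alloc = {0: 0, 1: 0, 2: 0, 3: 1, 4: 2, 5: 3, 6: 4}
sigma = {0: 0, 1: 1, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2}
makespan = check_schedule(w, data, t, link, sigma, alloc)
```

Note that this checker only encodes the macro-dataflow rules: it does not limit the number of simultaneous messages per processor.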
2.2 Communication-aware models from the literature
Communication-aware models restrict the use of
communication links in various manners. In the model
proposed by Sinnen and Sousa [23, 22, 24], the un-
derlying communication network is no longer fully-
connected. There are a limited number of communica-
tion links, and each processor is provided with a routing
table which specifies the links to be used to communicate
with another processor (hence the routing is fully
static). The major modiﬁcation is that at most one
message can circulate on one link at a given time-step,
so that contention for communication resources is taken
into account.
Similarly, Hollermann et al. [13] and Hsu et al. [14]
target networks of processors and introduce the follow-
ing model: each processor can either send or receive a
message at a given time-step (bidirectional communi-
cation is not possible); also, there is a ﬁxed latency
between the initiation of the communication by the
sender and the beginning of the reception by the re-
ceiver. Still, the model is rather close to the standard
one-port model discussed below.
Finally, note that there are several other papers that
include restrictions on the communication resources:
these include work by Tan et al. [25], Orduna et al. [19]
and Roig et al. [20].
2.3 The bi-directional one-port model
As stated above, communication resources are taken
into account for the bi-directional one-port model. This
is quite natural, and quite similar to the assumptions
made for computation resources in the macro-dataﬂow
model.
Formally, we keep all previous notations and
scheduling rules, and we add the following new rule: at
a given time-step, any processor can communicate with
at most one other processor in each direction, i.e. it can send
to at most one processor and receive from at most one processor. We also
assume communication/computation overlap (but as before,
a processor can execute at most one task at each
step). Note that several communications can occur
in parallel, provided that they involve disjoint pairs
of sending/receiving processors. The one-port model
nicely models switches like Myrinet that can imple-
ment permutations [6] or even multiplexed bus archi-
tectures [15].
Several variants could be considered: no commu-
nication/computation overlap, uni-directional commu-
nications, or even a combination of both restrictions.
But the bi-directional one-port model seems closer to
the actual capabilities of modern processors.
Figure 1. Task graph for the example (parent v0 with
children v1, ..., v6): all weights (nodes and communications) are
equal to 1.
Serializing communications performed by the processors
has a dramatic impact on the scheduling
makespan. Consider the following simple example of
the task graph represented in Figure 1: w(vi) = 1 for
0 ≤ i ≤ 6 and data(0,i) = 1 for 1 ≤ i ≤ 6. Assume five
same-speed processors and a fully homogeneous network:
ti = 1 for 1 ≤ i ≤ 5 and link(i,j) = 1 for
1 ≤ i,j ≤ 5, i ≠ j. In the macro-dataflow model, we
assign v0 and the first two children v1 and v2 to processor
P0. We assign one of the remaining children v3, v4,
v5 and v6 to each remaining processor. Processor P0
executes task v0 at time-step 0; then P0 can perform
all four communications in parallel at time-step 1.
The total makespan is then equal to 3. In the one-port
model, the same allocation of tasks to processors would
lead to a makespan of at least 6: 1 for the parent task,
4 for the four messages to be sent sequentially, and 1
for the last task to be executed. One optimal solution
is to assign three children tasks to P0 and each of the three
remaining child tasks to a distinct processor (which leaves
one processor unused), for a makespan equal to 5. It
is clear that communications from the parent node to
the children have become the bottleneck. Of course we
could use larger task graphs and greater communication
costs to come up with arbitrarily large differences
in the makespans.
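The three makespans quoted in this example can be reproduced with a few lines of code. The sketch below is illustrative only: the helper `fork_makespan` is a hypothetical name, and it assumes same-speed processors, link = 1, one dedicated processor per remote child, and messages sent in index order.

```python
def fork_makespan(w0, children, local, one_port=True):
    """Makespan of a fork graph on same-speed processors with link = 1.

    children : list of (weight, data) pairs, one per child task
    local    : set of child indices executed on the parent's processor;
               every other child is assumed to get its own processor
    """
    # The parent's processor runs the parent, then its local children.
    local_end = w0 + sum(children[i][0] for i in local)
    remote_end = 0
    clock = w0  # time at which the parent's output port becomes free
    for i, (weight, volume) in enumerate(children):
        if i in local:
            continue
        if one_port:
            clock += volume        # messages leave one after another
            arrival = clock
        else:
            arrival = w0 + volume  # macro-dataflow: all sent in parallel
        remote_end = max(remote_end, arrival + weight)
    return max(local_end, remote_end)


kids = [(1, 1)] * 6                                      # six unit children
macro = fork_makespan(1, kids, {0, 1}, one_port=False)   # 3
same_alloc = fork_makespan(1, kids, {0, 1})              # 6
three_local = fork_makespan(1, kids, {0, 1, 2})          # 5
```

Keeping a third child on the parent's processor trades one unit of local computation for one fewer serialized message, which is why the makespan drops from 6 to 5.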
3 Complexity
In this section, we prove an NP-completeness result
for the one-port scheduling model. An N-children
fork-graph is a task-graph of N + 1 nodes labeled
v0,v1,...,vN, as illustrated in Figure 2. There is an
edge directed from the parent node v0 to each child
node vi, 1 ≤ i ≤ N. To simplify notations, we let
wi = w(vi) for 0 ≤ i ≤ N and di = data(0,i) for
1 ≤ i ≤ N. We target a simple architecture with an
unlimited number of same-speed processors and a fully
homogeneous communication network. With the above
notations, p = N + 1 (we never need more processors
than tasks), ti = 1 for 1 ≤ i ≤ p, link(i,j) = 1 for
1 ≤ i,j ≤ p, i ≠ j (and link(i,i) = 0 for 1 ≤ i ≤ p).
Figure 2. A fork-graph: the parent v0 (weight w0) sends di data items to each child vi (weight wi), 1 ≤ i ≤ N.
Given a fork graph and the target architecture (un-
limited number of same-speed processors connected
through a fully homogeneous network), the decision
problem is the following:
Deﬁnition 1 FORK-SCHED(G,P,T): Given a fork-
graph G of N+1 nodes, a set P of an unlimited number
of same-speed processors connected through a fully ho-
mogeneous network, and given a time-bound T, is there
a valid schedule σ whose makespan is not greater than
T?
Theorem 1 The FORK-SCHED(G,P,T) decision
problem is NP-complete.
Proof We use a reduction from 2-PARTITION, a
well-known NP-complete problem [9]: given a set of
n integers A = {a1,...,an}, is there a partition of
{1,...,n} into two subsets A1 and A2 such that
$$\sum_{i \in A_1} a_i = \sum_{i \in A_2} a_i \; ?$$
We start with an arbitrary instance of 2-
PARTITION, i.e. a set A = {a1,...,an} of n integers.
We have to polynomially transform this instance into
an instance of the FORK-SCHED problem which has a
solution iﬀ the original instance of 2-PARTITION has
a solution.
We let $2S = \sum_{i=1}^{n} a_i$ (if the sum is odd there is no
solution to the instance of 2-PARTITION). Let
$M = \max_{1 \le i \le n} a_i$ and $m = \min_{1 \le i \le n} a_i$. We construct the
following instance of FORK-SCHED:
• the fork-graph has N +1 nodes, where N = n+3
• the parent node v0 has weight w0 = 0
• for 1 ≤ i ≤ n, the i-th child node vi has weight
wi = 10(M + ai + 1)
• the last three children have the same weight
wn+1 = wn+2 = wn+3 = 10(M + m) + 1. Let
wmin denote this common value; it is indeed the
minimum of wi, 1 ≤ i ≤ n + 3.
• data volumes: for 1 ≤ i ≤ n + 3, di = wi
• time bound: $T = \frac{1}{2} \sum_{i=1}^{n} w_i + 2 w_{min} = 5n(M + 1) + 10S + 20(M + m) + 2$
Note that wmin ≤ wi ≤ 2wmin for 1 ≤ i ≤ n
(straightforward veriﬁcation). Clearly, the size of the
constructed instance of FORK-SCHED is polynomial
(even linear) in the size of the original instance of 2-
PARTITION.
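The construction is easy to reproduce on small inputs. The following sketch builds the weights, data volumes and time bound from a list of positive integers; the function name `build_fork_sched_instance` and the returned tuple layout are our own illustrative choices.

```python
def build_fork_sched_instance(a):
    """Map a 2-PARTITION instance (positive integers with an even sum)
    to the child weights, data volumes and time bound of the construction."""
    total = sum(a)
    if total % 2:
        raise ValueError("odd sum: the 2-PARTITION instance has no solution")
    n, S, M, m = len(a), total // 2, max(a), min(a)
    w_min = 10 * (M + m) + 1
    # Children 1..n come from the integers; the last three share w_min.
    w = [10 * (M + ai + 1) for ai in a] + [w_min] * 3
    d = list(w)  # data volumes: d_i = w_i
    T = 5 * n * (M + 1) + 10 * S + 2 * w_min
    return w, d, T, w_min


w, d, T, w_min = build_fork_sched_instance([1, 2, 3, 4])
# Here M = 4, m = 1, S = 5, so w_min = 51 and T = 252.
```

The instance has n + 3 children and is produced in a single pass over the input, which matches the linear-size claim above.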
Assume that the original instance of 2-PARTITION
admits a solution: let A1 and A2 be a partition of
{1,...,n} such that $\sum_{i \in A_1} a_i = \sum_{i \in A_2} a_i = S$. We
derive a scheduling for the instance of FORK-SCHED
as follows:
• Processor P0 is assigned the execution of node v0,
nodes vi,i ∈ A1 and nodes vn+1 and vn+2. Obvi-
ously, P0 needs exactly T units of time to process
these tasks.
• Each other node is assigned to a distinct processor,
hence we are using |A2|+1 processors in addition
to P0
• The ordering of the communication messages sent
by P0 is by increasing values of the index i; in
particular, the last message sent by P0 is to node
vn+3
• The processor responsible for node vn+3 completes
the reception of the message from P0 at time-step
$\sum_{i \in A_2} d_i + d_{n+3}$, and terminates the execution at
time-step $\sum_{i \in A_2} d_i + d_{n+3} + w_{min} = T$
• All the other processors terminate their execution
earlier, because they receive their message not
later than $\sum_{i \in A_2} d_i$ and their execution time wi is
not greater than 2wmin
Therefore, we have derived a valid scheduling that
matches the time-bound, hence a solution to the
FORK-SCHED instance.
Reciprocally, assume that the FORK-SCHED
instance admits a solution, i.e. a valid scheduling
σ that achieves the time-bound T. Let
P0 be the processor which executes v0, and
$\mathcal{A} = \{i, 1 \le i \le n+3, \mathrm{alloc}(v_i) = 0\}$ be the index
set of the tasks assigned to P0. The processing
time of P0 is thus at least $A = \sum_{i \in \mathcal{A}} w_i$. All the
remaining tasks are assigned to other processors
than P0. The processor which receives the last
message from P0 to execute a task, say, vlast (whose
index is not in $\mathcal{A}$), cannot complete execution before
time-step $B = \sum_{i \notin \mathcal{A}, 1 \le i \le n+3} d_i + w_{last}$. Since σ
achieves the time bound, max(A,B) ≤ T. But
$A + B = \sum_{i=1}^{n+3} w_i + w_{last} = 2T + w_{last} - w_{min}$,
hence A = B = T and wlast = wmin. Since A = B,
A ≡ B mod 10, hence $\mathcal{A}$ contains exactly two indices of
the set {n+1,n+2,n+3}. We let A1 be equal to $\mathcal{A}$ minus
these two indices and A2 = {1,...,n}\A1 to derive
a solution to the original instance of 2-PARTITION.
4 Heuristics
In this section, we introduce a new heuristic for the
one-port model. This heuristic builds upon ideas from
the HEFT heuristic and from the ILHA heuristic, both
designed for the macro-dataﬂow model. We brieﬂy re-
view these heuristics before discussing their adaptation
to the one-port model.
4.1 HEFT for the macro-dataﬂow model
In this section we brieﬂy describe the Heterogeneous
Earliest Finish Time (HEFT) heuristic introduced by
Topcuoglu, Hariri and Wu [26] for the macro-dataﬂow
model. This heuristic is a natural extension of list-
scheduling heuristics to cope with heterogeneous resources.
In particular, HEFT builds upon the older
Modified Critical Path heuristic [10, 7] and uses bottom
levels to assign priorities to tasks.
More precisely, the HEFT heuristic works as follows:
• the task graph is traversed so that the bottom level
of each task is computed. The bottom level of a
task is deﬁned as the length of the longest path
that leads to an exit node in the graph (intuitively,
the longer the path, the more urgent the task).
• bottom levels are used to assign priorities to tasks
• at each step, a ready task (i.e. a task whose prede-
cessors have all been scheduled) with highest pri-
ority is selected for scheduling
• the task is assigned to the processor that allows
the earliest completion time, taking into account all
previous decisions; the task is then marked “scheduled”
and the list of ready tasks is updated
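The first step of the list above, computing bottom levels, can be sketched as a memoized traversal of the task graph. This is only an illustration: the function name `bottom_levels` and the dictionary-based graph encoding are assumptions, the task's own weight is counted in its bottom level, and the computation and communication times passed in are assumed to be already averaged over the processors and links.

```python
def bottom_levels(succ, weight, comm):
    """Bottom level of every task of a DAG.

    succ[v]      : list of successors of task v
    weight[v]    : (averaged) computation time of v
    comm[(u, v)] : (averaged) communication time on edge u -> v
    """
    levels = {}

    def visit(v):
        if v not in levels:
            # Longest path from v to an exit node, v's own weight included.
            tail = max((comm[(v, s)] + visit(s) for s in succ[v]), default=0)
            levels[v] = weight[v] + tail
        return levels[v]

    for v in succ:
        visit(v)
    return levels


# Hypothetical diamond-shaped graph: a -> b, a -> c, b -> d, c -> d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
weight = {"a": 1, "b": 2, "c": 1, "d": 1}
comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
levels = bottom_levels(succ, weight, comm)
priority_order = sorted(levels, key=levels.get, reverse=True)
```

The recursion depth equals the length of the longest path, so an iterative traversal in reverse topological order would be preferable for very deep graphs.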
Further explanations are in order. First, how should
bottom levels be computed with different-speed processors?
Because the length of a path in the graph is the sum
of computation and communication times, we need to
properly average those to deﬁne bottom levels in this
context, as explained below.
As for computation times, assume that there are p
available processors of respective cycle-times t1,...,tp.
Assume also that there is a collection of several independent
tasks of total weight W. Ideally, these tasks
should be distributed to processors so that the load is
equally balanced. Processor Pi should receive a fraction
ci (with 0 ≤ ci ≤ 1) of the total weight W such
that its processing time (ciW)ti is the same as that of
all processors. We derive
$$c_i = \frac{1/t_i}{\sum_{j=1}^{p} 1/t_j}.$$
The tasks are processed within $W / \sum_{i=1}^{p} \frac{1}{t_i}$ time-units by
the p processors. We deduce that the weight w(T) of a
given task should be estimated by the quantity
$$\frac{p \times w(T)}{\sum_{i=1}^{p} 1/t_i}$$
when computing bottom-levels.
Similarly, the weight of a communication should be
multiplied by a factor estimated to the average band-
width of the links (replace link(q,r) by the inverse of
the harmonic mean). Note that all communication
costs are accounted for in the calculation of bottom
levels. In other words, it is (conservatively) estimated
that communications cannot be avoided (by assigning
the source and the sink to the same processor).
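As a quick numerical check of the load-balancing fractions and of the averaged task weight, the sketch below evaluates both formulas with exact rational arithmetic. The function names are hypothetical, and integer cycle-times are assumed so that `Fraction` can be used directly.

```python
from fractions import Fraction


def load_fractions(t):
    """c_i = (1/t_i) / sum_j (1/t_j): share of the work for processor i."""
    inv = [Fraction(1, ti) for ti in t]
    total = sum(inv)
    return [x / total for x in inv]


def averaged_weight(w, t):
    """Estimate p * w / sum_i (1/t_i), used when computing bottom levels."""
    return len(t) * w / sum(Fraction(1, ti) for ti in t)


# Hypothetical platform: one processor of cycle-time 1, one of cycle-time 2.
t = [1, 2]
c = load_fractions(t)          # [2/3, 1/3]
avg = averaged_weight(3, t)    # 2 * 3 / (3/2) = 4
# Every processor finishes its share of a total work W = 6 at the same time.
finish_times = [ci * 6 * ti for ci, ti in zip(c, t)]
```

With these fractions each processor needs exactly W divided by the sum of the inverse cycle-times, here 6 / (3/2) = 4 time-units.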
4.2 ILHA for the macro-dataﬂow model
In a previous paper [3], we have introduced the Iso-
Level Heterogeneous Allocation (ILHA) heuristic for
the macro-dataﬂow model. In a word, the main charac-
teristic of the ILHA heuristic is a better load-balancing
at each decision step, which is achieved by considering
a chunk of several ready tasks rather than a single one;
the idea is to allocate to each processor a number of the
tasks in the chunk that is proportional to its computing
power.
We have outlined how to achieve a good load-
balancing in the previous section: each processor Pi
with cycle-time ti should receive a fraction ci of the
total work of size W to be executed. There is a slight
complication due to the fact that tasks are indivisible
units of computation, so that the values ci may have
to be replaced by approximations. For instance when
W corresponds to n independent tasks, each requiring
the same amount of work, Pi receives ci × n tasks, which
is an integer for all 1 ≤ i ≤ p only if n is a multiple
of $C = \mathrm{lcm}(t_1, t_2, \ldots, t_p) \sum_{j=1}^{p} \frac{1}{t_j}$, a quantity that
may be very large. For the general case, the following
algorithm provides the best solution [2]:
Optimal distribution
∀i ∈ {1,...,p}, $c_i = \left\lfloor \frac{1/t_i}{\sum_{j=1}^{p} 1/t_j} \times n \right\rfloor$
for m = c1 + c2 + ... + cp to n
    find k ∈ {1,...,p} such that tk × (ck + 1) = min{ti × (ci + 1)}
    ck = ck + 1
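The distribution algorithm translates almost line by line into code. The following is a sketch under the assumption of integer cycle-times (so that exact rational arithmetic applies); the function name `optimal_distribution` is ours, and the loop is written as "while fewer than n tasks have been handed out" so that exactly n tasks are distributed.

```python
import math
from fractions import Fraction


def optimal_distribution(t, n):
    """Distribute n identical tasks over processors with cycle-times t."""
    total = sum(Fraction(1, ti) for ti in t)
    # Round every ideal share c_i * n down to an integer number of tasks.
    c = [math.floor(Fraction(1, ti) / total * n) for ti in t]
    # Hand out the leftover tasks one at a time, each to the processor
    # that would finish its enlarged share the earliest.
    while sum(c) < n:
        k = min(range(len(t)), key=lambda i: t[i] * (c[i] + 1))
        c[k] += 1
    return c


cycle_times = [6] * 5 + [10] * 3 + [15] * 2
perfect = optimal_distribution(cycle_times, 38)
uneven = optimal_distribution([1, 2], 10)
```

For the ten-processor platform, n = 38 is a multiple of C, so the rounded shares already sum to n and the loop body never runs; for the two-processor case one leftover task goes to the faster processor.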
We are ready for a ﬁrst outline of the ILHA heuristic.
We split the task graph into levels made up of indepen-
dent tasks, by considering the tasks that will be ready
at the same time-step. In other words, two tasks belong
to the same level if they have the same top-level, using
the terminology of [27]. This is done by a traversal of
the graph. Initially, the 0-level is composed of the en-
try tasks. The (i+1)-th level groups the tasks that are
ready when the i-th level is achieved. A ﬁrst version of
the ILHA algorithm is the following: we traverse the
task graph to split it into levels made of independent
tasks. We compute the number of tasks that we allo-
cate to each processor using the load-balancing algo-
rithm. Once this is done, we have to determine exactly
which task is given to each processor. The criterion is to
minimize the communication costs. So for each task of
the level, we consider its predecessors. If they are all
allocated to the same processor, we try to allocate the
task to the same processor (i.e. if the processor may
receive another task), otherwise, we allocate the task
to the fastest processor that is not yet saturated (able
to receive new tasks according to the load-balancing
strategy).
In the previous version of the ILHA algorithm, we
process all the ready tasks at each step. In some cases,
it would be better to take into account the bottom level
of the ready tasks and to consider ﬁrst the tasks on a
critical path. To this purpose, we sort the ready tasks
according to their bottom level. Then, we introduce
a parameter B, the maximal number of ready tasks
that will be considered at each step. We consider those
B tasks with the highest bottom levels and we allocate
them using the load balancing algorithm. Then, we up-
date the set of ready tasks (indeed some new tasks may
have become ready) and we re-sort them according to
their bottom level. Thus, we expect that the tasks on
a critical path will be processed as soon as possible.
Unfortunately, we face a tradeoﬀ for choosing an ap-
propriate value for B. On one hand if B is large, it will
be possible to better balance the load and minimize the
communication cost. On the other hand, a small value
of B will enable us to process the tasks on the critical
path sooner. Of course B must be at least equal to the
number of processors, otherwise some processors would
be kept idle. We obtain the ﬁnal version of the ILHA
algorithm:
The ILHA algorithm
Compute the bottom level of each task
ReadyTask ← {Entry tasks} sorted by decreasing value of their bottom level
While ReadyTask is not empty
    Take the first B tasks of ReadyTask
    Compute the optimal distribution with B tasks
    For each task t of ReadyTask
        If all predecessors of t are on p and p is free
            Assign t to p
    For each task t of ReadyTask not yet assigned
        Assign t to the first free processor
    Update the list ReadyTask by inserting the new ready tasks in the sorted list
End while
The ILHA heuristic was compared [3] with ﬁve
heuristics taken from the literature: the minimum
Partial Completion Time static priority (PCT) heuris-
tic [16], the Best Imaginary Level (BIL) heuristic [18],
the Critical Path on a Processor (CPOP) heuristic [26],
the Generalized Dynamic Level (GDL) heuristic [21]
and the previous HEFT heuristic. For the experimen-
tal comparisons, we have used six classical testbeds
(LAPLACE, LU, STENCIL, FORK-JOIN, DOOLIT-
TLE, and LDMt: see Section 5). All these comparisons
showed that the best results are obtained for HEFT and
ILHA. We now proceed to adapting these two heuris-
tics to the one-port model.
4.3 HEFT for the one-port model
Modifying HEFT for the one-port model is not difficult.
When the highest priority ready-task is selected,
we still search for the processor that allows earliest
completion time. But now we have to take the constraints
on communication resources into account. This means that in addition
to scheduling the selected task we must also schedule
any incoming communications. Since we have
access to current communication schedules for all processors,
we can assign the new communications as early
as possible, in a greedy fashion.
Consider the following example with three proces-
sors P1, P2 and P3. Assume that the selected task
T has two incoming edges, one from a task T1 al-
ready allocated to processor P1 and the other from
a task T2 already allocated to processor P2. If we
try to allocate the selected task to P1, we can ne-
glect the ﬁrst communication. We schedule the com-
munication from P2 to P1 as soon as possible, with
the one-port constraint: we look for the ﬁrst avail-
able time-interval during which P2 is not sending and
P1 is not receiving. This interval must start after the
completion of the source task on P2 and must be long
enough so that the entire communication, of duration
comm(T1,T,P2,P1) = data(T1,T) × link(P2,P1), can
take place. In passing, note that the model can eas-
ily be extended to the case where the interconnection
network is such that messages must be routed between
some processor pairs: if there is no direct link from
P2 to P1, we redo the previous step for all interme-
diate messages between adjacent processors. Having
scheduled the communications, we can now look at the
computation schedule of P1 to ﬁnd the earliest possi-
ble starting time for the execution of the selected task.
We thus derive the completion time on P1. We take the
minimum completion time on P1, P2, and P3 to decide
which processor will execute the task.
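The search for the first available time-interval can be sketched as a single sweep over the busy periods of the two ports involved. This is an illustration only: the function name `earliest_slot` and the representation of a port schedule as a list of half-open (start, end) intervals are assumptions.

```python
def earliest_slot(send_busy, recv_busy, ready, duration):
    """First start time >= ready such that [start, start + duration) avoids
    every busy interval of the sender's output port and of the receiver's
    input port. Intervals are (start, end) pairs with the end excluded."""
    busy = sorted(send_busy + recv_busy)
    start = ready
    for b_start, b_end in busy:
        if b_end <= start:
            continue              # interval already behind the candidate
        if b_start >= start + duration:
            break                 # the gap before it is large enough
        start = b_end             # collision: retry right after it
    return start


# Hypothetical port schedules: P2 is sending during [0,2) and [3,4),
# P1 is receiving during [2,3); the source task completes at time 1.
slot = earliest_slot([(0, 2), (3, 4)], [(2, 3)], ready=1, duration=2)
```

In this example no gap of length 2 exists before time 4, so the message is scheduled in [4, 6). The caller would then record that interval in both port schedules before handling the next communication.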
4.4 ILHA for the one-port model
Modifying ILHA for the one-port model is more
challenging. This is because several ready tasks are
dealt with simultaneously, which requires handling many
more communications.
As before, we select B (independent) ready tasks of
highest priority (i.e. of largest bottom level). Let W be
the sum of the weights of these B tasks. To minimize
the number of communications, we then proceed in two
steps:
Step 1 We scan the list of B tasks (starting with high-
est priorities ﬁrst), and check whether a given task
T can be assigned without generating any communication,
i.e. whether all the parents of T have
already been allocated to the same processor, say
Pi. In that case, we allocate T to processor Pi,
provided that the current workload of Pi does not
exceed the fraction ciW of the total work. Here
ci is the value returned by the load-balancing al-
gorithm discussed above; if Pi happens to have
already received its share of the work, we do noth-
ing and proceed to the next task in the list, until
the list is exhausted.
Step 2 At the end of Step 1, some tasks have been
allocated to processors. We remove them from
the list of B ready tasks, which we scan a second
time, in the same order. We use the same strategy
as in HEFT to allocate the tasks: we select the
processor that allows for the earliest completion
time.
Figure 3. A toy example (nodes a0, a1, a2, a3, b0, b1, b2, b3, ab1, ab2): all computation and
communication costs are equal to 1.
To exemplify the diﬀerences between HEFT and
ILHA, consider the toy-example represented in Fig-
ure 3 and assume two same-speed processors P0
and P1 are available (t0 = t1 = link(P0,P1) =
1). The bottom level of all the children nodes is
the same, so assume they are ranked in the order
a1,a2,a3,ab1,ab2,b3,b2,b1. In the following, a task
that would end at the same time-step on both proces-
sors is always assigned to P0 (this is just an arbitrary
way to break ties).
Figure 4. HEFT and ILHA scheduling for the
toy example (Gantt charts over time t, with rows P0, P1, P0 → P1 and P1 → P0).
HEFT ﬁrst schedules a0 on P0 and b0 on P1. Then
a1 is assigned to P0. Next a2 is assigned to P0 again,
because of the tie-breaking rule. After that a3 goes to
P1, and so on: see Figure 4.
ILHA also starts by scheduling a0 on P0 and b0 on
P1. After that, if B ≥ 8, ILHA benefits from its global
view: it assigns no-communication tasks, i.e. a1, a2
and a3 to P0 and b3, b2 and b1 to P1. Note that
c0 = c1 = 0.5, hence each processor could receive
up to 4 tasks in this allocation step. Next we turn
to HEFT scheduling, as outlined in Figure 4. Note
that the makespan is smaller, but also that the number
of communications has been dramatically reduced.
Reducing communications while achieving a good load
balance is the objective that has guided the design of
ILHA.
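The first scan of ILHA on the toy example can be sketched as follows. Several points are assumptions made for illustration: the function name `ilha_step1`, the reading of the quota test as "the load after adding the task must not exceed ci W", and the edge set of the toy graph, which is inferred from the node names (a0 is the parent of the a and ab nodes, b0 the parent of the b and ab nodes).

```python
def ilha_step1(tasks, parents, alloc, weight, share):
    """Step 1 sketch: place the tasks that need no communication.

    tasks   : the B ready tasks, highest priority first
    parents : parents[v] lists the predecessors of task v
    alloc   : processor of every task placed so far (updated in place)
    weight  : weight[v] is the computation cost of task v
    share   : share[q] = c_q * W, the work quota of processor q
    Returns the tasks left for Step 2, in the same order.
    """
    load = {q: 0 for q in share}
    remaining = []
    for v in tasks:
        procs = {alloc[u] for u in parents[v]}
        if len(procs) == 1:
            q = next(iter(procs))
            if load[q] + weight[v] <= share[q]:
                alloc[v] = q
                load[q] += weight[v]
                continue
        remaining.append(v)  # needs a message, or quota exhausted
    return remaining


order = ["a1", "a2", "a3", "ab1", "ab2", "b3", "b2", "b1"]
parents = {v: ["a0"] for v in ("a1", "a2", "a3")}
parents.update({v: ["b0"] for v in ("b1", "b2", "b3")})
parents.update({v: ["a0", "b0"] for v in ("ab1", "ab2")})
alloc = {"a0": 0, "b0": 1}
weight = {v: 1 for v in order}
share = {0: 4, 1: 4}  # c0 = c1 = 0.5 and W = 8
left = ilha_step1(order, parents, alloc, weight, share)
```

Under these assumptions only ab1 and ab2 are left for the second, HEFT-like scan, which agrees with the allocation described above.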
Note that several variations in the design of ILHA
could be implemented. First, there is no reason to limit
the scan of the B ready tasks to those tasks incur-
ring no communication. We could add another scan
for tasks that can be scheduled at the price of a single
communication, and so on. Second, and more impor-
tantly, we could limit the use of HEFT at Step 2 to a
pre-allocation of tasks to processors, and re-schedule all
communications in a third step. In other words, after
Step 2 all tasks have been allocated to processors. We
can forget about the schedule times computed during
Step 1 or during Step 2 (using HEFT) and keep only
the allocation function. We then try to re-schedule the
whole set of B tasks: indeed, the scheduling has been
made simpler, because the allocation is known. Unfor-
tunately, this scheduling problem remains NP-complete
(see the proof in the Appendix). Still, we could use
greedy-like heuristics to improve the scheduling after
the allocation resulting from the two scans.
5 Simulation results
5.1 Testbeds
In order to compare the diﬀerent algorithms, we con-
sider six classical kernels representing various types of
parallel algorithms. The selected task graphs are:
• LU: LU decomposition
• LAPLACE: Laplace equation solver
• STENCIL: stencil algorithm
• FORK-JOIN: fork-join graph
• DOOLITTLE: Doolittle reduction
• LDMt: LDMt decomposition
Miniature versions of each task graph are shown in
Figures 5 and 6.
Figure 5. The first three testbeds: the LU, LAPLACE and STENCIL task graphs.
Figure 6. The last three testbeds: the FORK-JOIN, DOOLITTLE and LDMt task graphs.
5.2 Weights and speeds
Task weights For the LAPLACE, STENCIL, and
FORK-JOIN testbeds, all tasks have the same weight,
which we normalize to 1. For the linear algebra
testbeds, i.e. LU, DOOLITTLE and LDMt, the situation
is more complicated, because the amount of work
to be done at each step of the algorithm is not constant
(see [12, 5]). For the LU kernel, the weight of a task at
level k is N − k, where N is the size of the graph. For
the DOOLITTLE and LDMt kernels, the weight of
a task at level k is k, where k varies from 1 to N, the
size of the graph.
Processor speeds We use 10 processors: five processors
with cycle time 6, three processors with cycle
time 10, and two processors with cycle time 15. Remember
that the time to execute a task is the product
of its weight by the processor cycle-time. The
speedup that can be achieved with these 10 processors
is bounded as follows.
With this set of processors, the smallest value that
perfectly load-balances the work is B = 38. Indeed, in
30 time-units each processor of cycle time 6 executes
5 tasks (hence 25 tasks for the five of them), each
processor of cycle time 10 executes 3 tasks (hence
9 tasks), and each processor of cycle time 15 executes
2 tasks (hence 4 tasks). So in 30 time-units
we process 25 + 9 + 4 = 38 tasks. To compute these
38 tasks sequentially, using one of the fastest
processors, we would need 38 × 6 = 228 time-units.
So we may improve the sequential time by a factor of at
most 228/30 = 7.6. Note that this is only an upper bound,
since all communication costs are neglected here, and
since it is assumed that no dependence would keep any
processor idle at any time-step.
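The arithmetic above can be checked with a short Python sketch. The function names are illustrative assumptions (they do not come from the paper); the computation only restates the reasoning of the paragraph: over a period equal to the least common multiple of the cycle times, a processor of cycle time t executes period/t unit-weight tasks.

```python
from fractions import Fraction
from math import lcm  # variadic lcm requires Python 3.9+


def perfect_balance_chunk(cycle_times):
    """Return (period, chunk): in `period` time-units, with period the lcm of
    the cycle times, the processors execute `chunk` unit-weight tasks with no
    idle time."""
    period = lcm(*cycle_times)
    chunk = sum(period // t for t in cycle_times)
    return period, chunk


def speedup_bound(cycle_times):
    """Upper bound on the speedup over one fastest processor, neglecting all
    communication costs and all dependences."""
    period, chunk = perfect_balance_chunk(cycle_times)
    sequential_time = chunk * min(cycle_times)
    return Fraction(sequential_time, period)


# The platform used in the experiments: 5 x cycle time 6, 3 x 10, 2 x 15.
cycle_times = [6] * 5 + [10] * 3 + [15] * 2
period, chunk = perfect_balance_chunk(cycle_times)
print(period, chunk, float(speedup_bound(cycle_times)))  # 30 38 7.6
```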
Communication costs For each testbed, we let
the communication costs be proportional to the task
weights: indeed in each kernel, we always communicate
the data that has just been updated. In other words,
the communication cost from a task v to a task v′ is
equal to c times the weight of v, where c is a parameter
that models the communication-to-computation ratio
of the target platform. Because we want to stress the
impact of communication costs, we use a large value
for c: we let c = 10, which is rather representative of
workstations linked with a slow (Ethernet) network.
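The cost model of this subsection can be summarized by the following minimal Python sketch. The helper names and signatures are assumptions made for illustration only; `level` stands for k and `size` for N.

```python
# Kernels whose tasks all have the same weight, normalized to 1.
UNIT_WEIGHT_KERNELS = {"LAPLACE", "STENCIL", "FORK-JOIN"}


def task_weight(kernel, level, size):
    """Weight of a task at level `level` (k) in a graph of size `size` (N)."""
    if kernel in UNIT_WEIGHT_KERNELS:
        return 1
    if kernel == "LU":
        return size - level
    if kernel in {"DOOLITTLE", "LDMt"}:
        return level
    raise ValueError(f"unknown kernel: {kernel}")


def execution_time(weight, cycle_time):
    """Time to execute a task: its weight times the processor cycle-time."""
    return weight * cycle_time


def communication_cost(sender_weight, c=10):
    """Cost of sending the data updated by a task: c times the task weight."""
    return c * sender_weight


# A level-3 LU task in a graph of size 100, run on a processor of cycle time 6.
w = task_weight("LU", 3, 100)
print(w, execution_time(w, 6), communication_cost(w))  # 97 582 970
```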
5.3 Results
Figure 7. Comparison of HEFT and ILHA for
the FORK-JOIN problem, with 10 processors
and a communication cost equal to 10.
[Plot: ratio (execution time)/(sequential time), from 1.53 to 1.59, against problem size, from 100 to 500; curves: ILHA, HEFT.]
We start with the FORK-JOIN kernel: see Figure 7.
We see that HEFT and ILHA lead to the same schedule.
The value of B has no impact on ILHA in this case
(we used B = 38 in the experiments). The speedup is
quite limited: the gain is 1.58, to be compared with
the theoretical bound of 7.6. But in fact, the schedule
found by both heuristics is efficient. Indeed, to
reach a speedup factor s with homogeneous processors,
we would have to generate $\frac{s-1}{s} \times N$ communications,
where N is the number of intermediate nodes in the
graph. Even if we succeeded in overlapping all these
communications with computations, we would not do
better than $c \times \frac{s-1}{s} \times N + 3 \times wt$ time-steps, where w
is the task weight, t the processor cycle-time and c the
communication cost. The sequential time is equal to
$(N + 2) \times wt$, hence the speedup is $s \approx \frac{wt}{c \, \frac{s-1}{s}}$ (with N
large enough). This leads to $s \le \frac{wt}{c} + 1$. Here, with
t = 6, c = 10 and w = 1, the bound is 1.6: contrary
to appearances, 1.58 turns out to be a very good
result!
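The steps behind this bound can be spelled out as follows. This is only a sketch of the argument above; the names $T_{\mathrm{par}}$ and $T_{\mathrm{seq}}$ are notation introduced here for the parallel and sequential times, and the approximation holds for N large enough.

```latex
\begin{align*}
T_{\mathrm{par}} &\ge c \,\frac{s-1}{s}\, N + 3\,wt,
\qquad
T_{\mathrm{seq}} = (N+2)\, wt, \\
s = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}}
  &\le \frac{(N+2)\, wt}{c \,\frac{s-1}{s}\, N + 3\,wt}
  \approx \frac{wt}{c \,\frac{s-1}{s}}
  = \frac{s}{s-1} \cdot \frac{wt}{c}, \\
s - 1 &\le \frac{wt}{c}
\quad\Longrightarrow\quad
s \le \frac{wt}{c} + 1 = \frac{1 \times 6}{10} + 1 = 1.6 .
\end{align*}
```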
We continue with the LU decomposition kernel: see
Figure 8. Here the best value for B has been experi-
mentally found to be B = 4. This small value can be
explained as follows: the shape of the LU task graph
is such that the critical path must be executed rapidly,
hence the need for a smaller value of B. We point out
that HEFT and ILHA achieve similar performances
for n = 100, but ILHA gains more and more as the
problem size increases. For n = 500 ILHA obtains a
speedup equal to 5, while HEFT is limited to 4.5.
Figure 8. Comparison of HEFT and ILHA for
the LU problem, with 10 processors and a
communication cost equal to 10.
[Plot: ratio (execution time)/(sequential time), from 3.8 to 5.4, against problem size, from 100 to 500; curves: ILHA, HEFT.]
Results are similar for the LAPLACE, LDMt and
DOOLITTLE kernels (see Figures 9, 10 and 11). For
each of them, ILHA gains roughly 10% over HEFT.
For LAPLACE we used B = 38: all nodes are on a
critical path, and a larger value of B allows both to
load-balance computations and to minimize communications.
For n = 500, ILHA achieves a speedup
equal to 5.6. For DOOLITTLE and LDMt, the best
value for B is B = 20, a tradeoff between a good load
balance and an early processing of the critical path.
The speedup for LDMt is 4.9, and for DOOLITTLE, it
is equal to 4.4. The gain over HEFT is significant.
Finally for the STENCIL kernel (see Figure 12), we
observe a new phenomenon: for both heuristics, the
speedup decreases as the problem size increases. This
can be explained as follows: as the graph becomes
larger, we have to use all processors in parallel on each
row of the graph, and this induces many communica-
tions to be done sequentially, and these become the
bottleneck. ILHA obtains a low speedup equal to 2.7,
slightly better than HEFT which reaches 2.4. The op-
timal value for B is B = 38.
We point out that the best results for ILHA have
been obtained by trying several values for B. Unfortunately,
we have not found any systematic technique
to predict the optimal value of B. Note however
that the range of B is limited: with equal-size
tasks and p processors of cycle-times $t_1, t_2, \ldots, t_p$, we
can sample the interval [1..M], where the value
$M = \mathrm{lcm}(t_1, t_2, \ldots, t_p) \sum_{i=1}^{p} \frac{1}{t_i}$
ensures a perfect load balancing.
Figure 9. Comparison of HEFT and ILHA for
the LAPLACE problem, with 10 processors
and a communication cost equal to 10.
[Plot: ratio (execution time)/(sequential time), from 4.4 to 5.8, against problem size, from 100 to 500; curves: ILHA, HEFT.]
6 Conclusion
In this paper we have argued that the (bi-directional)
one-port scheduling model is more realistic
than the macro-dataflow model to design and
analyse the execution of parallel algorithms on networks
of workstations. Indeed, the scarcity of communication
resources is fully taken into account, just
as the scarcity of computing resources was dealt with
in the macro-dataflow model with a limited number of
processors.
We have assessed the intrinsic complexity of task
graph scheduling under the one-port model. The
NP-completeness result obtained for fork graphs is no
surprise at all, but motivates the design of efficient
heuristics. We have shown how to extend the HEFT
heuristic [26] to cope with the new model. The HEFT
heuristic was already an extension of critical path scheduling
to heterogeneous computing resources, and we showed
how to serialize communications in accordance with the
one-port constraint.
We have also introduced a new heuristic, ILHA,
whose design is motivated by (i) the search for a better
load-balance and (ii) the generation of fewer
communications. These goals are achieved by scheduling a
chunk of ready tasks simultaneously, which enables
a global view of the potential communications.
Preliminary experiments conducted on six classical testbeds
show very promising results. Still, there is room for
further analysis and improvements of the ILHA heuristic,
as well as more extensive experimental validation
and comparisons.
References
[1] B. A. Shirazi, A. Hurson, and K. Kavi. Scheduling
and load balancing in parallel and distributed systems.
IEEE Computer Science Press, 1995.
Figure 10. Comparison of HEFT and ILHA for
the LDMt problem, with 10 processors and a
communication cost equal to 10.
[Plot: ratio (execution time)/(sequential time), from 3.8 to 5, against problem size, from 100 to 500; curves: ILHA, HEFT.]
[2] V. Boudet, F. Rastello, and Y. Robert. A proposal
for a heterogeneous cluster ScaLAPACK (dense linear
solvers). In H. R. Arabnia, editor, International Con-
ference on Parallel and Distributed Processing Tech-
niques and Applications (PDPTA’99). CSREA Press,
1999. Extended version available as LIP Technical Re-
port RR-99-17.
[3] V. Boudet and Y. Robert. Scheduling heuristics for
heterogeneous processors. In 2001 International Con-
ference on Parallel and Distributed Processing Tech-
niques and Applications (PDPTA’2001), pages 2109–
2115. CSREA Press, 2001. Extended version available
(on the Web) as Technical Report 2001-22, LIP, ENS
Lyon.
[4] P. Chrétienne, E. G. Coffman Jr., J. Lenstra, and Z. Liu,
editors. Scheduling Theory and its Applications. John
Wiley and Sons, 1995.
[5] M. Cosnard, M. Marrakchi, Y. Robert, and D. Trys-
tram. Parallel Gaussian elimination on a MIMD com-
puter. Parallel Computing, 6:275–296, 1988.
[6] D. E. Culler and J. P. Singh. Parallel Computer Ar-
chitecture: A Hardware/Software Approach. Morgan
Kaufmann, San Francisco, CA, 1999.
[7] A. Darte, Y. Robert, and F. Vivien. Scheduling and
Automatic Parallelization. Birkhäuser, 2000.
[8] H. El-Rewini, H. Ali, and T. Lewis. Task scheduling
in multiprocessing systems. Computer, 28(12):27–37,
1995.
[9] M. R. Garey and D. S. Johnson. Computers
and Intractability, a Guide to the Theory of NP-
Completeness. W. H. Freeman and Company, 1991.
[10] A. Gerasoulis and T. Yang. A comparison of cluster-
ing heuristics for scheduling DAGs on multiprocessors.
J. Parallel and Distributed Computing, 16(4):276–291,
Dec. 1992.
Figure 11. Comparison of HEFT and ILHA for
the DOOLITTLE problem, with 10 processors
and a communication cost equal to 10.
[Plot: ratio (execution time)/(sequential time), from 3.2 to 4.6, against problem size, from 100 to 500; curves: ILHA, HEFT.]
[11] A. Gerasoulis and T. Yang. On the granularity and
clustering of directed acyclic task graphs. IEEE Trans.
Parallel and Distributed Systems, 4(6):686–701, 1993.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations.
Johns Hopkins, 1989.
[13] L. Hollermann, T. Hsu, D. Lopez, and K. Vertanen.
Scheduling problems in a practical allocation model. J.
Combinatorial Optimization, 1(2):129–149, 1997.
[14] T. Hsu, J. C. Lee, D. Lopez, and W. Royce. Task
allocation on a network of processors. IEEE Trans.
Computers, 49(12):1339–1353, 2000.
[15] K. Hwang and Z. Xu. Scalable Parallel Computing.
McGraw-Hill, 1998.
[16] M. Maheswaran and H. J. Siegel. A dynamic match-
ing and scheduling algorithm for heterogeneous com-
puting systems. In Seventh Heterogeneous Computing
Workshop. IEEE Computer Society Press, 1998.
[17] M. Norman and P. Thanisch. Models of machines and
computation for mapping in multicomputers. ACM
Computing Surveys, 25(3):103–117, 1993.
[18] H. Oh and S. Ha. A static scheduling heuristic
for heterogeneous processors. In Proceedings of Eu-
ropar’96, volume 1123 of LNCS, Lyon, France, Aug.
1996. Springer Verlag.
[19] J. Orduna, F. Silla, and J. Duato. A new task mapping
technique for communication-aware scheduling strate-
gies. In T. Pinkston, editor, Workshop for Schedul-
ing and Resource Management for Cluster Computing
(ICPP’01), pages 349–354. IEEE Computer Society,
2001.
[20] C. Roig, A. Ripoll, M. Senar, F. Guirado, and
E. Luque. Improving static scheduling using inter-task
concurrency measures. In T. Pinkston, editor, Workshop
for Scheduling and Resource Management for
Cluster Computing (ICPP’01), pages 375–381. IEEE
Computer Society, 2001.
Figure 12. Comparison of HEFT and ILHA for
the STENCIL problem, with 10 processors and
a communication cost equal to 10.
[Plot: ratio (execution time)/(sequential time), from 2.3 to 2.75, against problem size, from 100 to 500; curves: ILHA, HEFT.]
[21] G. Sih and E. Lee. A compile-time scheduling heuristic
for interconnection-constrained heterogeneous proces-
sor architectures. IEEE Transactions on Parallel and
Distributed Systems, 4(2):175–187, 1993.
[22] O. Sinnen and L. Sousa. Comparison of contention-
aware list scheduling heuristics for cluster comput-
ing. In T. Pinkston, editor, Workshop for Schedul-
ing and Resource Management for Cluster Computing
(ICPP’01), pages 382–387. IEEE Computer Society,
2001.
[23] O. Sinnen and L. Sousa. Exploiting unused time-
slots in list scheduling considering communication con-
tention. In R. Sakellariou, J. Keane, J. Gurd, and
L. Freeman, editors, EuroPar’2001 Parallel Process-
ing, pages 166–170. Springer-Verlag LNCS 2150, 2001.
[24] O. Sinnen and L. Sousa. Scheduling task graphs
on arbitrary processor architectures considering con-
tention. In High Performance Computing and Net-
working, pages 373–382. Springer-Verlag LNCS 2110,
2001.
[25] M. Tan, H. Siegel, J. Antonio, and Y. Li. Minimizing
the application execution time through scheduling of
subtasks and communication traffic in a heterogeneous
computing system. IEEE Transactions on Parallel and
Distributed Systems, 8(4):857–871, 1997.
[26] H. Topcuoglu, S. Hariri, and M.-Y. Wu. Task schedul-
ing algorithms for heterogeneous processors. In Eighth
Heterogeneous Computing Workshop. IEEE Computer
Society Press, 1999.
[27] T. Yang and A. Gerasoulis. DSC: Scheduling parallel
tasks on an unbounded number of processors. IEEE
Trans. Parallel and Distributed Systems, 5(9):951–967,
1994.
Biographies
Olivier Beaumont was born in 1970 in Saint-Etienne.
He obtained a PhD from the Université
de Rennes in January 1999. He is currently an associate
professor in the Computer Science Laboratory LaBRI
in Bordeaux. His main research interests are parallel
algorithms on distributed memory architectures.
Vincent Boudet was born in 1974 in Nogent sur
Marne, France. He obtained a PhD from ENS
Lyon in December 2001. He is currently a post-doctoral
researcher in the Computer Science Laboratory LIP at
ENS Lyon. He is mainly interested in algorithm design
and in compilation-parallelization techniques for
distributed memory architectures.
Yves Robert was born in 1958 in Lyon, France. He
obtained a PhD from Institut National Polytechnique
de Grenoble in January 1986. He is currently a
full professor in the Computer Science Laboratory LIP
at ENS Lyon. He is the author of three books, more
than 80 papers published in international journals, and
more than 90 papers published in international conferences.
His main research interests are parallel algorithms
for distributed memory architectures and automatic
compilation/parallelization techniques. He is a
member of ACM and IEEE, and an associate editor for
IEEE Trans. Parallel and Distributed Systems.
Appendix
In this section, we prove that scheduling the
communications in a bipartite graph, after having allocated
the tasks to the processors, is an NP-complete problem.
The link with ILHA is obvious: one subset of the nodes
represents the B ready tasks which are currently under
examination, and the other subset represents their
parents; communication links are directed from the latter
to the former.
Deﬁnition 2 COMM-SCHED(G,P,T): Given a bi-
partite graph G of task nodes V = V1 ∪ V2, a ﬁnite set
P of same-speed processors connected through a fully
homogeneous network, such that each task node is as-
signed a processor number in P, and edges from V1
to V2 are assigned a communication cost, and given
a time-bound T, is there a valid schedule σ whose
makespan is not greater than T?
Theorem 2 The COMM-SCHED(G,P,T) decision
problem is NP-complete.
Proof As for the FORK-SCHED problem, we use a
reduction from 2-PARTITION. We start with an ar-
bitrary instance of 2-PARTITION, i.e. a set A =
{a1,...,an} of n integers. We have to polynomially
transform this instance into an instance of the COMM-
SCHED problem which has a solution iﬀ the original
instance of 2-PARTITION has a solution, i.e. iff there
exists a partition of {1,...,n} into two subsets A1 and
A2 such that $\sum_{i \in A_1} a_i = \sum_{i \in A_2} a_i$.
We let $2S = \sum_{i=1}^{n} a_i$ (if the sum is odd, there is no
solution to the instance of 2-PARTITION). We construct
the following instance of COMM-SCHED:
• there are 3n + 1 tasks: a fork-graph with parent v0
and children v1, v2, ..., vn, and n separate pairs of tasks
(v2n+1, vn+1), (v2n+2, vn+2), ..., (v3n, v2n); each
pair has an edge v2n+i −→ vn+i (see Figure 13)
• there are 2n + 1 processors P0, P1, ..., P2n of same
speed: ti = 1, 0 ≤ i ≤ 2n
• task v0 is assigned to P0 and for 1 ≤ i ≤ n, tasks
vi and vn+i are assigned to processor Pi
• for 1 ≤ i ≤ n, task v2n+i is assigned to processor
Pn+i
• all computation times are equal to zero: w(vi) =
0, 0 ≤ i ≤ 3n
• communication times: for 1 ≤ i ≤ n, di =
data(v0, vi) = ai and data(v2n+i, vn+i) = S
• homogeneous network: link(Pi, Pj) = 1 if i ≠ j
• time bound: T = 2S (the total length of the n
messages that P0 must send one after the other)
Figure 13. The instance of COMM-SCHED.
[Diagram: v0 on P0 sends a message of length ai to each vi on Pi; each v2n+i on Pn+i sends a message of length S to vn+i on Pi.]
Clearly, the size of the constructed instance of
COMM-SCHED is polynomial (even linear) in the size
of the original instance of 2-PARTITION.
Assume that the original instance of 2-PARTITION
admits a solution: let A1 and A2 be a partition of
{1,...,n} such that $\sum_{i \in A_1} a_i = \sum_{i \in A_2} a_i = S$. We
derive a schedule for the instance of COMM-SCHED
as follows:
• Processor P0 sends its messages to the nodes vi such
that i ∈ A1 (in any order), starting at time-step t = 0;
then, at time-step S, it sends its messages to the nodes
vi such that i ∈ A2 (in any order)
• Processor Pn+i sends its message to processor Pi
at time-step t = 0 if i ∈ A2, and at time-step t = S
if i ∈ A1, so that Pi never receives two messages at
the same time
• If i ∈ A1, processor Pi first executes vi, as soon as
it has received the data from P0, which happens no
later than time-step $S = \sum_{i \in A_1} d_i$. Then it executes
task vn+i, once the data from Pn+i has arrived
• If i ∈ A2, processor Pi first executes task vn+i;
then it executes task vi, as soon as it has received
the data from P0
Therefore, we have derived a valid scheduling that
matches the time-bound, hence a solution to the
COMM-SCHED instance.
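The construction of this schedule can be checked mechanically. The Python sketch below is an illustration under stated assumptions, not the authors' code: it takes the time bound to be 2S (the total length of the messages sent by P0), places the length-S message from Pn+i to Pi in whichever half of [0, 2S] is not used by the message from P0, and ignores the zero-length computations. The function names are hypothetical.

```python
def build_schedule(a, A1):
    """Build the communications of the schedule derived from a 2-PARTITION
    solution. `a` lists a_1..a_n (positive integers) and A1 is a set of
    1-based indices whose values sum to S. Each communication is a tuple
    (sender, receiver, start, end), processors being numbered 0..2n.
    Returns the list of communications and the assumed time bound 2S."""
    n = len(a)
    S, odd = divmod(sum(a), 2)
    A1 = set(A1)
    if odd or sum(a[i - 1] for i in A1) != S:
        raise ValueError("A1 is not a solution of 2-PARTITION")
    A2 = [i for i in range(1, n + 1) if i not in A1]
    comms = []
    t = 0
    # P0 sends to the nodes of A1 during [0, S], then to those of A2.
    for i in sorted(A1) + A2:
        comms.append((0, i, t, t + a[i - 1]))
        t += a[i - 1]
    # P_{n+i} uses the half of [0, 2S] in which P_i hears nothing from P0.
    for i in range(1, n + 1):
        start = S if i in A1 else 0
        comms.append((n + i, i, start, start + S))
    return comms, 2 * S


def respects_one_port(comms):
    """Check that no processor sends two messages at the same time and that
    no processor receives two messages at the same time."""
    for role in (0, 1):  # 0: index of the sender, 1: index of the receiver
        busy = {}
        for comm in comms:
            busy.setdefault(comm[role], []).append((comm[2], comm[3]))
        for intervals in busy.values():
            intervals.sort()
            for (_, end1), (start2, _) in zip(intervals, intervals[1:]):
                if start2 < end1:
                    return False
    return True


a = [3, 1, 1, 2, 2, 1]  # sum is 10, so S = 5
comms, T = build_schedule(a, {1, 4})  # a_1 + a_4 = 5
print(respects_one_port(comms), max(end for *_, end in comms) <= T)
```

With these assumptions both checks succeed, whereas letting every Pn+i send at time-step 0 would make each Pi with i ∈ A1 receive two messages at once.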
Reciprocally, assume that the COMM-SCHED instance
admits a solution, i.e. a valid schedule σ that
matches the time-bound T. Then P0 sends all its n
messages without any idle-time. If at time-step S
processor P0 is in the middle of an emission, say that of
message number j, then processor Pj has not enough
time to receive the data from Pn+j for task vn+j, either
before or after receiving the message from P0. Hence
at time-step S, P0 is just completing the sending of one
message: the messages sent before time-step S have total
length S, hence a solution to 2-PARTITION.
For the reader worried about execution times
equal to 0, it is not difficult to modify
the construction to derive an instance with execution
times equal to 1 (but this slightly complicates the
proof, hence our choice).