Scheduling independent tasks on multi-cores with GPU accelerators by Bleuse, Raphaël et al.
Scheduling independent tasks on multi-cores with GPU
accelerators
Raphaël Bleuse, Safia Kedad-Sidhoum, Florence Monna, Grégory Mounié,
Denis Trystram
To cite this version:
Raphaël Bleuse, Safia Kedad-Sidhoum, Florence Monna, Grégory Mounié, Denis Trystram.
Scheduling independent tasks on multi-cores with GPU accelerators. Concurrency and
Computation: Practice and Experience, Wiley, 2015, 27 (6), pp.1625-1638. <10.1002/cpe.3359>.
<hal-01081625>
HAL Id: hal-01081625
https://hal.archives-ouvertes.fr/hal-01081625
Submitted on 10 Nov 2014
SCHEDULING INDEPENDENT TASKS ON MULTI-CORES
WITH GPU ACCELERATORS
RAPHAËL BLEUSE$^{1}$, SAFIA KEDAD-SIDHOUM$^{2}$, FLORENCE MONNA$^{1,2}$,
GRÉGORY MOUNIÉ$^{2}$, AND DENIS TRYSTRAM$^{2,3}$
Abstract. More and more computers use hybrid architectures combining
multi-core processors and hardware accelerators such as GPUs (Graphics Processing
Units). We present in this paper a new method for efficiently scheduling
parallel applications on m CPUs and k GPUs, where each task of the application
can be processed either on a core (CPU) or on a GPU. The objective
is to minimize the maximum completion time (makespan). The corresponding
scheduling problem is NP-hard; we propose an efficient approximation algorithm
which achieves an approximation ratio of $\frac{4}{3} + \frac{1}{3k}$. We first detail and
analyze the method, based on a dual approximation scheme, that uses dynamic
programming to balance the load evenly between the heterogeneous resources.
Then, we present a faster approximation algorithm for a special case of the
previous problem, where all the tasks are accelerated when assigned to a GPU,
with a performance guarantee of $\frac{3}{2}$ for any number of GPUs. We run some
simulations based on realistic benchmarks and compare the solutions obtained
by a relaxed version of the generic method to the ones provided by a classical
scheduling algorithm (HEFT). Finally, we present an implementation of the
4/3-approximation and its relaxed version on a classical linear algebra kernel
in the scheduler of the xKaapi runtime system.
1. Introduction
Most of the computing systems available today include parallel multi-core chips
sharing a large memory with additional hardware accelerators [1]. There is an
increasing complexity within the internal nodes of such parallel systems, mainly due
to the heterogeneity of the computational resources. In order to take advantage of
the benefits offered by heterogeneity in terms of performance, effective and
automatic management of the hybrid resources will be more and more important for
running any application. These new hybrid architectures have given rise to new
scheduling problems, which consist in allocating and sequencing the computations on
the different resources such that a given objective is optimized.
The main challenge is to create adequate generic methods and software tools that
fulfill the requirements for optimizing performance. In the field of High Performance
Computing (HPC), one of the most studied optimization problems is the
minimization of the maximum completion time (makespan) of the schedule, which
is the objective addressed in this paper. Some Polynomial Time Approximation
Schemes (PTAS) exist for problems of minimizing the makespan on heterogeneous
processors [2, 3], but their running times make them impractical for solving
scheduling problems on actual computing platforms.
1 Univ. Grenoble-Alpes, LIG, F-38000 Grenoble, France; CNRS, LIG, F-38000 Grenoble,
France; Inria
2 Sorbonne Universités, UPMC Univ. Paris 06, UMR 7606, LIP6, F-75005, Paris, France
3 Institut Universitaire de France
Key words and phrases. Scheduling; Approximation algorithms; Parallel heterogeneous
systems.
This work was partially supported by the French Ministry of Defense (DGA).
In the field of parallel processing, there is a huge number of papers dealing
with implementations of ad hoc algorithms using GPUs or hybrid architectures.
They span several aspects of parallelism, from operating systems and runtimes
to application implementations and languages. However, only a few of them focus on the
intermediate problem of scheduling on hybrid platforms [4]. Most of the works in
the literature study the gains and performance of parallel implementations
of some specific numerical kernels [5, 6] or of specific applications like multiple
alignments of biological sequences [7] or molecular dynamics [8]. The existing
scheduling algorithms and tools are usually not well suited for general-purpose
applications, since the internal hardware organization of a GPU differs greatly from
that of a CPU; thus, the GPU should be considered as a new type of resource in order
to determine efficient approaches. Scheduling is usually done on a case-by-case
basis and often offers good performance; however, it lacks high-level mechanisms
that provide transparent and efficient schedules for any application. Some current
runtime systems, like OMPSS [9], StarPU [10] or xKaapi [11], include the basic
mechanisms for developing scheduling algorithms. Several scheduling algorithms have
been implemented on top of these systems. An online algorithm with a performance
guarantee [12] has recently been developed for CPU-GPU platforms, but there is
no performance guarantee for any offline problem on these systems.
Our objective within this work is to propose a new method for scheduling independent
tasks on hybrid CPU-GPU architectures designed for HPC. The considered
input is a set of independent sequential tasks whose execution times are known. This
hypothesis is realistic, since some computing systems such as StarPU have a module
which estimates the execution times at compile time. The method that we propose
in this work determines the allocation and schedule of the tasks to the computing
units, CPUs and GPUs. We present and analyze in detail this methodology for the
case of scheduling n sequential tasks on m cores (CPUs) and k GPUs. This leads
to an efficient approximation algorithm which achieves a ratio of
$\frac{4}{3} + \frac{1}{3k} + \epsilon$ using
dual approximation [13] with a dynamic programming scheme. The computational
cost of this algorithm is in $O(n^2 k^3 m^2)$ per step of dual approximation. We also
present an approximation algorithm for the more specific problem of scheduling, on
the same hybrid platform, a set of tasks that are all accelerated when processed on
a GPU, although they can have different accelerations. This algorithm also relies
on the dual approximation technique and achieves a $\frac{3}{2}$ performance ratio with a time
complexity in $O(mn \log n)$. As the method with a $\frac{4}{3}$ approximation ratio is rather
costly, with a view to an integration into actual runtime systems, we derive
a relaxed algorithm and compare it experimentally with one of the most popular
algorithms (HEFT [14]), as well as with the approximation algorithm presented for
the case where all the tasks are accelerated on GPUs.
The outline of the paper is as follows. A formal description of the scheduling
problem with k GPUs is provided in Section 2, and some related works are pre-
sented. We propose in Section 3 a new approach for solving the problem. We
propose in Section 4 an algorithm for the scheduling problem with k GPUs where
all tasks are accelerated when processed on GPU. We report the results of exper-
iments in Section 5 where a relaxed version of the method presented in Section 3
and the algorithm from Section 4 are compared to the classical HEFT algorithm on
simulations built from realistic workloads. The experimental analysis shows that
the proposed methods have a more stable behavior than HEFT for a similar
performance. Then, an implementation of both the generic method and its relaxed
version on a real runtime system has been realized and tested on a classical linear
algebra kernel. Finally, a synthesis is provided in Section 6, which opens some
perspectives.
Some of this work was presented at the HeteroPar 2013 conference [15], but the
approximation algorithm for the specific problem of scheduling a set of tasks that
are all accelerated when processed on a GPU is an entirely new contribution, and
the experiments were extended in comparison to the original paper, where only
partial simulations for the main algorithm were presented. In practice, the cost of
the $\frac{4}{3}$ algorithm is too high since the number of independent tasks is too low
in the considered kernel. The relaxed 2-approximation is shown to be as good as
HEFT, but provides a more stable behavior and a lower volume of communications.
2. Problem Definition and Related Works
We consider a multi-core parallel platform composed of m identical CPUs and
k identical GPUs. The m CPUs are considered independent from the GPUs, which
are commanded by some extra driving CPUs, not mentioned here because they do
not execute any task. An application is composed of n independent tasks denoted
by $T_1, \dots, T_n$. The tasks are considered sequential, meaning that each of them is
processed on only one processor of either type, CPU or GPU. Each of these tasks has two
processing times depending on which type of processor it is allocated to. The
processing time of task $T_j$ is denoted by $\overline{p_j}$ if $T_j$ is processed on a CPU and
$\underline{p_j}$ if it is processed on a GPU. The acceleration factor of task $T_j$ is given
by the ratio $\overline{p_j} / \underline{p_j}$.
We assume that both processing times of a task are known in advance, as is commonly
assumed. An accurate estimation can be obtained at compile time for regular
numerical applications in HPC. The objective here is to minimize the makespan
$C_{\max}$ of the whole schedule. It is defined as the maximum completion time of the
last finishing task on both CPUs and GPUs, $C_{\max} = \max(C_{\max}^{CPU}, C_{\max}^{GPU})$. The
problem will be denoted as $(Pm, Pk) \,||\, C_{\max}$.
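As an illustration of this objective, the following minimal sketch computes the makespan of a given allocation. The function name `makespan` and the encoding of an allocation as `("cpu", i)` or `("gpu", i)` pairs are assumptions made for this example only; since the tasks are independent, the completion time of a processor is simply the sum of the processing times of its tasks.

```python
def makespan(p_cpu, p_gpu, allocation, m, k):
    """Makespan of an allocation of independent tasks on m CPUs and k GPUs.

    p_cpu[j], p_gpu[j]: processing times of task j on a CPU / on a GPU.
    allocation[j]: ("cpu", i) with 0 <= i < m, or ("gpu", i) with 0 <= i < k
                   (hypothetical encoding chosen for this sketch).
    """
    cpu_load = [0.0] * m
    gpu_load = [0.0] * k
    for j, (kind, i) in enumerate(allocation):
        if kind == "cpu":
            cpu_load[i] += p_cpu[j]   # task j runs with its CPU time
        else:
            gpu_load[i] += p_gpu[j]   # task j runs with its GPU time
    # C_max is the larger of the CPU-side and GPU-side completion times
    return max(max(cpu_load), max(gpu_load))
```

For instance, with CPU times `[4, 3, 6]` and GPU times `[2, 3, 1]` on one CPU and one GPU, moving the first and last tasks to the GPU balances both sides at a load of 3.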
If both processing times are equal ($\overline{p_j} = \underline{p_j}$) for all $j = 1, \dots, n$, the
problem $(P1, P1) \,||\, C_{\max}$ is equivalent to the classical $P2 \,||\, C_{\max}$ problem, which is
NP-hard [16]. Thus, the problem of scheduling with GPUs is also NP-hard. Our
objective is to propose efficient approximation algorithms with a performance
guarantee. Recall that for a given problem, the approximation ratio $\rho_A$ of an algorithm
A solving this problem is defined as the maximum over all the instances I of the
ratio $\frac{f(I)}{f^*(I)}$, where f is any minimization objective and $f^*$ is its optimal value [13].
The problem considered here is a new problem, harder than the classical scheduling
problem on uniform machines $Q \,||\, C_{\max}$, but better results can be expected if
the problem is considered on its own rather than as an unrelated scheduling problem
$R \,||\, C_{\max}$ [17]. Lenstra et al. [18] proposed a PTAS for the problem $R \,||\, C_{\max}$
with running time bounded by the product of $(n+1)^{m/\epsilon}$ and a polynomial of the
input size. Let us notice that if m is not fixed, then the algorithm is not fully
polynomial. The authors also proved that unless P = NP, there is no polynomial-time
approximation algorithm for $R \,||\, C_{\max}$ with an approximation factor less than
$\frac{3}{2}$, and they presented a 2-approximation algorithm. This algorithm is based on
rounding an optimal solution of the preemptive version of the problem obtained
by an integer linear program. Shmoys and Tardos [19] generalized this technique
to obtain the same approximation factor for the generalized assignment problem.
Furthermore, they generalized the rounding technique to hold for any fractional
solution. Recently, Shchepin and Vakhania [20] introduced a new rounding technique
which yields an improved approximation factor of $2 - \frac{1}{m}$ for $R \,||\, C_{\max}$ for a
time complexity similar to that of [18]. To our knowledge, this is so far the best
approximation result for this problem. However, the prohibitive computational cost
of these algorithms prevents their usage on actual computing platforms.
It is worth noticing that if all the tasks of the problem have the same acceleration
on the GPUs, the problem reduces to the classical $Q \,||\, C_{\max}$ problem
with two machine speeds. For $Qm \,||\, C_{\max}$, Friesen [21] proved that the
approximation ratio of the well-known Longest Processing Time (LPT) scheduling
policy satisfies $1.52 \le C_{\max}^{LPT} / C_{\max}^{*} \le 5/3$. The first PTAS for $Q \,||\, C_{\max}$ was
given by Hochbaum and Shmoys [2]. The overall running time of the algorithm is
$O((\log m + \log(3/\epsilon))(m/\epsilon)(n/\epsilon)^{1/\epsilon})$. However, these solving methods would only
work for specific instances of the problem of scheduling on hybrid platforms, where
the acceleration factors $\overline{p_j} / \underline{p_j}$ (CPU processing time over GPU processing time)
would be equal to a constant, which is not the case in
practice.
Another direction is to consider the problem of scheduling on unrelated machines
of few different types. Indeed, the $R \,||\, C_{\max}$ reference problem can be refined to
better fit the constraints of the hybrid platforms. Bonifaci and Wiese [3] presented
a PTAS to solve a scheduling problem with unrelated machines of few different
types. The tools used in their solving method are somewhat similar to the ones
used for solving $R \,||\, C_{\max}$, and the rounding phases of the algorithm require a
significant amount of time, raising the time complexity of the algorithm to an
impractical level, even when only two types of machines are considered, as is the
case for $(Pm, Pk) \,||\, C_{\max}$. There is a need to consider algorithms other than these
PTAS in order to design algorithms that could be implemented on actual platforms. A PTAS
with a reasonable time complexity has been developed for the online version of the
problem of the assignment of sporadic tasks on hybrid platforms [22]. However,
an offline version of the problem with non-periodic tasks has not been studied and
the algorithm cannot be trivially extended to the problem $(Pm, Pk) \,||\, C_{\max}$. On
the other hand, Imreh [23] presented different greedy algorithms for the problem of
scheduling on two sets of identical machines, with varying approximation ratios
including $2 + \frac{m-1}{k}$ and $4 - \frac{2}{m}$ (for m CPUs and k GPUs). These algorithms are fast
enough to be implemented on modern platforms; nevertheless, their approximation
ratios are quite high since the number of GPUs is usually much
lower than the number of CPUs.
From a practical perspective, the scheduling strategy is a key point for the per-
formance of an application. Tuning scheduling algorithms for a specific case (prob-
lem and computer architecture) is common. Fast heuristics without performance
guarantee are often used on computing platforms, time efficiency being the crucial
factor. However, simple strategies are not sufficient to guarantee the performance
for more general cases potentially far from the specific one. The performance porta-
bility is difficult to achieve when the number of CPUs and GPUs varies or when
the speedup of the various parts of the application is evolving during the execution.
Our objective within this work is to build a bridge between purely theoretical
algorithms with good performance guarantees and practical low cost heuristics.
Thus, we propose a tradeoff solution with a provable performance guarantee and a
reasonable time complexity.
3. A New Algorithm
3.1. Rationale of the solving method. The proposed algorithm is based on the
dual approximation technique [13]. A g-dual approximation algorithm for a generic
problem takes a real number λ (guess) as an input, assumes that there exists a
schedule of length at most λ and either delivers a schedule of makespan at most
gλ, or answers correctly that there exists no schedule of length at most λ. A binary
search is used to try different guesses to approach the optimal makespan as follows:
we first take an initial lower bound Bmin and an initial upper bound Bmax of the
optimal makespan. We start by solving the problem with a λ equal to the average
of these two bounds and then the bounds are updated as follows:
• If the algorithm returns “NO”, then λ becomes the new lower bound.
• If the algorithm returns a schedule of makespan at most gλ, then there
exists a schedule of makespan at most λ and λ becomes the new upper
bound.
The number of iterations of the binary search is bounded by $\log_2(B_{\max} - B_{\min})$.
Hence, a g-dual approximation algorithm can be converted, by bisection search, into
a $g(1 + \epsilon)$-approximation algorithm with a similar running time.
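The bisection described above can be sketched as follows. This is only an illustration under stated assumptions: the name `dual_approximation_search`, the `oracle` callback (which stands for any g-dual approximation step and returns a schedule or `None` for "NO") and the absolute stopping tolerance `tol` are hypothetical and not part of the original method description.

```python
def dual_approximation_search(oracle, b_min, b_max, tol):
    """Bisection over the guess lambda for a g-dual approximation step.

    oracle(lam) is assumed to return a schedule of makespan at most g*lam,
    or None when it can certify that no schedule of length <= lam exists.
    b_min / b_max are a lower / upper bound on the optimal makespan.
    tol > 0 is an absolute stopping tolerance (an assumption of this sketch).
    """
    best = None
    while b_max - b_min > tol:
        lam = (b_min + b_max) / 2      # current guess
        schedule = oracle(lam)
        if schedule is None:
            b_min = lam                # "NO": lambda becomes the new lower bound
        else:
            best = schedule
            b_max = lam                # success: lambda becomes the new upper bound
    return best, b_max


# Toy oracle for illustration: pretend the optimal makespan is 7.
def toy_oracle(lam):
    return ("some schedule", lam) if lam >= 7.0 else None

best, bound = dual_approximation_search(toy_oracle, 0.0, 16.0, 1e-3)
```

With the toy oracle, the returned upper bound ends within `tol` of the threshold 7, which mirrors how the guess approaches the optimal makespan.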
We target $g = \frac{4}{3} + \frac{1}{3k}$. Let λ be the current real number input (guess) for the
dual approximation. The key point is to show how it is possible to build a schedule
of length at most $\frac{4\lambda}{3} + \frac{\lambda}{3k}$, starting from the assumption that there exists a schedule
of length lower than λ.
The idea is to partition the set of tasks on the CPUs into two sets, each consisting
of two shelves: a first set with a shelf of length λ and the other of length $\frac{\lambda}{3}$, and a
second set with two shelves of length $\frac{2\lambda}{3}$. The partition ensures that the makespan
on the CPUs is lower than $\frac{4\lambda}{3}$. The same partition can be applied to the set of tasks
on the GPUs, with the smallest shelf of length $\frac{\lambda}{3} + \frac{\lambda}{3k}$ instead of $\frac{\lambda}{3}$. Since the tasks
are independent, the scheduling strategy is straightforward when the assignment of
the tasks has been determined and yields directly a solution of length at most $\frac{4\lambda}{3}$
on the CPUs (resp. $\frac{4\lambda}{3} + \frac{\lambda}{3k}$ on the GPUs).
The main problem is to assign the tasks in each shelf on the CPUs or on the GPUs in
order to obtain a feasible solution. This will be done using dynamic programming.
The main steps are summarized in the following algorithmic scheme:
(1) Compute the guess $\lambda = \frac{B_{\min} + B_{\max}}{2}$, where $B_{\min}$ (resp. $B_{\max}$) is a lower
(resp. upper) bound of the optimal makespan.
(2) Search for an allotment of the tasks such that:
• the total load (work) on CPUs is at most mλ,
• the total load (work) on GPUs is at most kλ,
• the tasks allotted to the CPUs (resp. GPUs) whose processing time is
strictly greater than $\frac{2\lambda}{3}$ occupy a maximum number of CPUs (resp.
GPUs) denoted by µ (resp. κ),
• the tasks allotted to the CPUs (resp. GPUs) whose processing time
is strictly greater than $\frac{\lambda}{3}$ and lower than $\frac{2\lambda}{3}$ can be assigned two by
two to a maximum number of CPUs (resp. GPUs) denoted by µ′/2
(resp. κ′/2).
The total number of CPUs (resp. GPUs) must not exceed m (resp.
k), i.e. µ + µ′/2 ≤ m (resp. κ + κ′/2 ≤ k),
• the tasks assigned to the CPUs (resp. GPUs) with processing time
lower than $\frac{\lambda}{3}$ can be scheduled such that the induced makespan on the
CPUs (resp. GPUs) will be at most equal to $\frac{4\lambda}{3}$ (resp. $\frac{4\lambda}{3} + \frac{\lambda}{3k}$).
(3) If such an allotment does not exist, adjust the bound $B_{\min}$ to λ and restart
the process (Step 1).
(4) If such an allotment exists, build the corresponding schedule with sets of
shelves such that the makespan is lower than $\frac{4}{3}\lambda$, adjust the bound $B_{\max}$
to λ and restart the process.
3.2. Structure of an Optimal Schedule of length at most λ. We introduce
an allocation function π(j) of a task Tj which corresponds to the processor where
the task is processed. The set C (resp. G) is the set of all the CPUs (resp. GPUs).
Therefore, if a task Tj is assigned to a CPU, we can write π(j) ∈ C. We define
$W_C$ as the computational area of the CPUs on the Gantt chart representation of
a schedule, i.e. the sum of all the processing times of the tasks allocated to the
CPUs: $W_C = \sum_{j \,:\, \pi(j) \in C} \overline{p_j}$, where $\overline{p_j}$ is the processing time of $T_j$ on a CPU.
This corresponds to the computational load of all the
CPUs.
To take advantage of the dual approximation paradigm, we have to make explicit
the consequences of the assumption that there exists a schedule of length at most
λ. We state below some straightforward properties of such a schedule. They should
give the insight for the construction of the solution.
Property 1. In an optimal solution, the execution time of each task is at most λ,
and the computational area on the CPUs (resp. GPUs) is at most mλ (resp. kλ).
Property 2. In an optimal solution, if there exist two tasks executed on the same
processor such that one of these tasks has an execution time greater than $\frac{2\lambda}{3}$, then
the other one has an execution time lower than $\frac{\lambda}{3}$.
Property 3. Two tasks with processing times on CPU (resp. GPU) greater than
$\frac{\lambda}{3}$ and lower than $\frac{2\lambda}{3}$ can be executed successively on the same CPU (resp. GPU)
within a time at most $\frac{4\lambda}{3}$.
The basic idea of the solution that we propose comes from the analysis of the
shape of an optimal schedule. From Property 2, the tasks whose execution times
on CPU (resp. on GPU) are strictly greater than $\frac{2\lambda}{3}$ do not use more than m
CPUs (resp. k GPUs), and hence can be executed concurrently in the first set in
a shelf denoted by S1 (resp. S5). We denote by µ (resp. κ) the number of CPUs
(resp. GPUs) executing these tasks. These shelves are represented for the CPUs in
Figure 1.
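[Figure 1 omitted: Gantt chart of the m CPUs with time marks 0, λ/3, 2λ/3, λ and 4λ/3, showing shelves S1 and S2 on µ CPUs and shelves S3 and S4 on the other CPUs.]
Figure 1. Partitioning the set of tasks on the CPUs into two sets
of two shelves, the first one occupying µ CPUs, the second m − µ
CPUs.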
The tasks whose execution times are lower than $\frac{2\lambda}{3}$ and strictly greater than
$\frac{\lambda}{3}$ on CPU (resp. on GPU) cannot be executed on the µ CPUs (resp. κ GPUs)
occupied by S1 (resp. S5) from Property 2. Moreover, from Property 3, 2(m − µ)
(resp. 2(k − κ)) of these tasks on CPU (resp. GPU) can be executed in time at
most $\frac{4\lambda}{3}$ on the remaining (m − µ) CPUs (resp. (k − κ) GPUs) in the second set
and fill two shelves S3 and S4 (resp. S7 and S8) of equal length $\frac{2\lambda}{3}$.
The tasks remaining to be assigned to the CPUs (resp. GPUs) have a processing
time lower than $\frac{\lambda}{3}$. The µ (resp. κ) longest remaining tasks are assigned to the
first set on the CPUs (resp. GPUs) in another shelf denoted by S2 (resp. S6). The
length of S2 is $\frac{\lambda}{3}$ and S6 has a length of $\frac{\lambda}{3} + \frac{\lambda}{3k}$.
$W_L$ (resp. $W_R$) will denote the computational area on the CPUs (resp. GPUs)
remaining idle after this allocation in the schedule of length $\frac{4\lambda}{3} + \frac{\lambda}{3k}$. $W_L$ corresponds
to the hatched area in Figure 1. Regarding the question of how the remaining tasks
fit in the constructed schedule, we state the following lemma:
Lemma 1. The tasks remaining to be assigned on the CPUs after the construction
of S1, S2, S3, S4 fit in the remaining free computational space $W_L$ between these
shelves.
Proof. The tasks remaining to be assigned after the construction of S1, . . . , S4 all
have a processing time lower than $\frac{\lambda}{3}$ by construction and they necessarily fit into
the remaining computational space $W_L$, otherwise the schedule would not satisfy
Property 1. The following algorithm can be used to schedule these tasks:
• Consider the remaining tasks ordered by decreasing order of processing
time on CPU $T_1, \dots, T_f$, f being the total number of tasks remaining to be
allocated.
• At each step i, i = 1, . . . , f, allocate task $T_i$ to the least loaded processor,
at the latest possible date. Update its load.
At each step, the least loaded processor has a load at most λ; otherwise it would
contradict the fact that the total work area of the tasks is bounded by mλ (according
to Property 1). Hence, the idle time interval on the least loaded CPU has a length
at least equal to $\frac{\lambda}{3}$ and can contain the task $T_i$, which proves the correctness of the
scheduling algorithm. □
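The greedy rule used in this proof can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `fill_with_small_tasks` and its arguments are hypothetical, and the sketch only tracks processor loads (it does not compute the "latest possible date" at which each task is placed).

```python
import heapq

def fill_with_small_tasks(initial_loads, small_tasks):
    """Assign each remaining (small) task to the currently least loaded CPU.

    initial_loads[i]: load of CPU i once the shelves are built (assumed input).
    small_tasks: CPU processing times of the remaining tasks.
    Returns the list of (processing_time, cpu_index) choices and the final loads.
    """
    heap = [(load, i) for i, load in enumerate(initial_loads)]
    heapq.heapify(heap)
    placement = []
    # Tasks are considered by decreasing processing time, as in the proof
    for t in sorted(small_tasks, reverse=True):
        load, i = heapq.heappop(heap)          # least loaded processor
        placement.append((t, i))
        heapq.heappush(heap, (load + t, i))    # update its load
    loads = [0] * len(initial_loads)
    for load, i in heap:
        loads[i] = load
    return placement, loads


# Example with lambda = 30 and m = 2: CPU 0 holds one long task (25),
# CPU 1 holds two medium tasks (15 + 15); two small tasks remain.
lam = 30
placement, loads = fill_with_small_tasks([25, 30], [2, 3])
```

In this example the total work equals mλ = 60, both small tasks land on CPU 0, and every load stays below 4λ/3 = 40.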
The question of how the remaining tasks to be assigned on GPUs fit in the
constructed schedule will be addressed later in the paper (see Lemma 2).
3.3. Partitioning the Tasks into Shelves. In this section, we detail how to
fill the shelves (see Figure 1) on the CPUs (resp. GPUs) by specifying an initial
assignment of the tasks to the processors.
In order to obtain a 2-sets and 4-shelves schedule on the CPUs (resp. GPUs),
we look for an assignment satisfying the following constraints:
• (C1) The total computational area $W_C$ on the CPUs is at most mλ.
• (C2) The set $\mathcal{T}_1$ of tasks on the CPUs with an execution time strictly greater
than $\frac{2\lambda}{3}$ in the allotment, to be scheduled in S1, uses a total of at most m
processors. We still denote by µ the number of processors they use.
• (C3) The set $\mathcal{T}_2$ of tasks on the CPUs with an execution time lower than
$\frac{2\lambda}{3}$ and strictly greater than $\frac{\lambda}{3}$ in the allotment, to be scheduled in S3 or
S4, uses a total of at most 2(m − µ) processors.
• (C4) The total computational area on the GPUs is at most kλ.
• (C5) The set $\mathcal{T}_3$ of tasks on the GPUs with an execution time strictly greater
than $\frac{2\lambda}{3}$ in the allotment, to be scheduled in S5, uses a total of at most k
processors. We still denote by κ the number of processors they use.
• (C6) The set $\mathcal{T}_4$ of tasks on the GPUs with an execution time lower than
$\frac{2\lambda}{3}$ and strictly greater than $\frac{\lambda}{3}$ in the allotment, to be scheduled in S7 or
S8, uses a total of at most 2(k − κ) processors.
Let us notice that if constraints (C3) and (C6) are satisfied, then constraints
(C2) and (C5) will also be satisfied. Hence, constraints (C2) and (C5) are relaxed.
We define for each task Tj a binary variable xj such that xj = 1 if Tj is assigned
to a CPU or 0 if Tj is assigned to a GPU. Determining if an allotment satisfying
(C1), (C3), (C4) and (C6) exists reduces to solving a three-dimensional knapsack
minimization problem that can be formulated as follows:
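$$
\begin{aligned}
W_C^* = \min \quad & \sum_{j=1}^{n} \overline{p_j}\, x_j && (1) \\
\text{s.t.} \quad & \frac{1}{2} \sum_{2\lambda/3 \,\ge\, \overline{p_j} \,>\, \lambda/3} x_j \;+ \sum_{\overline{p_j} \,>\, 2\lambda/3} x_j \;\le\; m && (2) \\
& \frac{1}{2} \sum_{2\lambda/3 \,\ge\, \underline{p_j} \,>\, \lambda/3} (1 - x_j) \;+ \sum_{\underline{p_j} \,>\, 2\lambda/3} (1 - x_j) \;\le\; k && (3) \\
& \sum_{j=1}^{n} \underline{p_j}\, (1 - x_j) \;\le\; k\lambda && (4) \\
& x_j \in \{0, 1\} && (5)
\end{aligned}
$$
where $\overline{p_j}$ (resp. $\underline{p_j}$) is the processing time of $T_j$ on a CPU (resp. GPU).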
Equation (1) represents the minimal workload on all the CPUs. Constraint (2)
(resp. Constraint (3)) imposes that no more than m (resp. k) tasks with a processing
time greater than $\frac{2\lambda}{3}$ can be executed on the CPUs (resp. GPUs); we denote their
number by $\mu = \sum_{\overline{p_j} > 2\lambda/3} x_j$ (resp. $\kappa = \sum_{\underline{p_j} > 2\lambda/3} (1 - x_j)$), where
$\overline{p_j}$ (resp. $\underline{p_j}$) is the CPU (resp. GPU) processing time of $T_j$. It also imposes that
there cannot be more than 2(m − µ) (resp. 2(k − κ)) tasks on the CPUs (resp. GPUs) with a
processing time lower than $\frac{2\lambda}{3}$ and greater than $\frac{\lambda}{3}$ (cf. Constraint (C3) (resp.
(C6))). Constraint (4) imposes an upper bound on the computational area on the
GPUs, which is kλ (cf. (C4)).
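To make the role of Constraints (2) to (4) concrete, the sketch below checks a given 0/1 vector x against them and returns the CPU workload of Equation (1). The function name `check_allotment` and its signature are assumptions made for illustration; the threshold tests are written with integer-friendly comparisons (`3 * p > 2 * lam` instead of `p > 2 * lam / 3`) to avoid rounding issues.

```python
def check_allotment(p_cpu, p_gpu, x, m, k, lam):
    """Check a 0/1 allotment x (1 = CPU, 0 = GPU) against constraints (2)-(4).

    Returns (feasible, W_C) where W_C is the CPU workload of Equation (1).
    """
    def is_big(p):       # p > 2*lam/3
        return 3 * p > 2 * lam

    def is_medium(p):    # lam/3 < p <= 2*lam/3
        return lam < 3 * p <= 2 * lam

    mu = sum(1 for pc, xj in zip(p_cpu, x) if xj == 1 and is_big(pc))
    mu_prime = sum(1 for pc, xj in zip(p_cpu, x) if xj == 1 and is_medium(pc))
    kappa = sum(1 for pg, xj in zip(p_gpu, x) if xj == 0 and is_big(pg))
    kappa_prime = sum(1 for pg, xj in zip(p_gpu, x) if xj == 0 and is_medium(pg))

    w_cpu = sum(pc for pc, xj in zip(p_cpu, x) if xj == 1)   # objective (1)
    w_gpu = sum(pg for pg, xj in zip(p_gpu, x) if xj == 0)

    feasible = (
        mu_prime + 2 * mu <= 2 * m              # constraint (2), multiplied by 2
        and kappa_prime + 2 * kappa <= 2 * k    # constraint (3), multiplied by 2
        and w_gpu <= k * lam                    # constraint (4)
    )
    return feasible, w_cpu
```

For example, with λ = 6, one CPU and one GPU, CPU times `[5, 3, 3]` and GPU times `[1, 3, 3]`, the vector `x = [1, 0, 0]` is feasible with a CPU workload of 5, whereas `x = [1, 1, 0]` violates Constraint (2).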
We propose a dynamic programming algorithm in $O(n^2 m^2 k^3)$ to solve the knapsack
problem. For this purpose, we first discretize the processing times of the tasks
on the GPUs. We introduce $\nu_j = \left\lfloor \frac{\underline{p_j}}{\lambda/(3n)} \right\rfloor$, where $\underline{p_j}$ is the GPU processing
time of $T_j$, to represent the number of integer time
intervals of length $\frac{\lambda}{3n}$ required for a task $T_j$ if it is executed on the GPUs, as shown
in Figure 2. $N = \sum_{\pi(j) \in G} \nu_j$ denotes the total integer number of these intervals on the
GPUs. We thus define the error on the processing time of each task,
$\epsilon_j = \underline{p_j} - \nu_j \frac{\lambda}{3n}$,
induced by this time discretization.
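The discretization can be illustrated with a few lines of code. The helper name `discretize` is hypothetical, and the choice λ = 6 with n = 2 tasks (so that the interval length λ/(3n) equals 1) is an assumption made only so that the numbers match the processing times 6.5 and 4.7 used in Figure 2.

```python
import math

def discretize(p_gpu, lam):
    """Round GPU processing times down to multiples of lam / (3n).

    Returns (nu, err): nu[j] is the number of whole intervals of task j,
    err[j] is the rounding error p_gpu[j] - nu[j] * lam / (3n).
    """
    n = len(p_gpu)
    unit = lam / (3 * n)                       # interval length lambda/(3n)
    nu = [math.floor(p / unit) for p in p_gpu]
    err = [p - v * unit for p, v in zip(p_gpu, nu)]
    return nu, err


nu, err = discretize([6.5, 4.7], lam=6.0)      # unit = 6 / (3 * 2) = 1.0
```

Here the two tasks occupy 6 and 4 intervals respectively, and each rounding error is smaller than one interval, which is the per-task bound used in the analysis that follows.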
This result allows us to consider only N states in the dynamic programming
regarding the workload on the GPUs. The error $\epsilon_j$ on each task is at most $\frac{\lambda}{3n}$, so if
all the tasks were assigned to one of the GPUs, we would have underestimated the
processing time on this GPU by at most $n \frac{\lambda}{3n} = \frac{\lambda}{3}$. Constraint (4) becomes:
(6) $N = \sum_{\pi(j) \in G} \nu_j \le 3kn$
The approximated computational area of the GPUs is at most kλ and thus, the
full computational area on GPU remains lower than $k\lambda + \frac{\lambda}{3}$. This allows us to
answer the question of how the remaining tasks to be assigned on GPUs fit in the
constructed schedule:
Lemma 2. The tasks remaining to be assigned on the GPUs after the construction
of S5, S6, S7, S8 fit in the remaining free computational space WR between these
shelves.
[Figure 2 omitted: time axis of a GPU graduated from 0 to 10 in intervals of length λ/(3n), showing Task T1 (processing time p1) and Task T2 (ν2 intervals).]
Figure 2. Rounded allocation of two tasks T1 with p1 = 6.5 and
T2 with p2 = 4.7 on a GPU
Proof. The proof is similar to the one of Lemma 1. If we modify the starting time of
the tasks of S6, currently λ, so that all the working processors complete their tasks
at $\frac{4\lambda}{3} + \frac{\lambda}{3k}$, creating an idle time interval between the end of S5 and the starting
time of S6, the load of a GPU is equal to $\frac{4\lambda}{3} + \frac{\lambda}{3k}$ minus the length of the idle time
interval.
With the same algorithm as for Lemma 1, the only problem that may occur is if
a task $T_i$ remaining to be allocated cannot be completed before the starting time
of the tasks of S6. But at each step, the least loaded processor has a load at most
$\lambda + \frac{\lambda}{3k}$ since the total work area of the tasks is bounded by $k\left(\lambda + \frac{\lambda}{3k}\right)$. Hence, the
idle time interval on the least loaded GPU has a length at least $\frac{\lambda}{3}$ and can contain
the task $T_i$. □
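We define $W_C(j, \mu, \mu', \kappa, \kappa', N)$ as the minimum sum of all the processing times
of the tasks on the CPUs when the first j tasks are considered, with among the
tasks on the CPUs (resp. GPUs), µ (resp. κ) of them having processing times
greater than $\frac{2\lambda}{3}$ and µ′ (resp. κ′) with $\frac{\lambda}{3} < \overline{p_j}$ (resp. $\underline{p_j}$) $\le \frac{2\lambda}{3}$, where
$\overline{p_j}$ (resp. $\underline{p_j}$) is the CPU (resp. GPU) processing time, and where N time
intervals are occupied on the GPUs.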
We use a dynamic programming algorithm to compute the value of $W_C(j, \mu, \mu', \kappa, \kappa', N)$
with the values of $W_C$ with j − 1 tasks considered that were previously computed.
The optimal value of the computational area $W_C$ on the CPUs will be given by
$$
W_C^* = \min_{\substack{0 \le \mu \le m,\; 0 \le \mu' \le 2(m - \mu), \\ 0 \le \kappa \le k,\; 0 \le \kappa' \le 2(k - \kappa), \\ 0 \le N \le 3kn}} W_C(n, \mu, \mu', \kappa, \kappa', N)
$$
If $W_C^*$ is greater than mλ, then there exists no solution with a makespan at most λ,
and the algorithm answers “NO” in the dual approximation framework. Otherwise,
the guess λ is large enough and we construct a feasible solution with a makespan at
most $\frac{4\lambda}{3} + \frac{\lambda}{3k}$, with the shelves and the corresponding µ, µ′, κ, κ′ and N values.
The dynamic programming algorithm represents one step of the dual-approximation
algorithm, with a fixed guess λ. A binary search is then used to try different guesses
to approach the optimal makespan as explained in Section 3.1.
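One possible way to organize this dynamic program is sketched below. It is an illustration under several assumptions rather than the authors' implementation: the name `min_cpu_work` is hypothetical, the states are stored sparsely in a dictionary instead of a six-dimensional table, tasks longer than λ on a resource are simply not placed on it, and only the value $W_C^*$ is returned (recovering the allotment would require storing predecessors).

```python
import math

def min_cpu_work(p_cpu, p_gpu, m, k, lam):
    """Minimum CPU workload over the states of one dual-approximation step.

    A state is (mu, mu', kappa, kappa', N). Returns None if no state is
    reachable; the caller answers "NO" when the result is None or > m * lam.
    """
    n = len(p_cpu)
    unit = lam / (3 * n)                     # discretization step lambda/(3n)
    states = {(0, 0, 0, 0, 0): 0.0}          # state -> minimum CPU work so far
    for pc, pg in zip(p_cpu, p_gpu):
        nxt = {}
        for (mu, mu2, ka, ka2, N), w in states.items():
            if pc <= lam:                    # option 1: run the task on a CPU
                s = (mu + (3 * pc > 2 * lam),
                     mu2 + (lam < 3 * pc <= 2 * lam), ka, ka2, N)
                if 2 * s[0] + s[1] <= 2 * m and w + pc < nxt.get(s, math.inf):
                    nxt[s] = w + pc
            if pg <= lam:                    # option 2: run the task on a GPU
                s = (mu, mu2, ka + (3 * pg > 2 * lam),
                     ka2 + (lam < 3 * pg <= 2 * lam),
                     N + math.floor(pg / unit))
                if (2 * s[2] + s[3] <= 2 * k and s[4] <= 3 * k * n
                        and w < nxt.get(s, math.inf)):
                    nxt[s] = w
        states = nxt
    return min(states.values(), default=None)


# Three tasks, one CPU and one GPU, guess lambda = 9 (so lambda/(3n) = 1).
w_star = min_cpu_work([8, 5, 2], [4, 5, 2], m=1, k=1, lam=9)
```

In this small instance the best state keeps only the shortest task on the CPU, so the returned workload is 2, which is below mλ = 9 and the guess is accepted. With the guess λ = 4, the second task fits on neither resource and the function returns `None`.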
Cost Analysis. Solving the dynamic program for a fixed value of
λ requires considering $O(n^2 m^2 k^3)$ states, since $1 \le j \le n$, $0 \le \mu \le m$,
$0 \le \mu' \le 2(m - \mu)$, $0 \le \kappa \le k$, $0 \le \kappa' \le 2(k - \kappa)$, and $0 \le N \le 3kn$. Therefore, the time
complexity of each step of the binary search is $O(n^2 m^2 k^3)$.
4. Analysis of a Special Case
We consider, in this section, a version of the problem $(Pm, Pk) \,||\, C_{\max}$ where
all the tasks are accelerated when assigned to a GPU, since this is the case in most
applications, i.e. $\underline{p_j} \le \overline{p_j}$ for $j = 1, \dots, n$, where $\overline{p_j}$ (resp. $\underline{p_j}$) is the
processing time of $T_j$ on a CPU (resp. GPU); the tasks do not necessarily have the same
acceleration factor on GPU. When considering this special case in the algorithm
presented in the previous section, we note no improvement in the time complexity
of the algorithm, even though the problem is simpler. Therefore we present a new
algorithm for this case, based on a similar scheme, but with a much lower time
complexity for an approximation ratio of $\frac{3}{2}$.
The algorithm is also based on the dual approximation technique. At each step,
we have a guess on the optimal makespan. Let us consider one step of the dual
approximation scheme and let as before λ denote the current guess.
The idea is to divide the set of tasks T into four sets of tasks, two of them whose
tasks will be assigned to a CPU, $C_1$ and $C_2$, and the other two whose tasks will be
assigned to a GPU, $G_1$ and $G_2$. We denote the cardinality of set $C_i$ (resp. $G_i$) by
$|C_i|$ (resp. $|G_i|$), $i = 1, 2$.
The first step of the algorithm consists in a preliminary assignment of the tasks
of the whole set T to two of the sets: $G_1$ and $C_2$. This assignment of each task $T_j$ is
done by considering the value of the processing time $p_j$ of $T_j$ on CPU. The possible
cases are described below:
• If $p_j \le \frac{\lambda}{2}$, task $T_j$ is assigned to $C_2$.
• Otherwise, $\frac{\lambda}{2} < p_j$, and task $T_j$ is assigned to $G_1$.
This assignment does not guarantee that the resulting schedule will have a
makespan lower than $\frac{3}{2}\lambda$, even if there exists a schedule of makespan lower than
$\lambda$. However, if $|G_1| > k + m$, there are too many tasks with a processing time
on CPU greater than $\frac{\lambda}{2}$ to fit in a schedule of makespan lower than $\lambda$. In that
case, the dual approximation rejects the guess $\lambda$.
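The preliminary assignment and the rejection test can be sketched as follows. This is a minimal illustration for a fixed guess; the function name, the representation of tasks by their indices and the use of `None` to signal a rejected guess are assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple


def preliminary_assignment(
    cpu_times: Sequence[float], lam: float, m: int, k: int
) -> Optional[Tuple[List[int], List[int]]]:
    """First step of the special-case algorithm for a fixed guess lam.

    cpu_times[j] is the CPU processing time p_j of task T_j.
    Returns (G1, C2) as lists of task indices, or None when the guess is
    rejected because more than k + m tasks have p_j > lam / 2.
    """
    g1 = [j for j, p in enumerate(cpu_times) if p > lam / 2]
    c2 = [j for j, p in enumerate(cpu_times) if p <= lam / 2]
    if len(g1) > k + m:
        return None  # too many long tasks: reject the guess lam
    return g1, c2
```

For example, with CPU times 1, 6, 5, 9 and a guess of 10, tasks 1 and 3 go to G1 and tasks 0 and 2 go to C2 when m = k = 1, whereas the guess is rejected when m = 1 and k = 0.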
In the second step of the algorithm, in order to achieve the desired performance
ratio, we have to reassign some of the tasks allocated to $C_2$ to $G_2$ and some of the
tasks allocated to $G_1$ to $C_1$. We order the tasks of sets $C_2$ and $G_1$ in decreasing order
of the computational surface change induced when a task $T_j$ changes from a CPU
to a GPU, $p_j - \overline{p_j}$. One exception has to be made for set $G_1$: some tasks can have a
processing time on CPU larger than $\lambda$. These tasks are too big to fit on the CPUs
with the current guess. They cannot be reassigned and are put at the end of $G_2$, no
matter the impact they can have on the computational surfaces. The tasks of the
two sets will be examined in this new order. We can note that at most m tasks of
set $G_1$ can be reassigned to $C_1$. The following steps are therefore repeated at most
$m + 1$ times, as long as we have $W_C > m\lambda + \frac{\lambda}{2}$ or $W_G > k\lambda + \frac{\lambda}{2}$:
• While $W_G = \sum_{T_i \in G_1 \cup G_2} \overline{p_i} \le k\lambda$, assign the first task of set $C_2$ to $G_2$.
• If $|C_1| < m$, assign the first task from $G_1$ to $C_1$.
With this assignment, the computational area on the CPUs has been reduced to a
minimum with the constraint of keeping the computational area on the GPUs lower
than $k\lambda + \frac{\lambda}{2}$. Therefore, the value of $W_C$ obtained by our algorithm is smaller than
the value of the computational area on the CPUs of the optimal schedule, the most
accelerated tasks having been assigned to the GPUs. Hence, if $W_C > m\lambda + \frac{\lambda}{2}$,
we conclude that the value of $\lambda$ is too small and adjust the bounds of our binary
search accordingly.
If $W_C \le m\lambda + \frac{\lambda}{2}$, we can construct a feasible schedule with a makespan lower
than $\frac{3}{2}\lambda$ with the previous algorithm. Indeed, the number of tasks in $C_1$ is lower
than m, so we can build a shelf $S_1$ as we did in Section 3, occupying $|C_1|$ CPUs,
with a length at most $\lambda$. The same arguments given in the proof of Lemma 1 can
be used for building a shelf $S_2$ of length $\frac{\lambda}{2}$, and all the tasks from $C_2$ can be fitted in
the schedule as before. For the GPUs, the algorithm makes sure that the number
of tasks in $G_1$ is lower than k and that $W_G$ does not go over the bound of $k\lambda + \frac{\lambda}{2}$,
so shelves similar to $S_1$ and $S_2$ can be built easily. However, we did not have to
make any discretization on the processing times of the tasks assigned to the GPUs
here, so, contrary to Section 3, we get the same performance ratio of $\frac{3}{2}$ for any
number k of GPUs. The time complexity of an algorithm based on this principle
is in $O(mn \log n)$.
5. Experiments
We ran experiments in order to show the efficiency of the proposed algorithmic
scheme. They are twofold: we first ran a relaxed version of our algorithm
by simulations on random instances and compared it to the classical reference
algorithm HEFT [14], used on several actual systems (HEFT stands for Heterogeneous
Earliest Finish Time). Then, we implemented the algorithm on top of
the xKaapi runtime system.
5.1. Preliminary: Analysis of HEFT. HEFT proceeds in two phases: a prioritization
phase, in which the tasks are sorted by decreasing average execution time,
followed by a processor selection phase based on the heterogeneous earliest finish
time rule.
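For independent tasks, these two phases can be sketched as follows. This is a simplified reading and not the reference implementation of HEFT: the priority is assumed to be the execution time averaged over the m + k processing units, ties between a CPU and a GPU are broken in favour of the GPU, and the function name and return format are chosen for illustration only.

```python
import heapq
from typing import List, Sequence, Tuple


def heft_independent(
    cpu_times: Sequence[float], gpu_times: Sequence[float], m: int, k: int
) -> Tuple[float, List[str]]:
    """HEFT-like list scheduling of independent tasks on m CPUs and k GPUs.

    Assumption: the priority of a task is its execution time averaged over
    the m + k processing units (m + k >= 1). Returns (makespan, placement),
    where placement[j] is "CPU" or "GPU".
    """
    n = len(cpu_times)
    avg = [(m * cpu_times[j] + k * gpu_times[j]) / (m + k) for j in range(n)]
    order = sorted(range(n), key=lambda j: avg[j], reverse=True)

    # Min-heaps of the times at which each processing unit becomes free.
    cpus = [0.0] * m
    gpus = [0.0] * k
    placement = [""] * n
    makespan = 0.0
    for j in order:
        # Units of one type are identical: only the earliest-free one matters.
        cpu_finish = cpus[0] + cpu_times[j] if m > 0 else float("inf")
        gpu_finish = gpus[0] + gpu_times[j] if k > 0 else float("inf")
        if gpu_finish <= cpu_finish:
            heapq.heapreplace(gpus, gpu_finish)
            placement[j], finish = "GPU", gpu_finish
        else:
            heapq.heapreplace(cpus, cpu_finish)
            placement[j], finish = "CPU", cpu_finish
        makespan = max(makespan, finish)
    return makespan, placement
```

Because each decision only looks at the finish time of the current task, a long task can be sent to a slow unit as soon as the fast units are busy, which is the behaviour exploited by the lower bound that follows.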
Lemma 3. The worst case performance ratio of HEFT is larger than m/2.
Proof. We show on the following instance that the prioritizing phase can provide
a schedule whose makespan is far from the optimum. Let us consider an instance
with a list of the following tasks: m tasks such that $p = 1$ and $\overline{p} = \epsilon$ and, for
$i = 0, \dots, m - 1$: a single task of type A such that $p = 1 - i/m$ and $\overline{p} = 1 - i/m$,
and $m - 1$ tasks of type B such that $p = 1 - i/m$ and $\overline{p} = 1/m^2$. All these tasks
are executed faster on the GPUs.
For this instance, HEFT first fills the m CPUs. Then, it alternately fills the
GPU with one task of type A and the m CPUs with m tasks of type B. HEFT ends
with a makespan equal to $m/2 + 3/2 - 1/m$. It is easy to check that the optimal
makespan is equal to 1. □
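The bound of Lemma 3 follows directly from these two makespans. In the worked step below, $C_{\max}^{\mathrm{HEFT}}$ and $C_{\max}^{*}$ are labels introduced here for the makespan of HEFT and the optimal makespan on this instance:

```latex
\[
\frac{C_{\max}^{\mathrm{HEFT}}}{C_{\max}^{*}}
= \frac{m/2 + 3/2 - 1/m}{1}
= \frac{m}{2} + \left(\frac{3}{2} - \frac{1}{m}\right)
> \frac{m}{2}
\quad \text{for all } m \ge 1,
\qquad \text{since } \frac{3}{2} - \frac{1}{m} \ge \frac{1}{2} > 0 .
\]
```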
5.2. Simulations. The dual approximation based algorithm for problem (Pm,Pk) ||
Cmax presented in Section 3 provides a performance ratio of $\frac{4}{3} + \frac{1}{3k}$ with a reasonable
time complexity. However, this running cost is not comparable to the one of
HEFT, which basically only needs to sort the tasks ($O(n \log n)$). This method can be
modified in order to obtain a slightly worse performance ratio of 2 in a much lower
time $O(n^2 k)$, which would make it comparable to HEFT in terms of running time
and still provide a performance guarantee. The idea is based on leaving aside the
constraints ordering the tasks into shelves. The only constraint that remains is
the one on the computational area on the GPUs being lower than $k\lambda$, $\lambda$ being the
current guess of the dual approximation. With the optimal computational area on
the CPUs under this constraint determined by dynamic programming, we build a
schedule with a makespan lower than twice the optimal value. This 2-approximation
algorithm, denoted in short by DP in what follows, was implemented and compared
to HEFT by simulations based on various classes of instances. Moreover, for the
special case where all the tasks are accelerated, we implemented the algorithm presented
in Section 4, denoted by Accel, which provides a performance ratio of $\frac{3}{2}$ with
a time complexity of $O(mn \log n)$. All these algorithms are implemented in the C++
programming language and run on a 3.4 GHz PC with 15.7 GB of RAM.
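The core of the relaxed method, minimizing the CPU area under a bound on the GPU area, is a knapsack-type dynamic program. The sketch below is an illustration only, not the authors' C++ implementation: it assumes integer GPU processing times and an integer budget (for instance the integer part of $k\lambda$), so it runs in time proportional to n times the budget and does not reproduce the discretization behind the $O(n^2 k)$ bound. The function name and return format are hypothetical.

```python
from typing import List, Sequence, Tuple


def min_cpu_area(
    cpu_times: Sequence[float], gpu_times: Sequence[int], gpu_budget: int
) -> Tuple[float, List[bool]]:
    """Minimize the CPU area subject to a GPU area of at most gpu_budget.

    Assumes integer GPU processing times. Returns (W_C, on_gpu), where
    on_gpu[j] tells whether task j is placed on the GPUs.
    """
    n = len(cpu_times)
    inf = float("inf")
    # best[j][g]: minimal CPU area for the first j tasks, GPU area exactly g.
    best = [[inf] * (gpu_budget + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for j in range(1, n + 1):
        p, q = cpu_times[j - 1], gpu_times[j - 1]
        for g in range(gpu_budget + 1):
            on_cpu = best[j - 1][g] + p
            on_gpu = best[j - 1][g - q] if g >= q else inf
            best[j][g] = min(on_cpu, on_gpu)
    # Pick the best final GPU area, then walk back to recover the choices.
    g = min(range(gpu_budget + 1), key=lambda x: best[n][x])
    w_c = best[n][g]
    on_gpu_flags = [False] * n
    for j in range(n, 0, -1):
        q = gpu_times[j - 1]
        if g >= q and best[j][g] == best[j - 1][g - q]:
            on_gpu_flags[j - 1] = True
            g -= q
    return w_c, on_gpu_flags
```

In the dual approximation framework, the guess would then be rejected when the returned area exceeds $m\lambda$; otherwise the tasks of each side can be placed greedily on the least loaded unit of their type.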
We report below a series of experiments run on random instances of various sizes:
from 10 to 1000 tasks, with a step of 10 tasks, $2^a$ CPUs, a varying from 0 to 6, and $2^b$
GPUs, b varying from 0 to 3. For each combination of these sizes, 30 instances were
considered, bringing us to a total of 10500 tested instances. The processing times on
the CPUs are randomly generated using the uniform distribution U[10, 100]. The
distribution of the acceleration factors on the GPUs has been measured in [24] using
the classical numerical kernels of Magma [25] in a multi-core multi-GPU machine
hosted by the Grid’5000 infrastructure experimental platform [26]. We extracted
a distribution of the acceleration factors which reflects the qualitative speed-up on
classical numerical kernels: we assign to each task an acceleration factor
$\alpha_j = \overline{p_j} / p_j$ of 1/15 or 1/35 with a probability of 1/2. The resulting processing times on the
GPUs are thus $\overline{p_j} = \alpha_j p_j$. Since in this generation scheme all the tasks of these
instances are accelerated on GPU, DP, HEFT and Accel were all compared on these
instances. The running time of the three algorithms is always under one second,
even for the largest instances. We calculated the mean and maximal deviations of
the makespans of the solutions returned by these algorithms from the lower bound
of the makespan derived from the binary search of the approximation algorithm,
over all the instances.
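The generation scheme and the deviation measure can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and the deviation is assumed to be the relative gap to the lower bound expressed in percent, which matches the correspondence between a 100% deviation and a performance ratio of 2 used in the discussion of the results.

```python
import random
from typing import List, Tuple


def random_instance(n: int, rng: random.Random) -> Tuple[List[float], List[float]]:
    """Draw n tasks: CPU times from U[10, 100], GPU times alpha_j * p_j."""
    cpu_times = [rng.uniform(10.0, 100.0) for _ in range(n)]
    # Acceleration factor 1/15 or 1/35, each with probability 1/2.
    alphas = [rng.choice((1 / 15, 1 / 35)) for _ in range(n)]
    gpu_times = [p * a for p, a in zip(cpu_times, alphas)]
    return cpu_times, gpu_times


def deviation_percent(makespan: float, lower_bound: float) -> float:
    """Relative deviation of a makespan from a lower bound, in percent."""
    return 100.0 * (makespan - lower_bound) / lower_bound
```

With this convention, a makespan of 15 against a lower bound of 10 gives a deviation of 50%, the value associated with a performance ratio of $\frac{3}{2}$.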
Table 1. Maximal deviations (%) for DP, HEFT and Accel
n       120     160     220     260     360     380     660     700     760     780     920     940
DP      76.88   72.73   70.37   69.14   70.00   70.00   67.42   50.82   42.77   54.47   91.77   63.07
HEFT    123.53  98.44   92.55   91.90   110.37  91.78   113.48  98.10   98.77   103.15  116.46  96.31
Accel   46.15   42.86   50.00   41.18   37.82   43.59   36.36   40.91   32.52   37.04   48.24   34.65
Figure 3. Mean deviations of DP and HEFT for various n
As can be seen in Table 1, the maximal deviations of DP are usually below the
maximal deviations of HEFT and, more importantly, these deviations respect the
theoretical performance guarantee in the case of DP, whereas the maximal deviations
of HEFT sometimes go over the 100% barrier corresponding to a performance ratio
of 2. The same can be said for Accel, with maximal deviations staying below the
50% barrier corresponding to a performance ratio of $\frac{3}{2}$.
Figure 3 shows that, on average, DP even outperforms HEFT for large instances.
However, Accel remains slightly above HEFT, staying close to its $\frac{3}{2}$ bound but never
going above it, contrary to HEFT, which does not have a performance guarantee,
as seen in Table 1. Its better performance ratio of $\frac{3}{2}$ and its low cost make Accel
preferable to DP for intensive practical use, with a better worst-case performance
guarantee.
5.3. Experiments on a real run-time. We investigate in this section the practical
use of the family of algorithms presented above. We target classical linear algebra
kernels, since they are extensively used and they generate a loop of independent
tasks. Here, the computations consist of a series of independent tasks. DP (with
ratio 2), HEFT and the $\frac{4}{3}$-approximation algorithm were implemented in the scheduler
of the xKaapi run-time system [11].
Implementation of the $\frac{4}{3}$-approximation algorithm. In most linear algebra
applications, the block size is the most important characteristic to maximize the
performance. Indeed, the block size has a direct impact on memory transfers between
the host and the accelerators, on cache effects and on the task graph size.
In the experiments of Figure 4, we study the variation of the computation
time as a function of the block size, for a fixed matrix size, of a Cholesky
factorization extracted from the MAGMA library. We use a single GPU and the
matrix is decomposed into simple square blocks. The experiments have been conducted
on a quad-core Intel i7-3840QM with hyperthreading and an NVIDIA Quadro
K1000M GPU.
Figure 4. Execution time of a Cholesky factorization scheduled
by DP, 4/3 and HEFT for various block sizes, on 3 hyperthreaded
cores and a single GPU
As the block size decreases, the number of independent tasks increases. Thus,
the computation time of the scheduling using the $\frac{4}{3}$-approximation algorithm also
increases quadratically with the number of tasks. As a result, the scheduling time
dominates the execution time saved for large block size (which corresponds to a small
number of tasks). The $\frac{4}{3}$-approximation algorithm is therefore usable mostly for
cases where the computation time is larger than the scheduling time. It is probably
not the best suited algorithm for linear algebra kernels.
Practical issues: DP versus HEFT. We now compare the relaxed version of the
algorithm (DP) with HEFT on a machine with several GPUs. The experiments have
been conducted on a heterogeneous, multi-GPU system composed of two six-core
Intel Xeon X5650 CPUs running at 2.66 GHz with 72 GB of memory. This parallel
system is enhanced with eight NVIDIA Tesla C2050 GPUs (Fermi architecture) of
448 GPU cores (scalar processors) running at 1.15 GHz each (2688 GPU cores total)
with 3 GB GDDR5 per GPU (18 GB in total). It has 4 PCIe switches to support
up to 8 GPUs. When 2 GPUs share a switch, their aggregated PCIe bandwidth is
bounded by the one of a single PCIe 16x.
Algorithm   Gflops   Memory transfer (GB)
HEFT        535      2.62
DP          565      1.91
Table 2. Performance of DP and HEFT for Cholesky factorization
with m=4 CPUs and k=8 GPUs
The implementation of DP was combined with an improved local mapping for
minimizing the data transfers [27]. With 8 GPUs, DP outperforms HEFT both in
raw performance and in memory transfers (see Table 2). The execution times are
close to each other in all cases, but DP has the following three major advantages:
• a guarantee on the scheduling results,
• a decrease in the volume of communication,
• no need for an accurate communication model.
6. Concluding Remarks
In this paper, we presented a new scheduling algorithm using a generic methodology
(as opposed to specific ad hoc algorithms) for hybrid architectures (multi-core
machines with GPUs). We proposed fast algorithms with a constant approximation
ratio in the case of independent tasks and a faster version in the special
case where all the tasks are accelerated when assigned to GPU. A ratio of
$\frac{4}{3} + \frac{1}{3k} + \epsilon$ is achieved for k GPUs in the general case, and a ratio of
$\frac{3}{2} + \epsilon$ when the tasks are all accelerated on GPU. The main idea of the approach
is to determine an adequate partition of the set of tasks on the CPUs and the GPUs
using a dual approximation scheme. A simulation and an experimental analysis on a
real run-time system have been provided to assess the computational efficiency of the
proposed methods. The main conclusion is that these algorithms are stable because of
their approximation guarantees; however, the overall running time is often dominated
by the cost of the scheduling itself, leading to inefficiency if the number of tasks is
too small. According to our experimental setting, the relaxed version with approximation
ratio equal to 2 was the best trade-off.
As a continuation of this work, we plan to extend the analysis to more
general problems where the tasks are dependent, and to larger numbers of tasks. We
also plan to test the robustness of the presented algorithms to perturbations in
the execution times, which are only estimated on real-life computing platforms.
References
[1] Lee VW, Kim C, Chhugani J, Deisher M, Kim D, Nguyen AD, Satish N, Smelyanskiy M,
Chennupaty S, Hammarlund P, et al. Debunking the 100x gpu vs. cpu myth: an evaluation
of throughput computing on cpu and gpu. ISCA, Seznec A, Weiser UC, Ronen R (eds.),
ACM, 2010; 451–460.
[2] Hochbaum DS, Shmoys DB. A polynomial approximation scheme for scheduling on uniform
processors: using the dual approximation approach. SIAM Journal on Computing 1988;
17(3):539–551.
[3] Bonifaci V, Wiese A. Scheduling unrelated machines of few different types. CoRR 2012;
abs/1205.0974.
[4] Pinel F, Dorronsoro B, Bouvry P. Solving very large instances of the scheduling of independent
tasks problem on the gpu. Journal of Parallel Distrib. Comput. 2012; .
[5] Agullo E, Augonnet C, Dongarra J, Faverge M, Ltaief H, Thibault S, Tomov S. Qr factor-
ization on a multicore node enhanced with multiple gpu accelerators. IEEE Int. Parallel &
Distributed Processing Symposium (IPDPS), 2011.
[6] Song F, Tomov S, Dongarra J. Enabling and scaling matrix computations on heterogeneous
multi-core and multi-gpu systems. 26th ACM International Conference on Supercomputing
(ICS 2012), ACM: Venice, Italy, 2012.
[7] Boukerche A, Correa JM, Melo A, Jacobi RP. A hardware accelerator for the fast retrieval
of dialign biological sequence alignments in linear space. IEEE Transactions on Computers
2010; 59:808–821.
[8] Phillips JC, Stone JE, Schulten K. Adapting a message-driven parallel application to gpu-
accelerated clusters. SC, 2008.
[9] Bueno J, Planas J, Duran A, Badia RM, Martorell X, Ayguadé E, Labarta J. Productive
programming of gpu clusters with ompss. IPDPS, IEEE Computer Society, 2012; 557–568.
[10] Augonnet C, Thibault S, Namyst R, Wacrenier PA. Starpu: A unified platform for task sched-
uling on heterogeneous multicore architectures. Concurrency and Computation: Practice and
Experience 2011; 23:187–198.
[11] Gautier T, Ferreira L, Joao V, Maillard N, Raffin B. Xkaapi: A runtime system for data-
flow task programming on heterogeneous architectures. Proc. of IEEE Int. Parallel and Dis-
tributed Processing Symposium (IPDPS), Boston, USA, 2013.
[12] Chen L, Ye D, Zhang G. Online scheduling on a cpu-gpu cluster. TAMC 2013; 7876:1–9.
[13] Hochbaum DS, Shmoys DB. Using dual approximation algorithms for scheduling problems:
theoretical and practical results. J. ACM 1987; 34(1):144–162.
[14] Topcuoglu H, Hariri S, Wu MY. Performance-effective and low-complexity task scheduling
for heterogeneous computing. IEEE TPDS 2002; 13(3):260–274.
[15] Kedad-Sidhoum S, Monna F, Mounié G, Trystram D. Scheduling independent tasks on multi-
cores with gpu accelerators. Proc. HeteroPar 2013, Aachen August 2013; :228–237.
[16] Garey MR, Graham RL. Bounds for multiprocessor scheduling with resource constraints.
SIAM Journal on Computing 1975; 4:187–200.
[17] Blazewicz J, Ecker K, Pesch E, Schmidt G, Weglarz J. Handbook on Scheduling, From Theory
to Applications, International Handbooks on Information Systems. Springer, 2007.
[18] Lenstra JK, Shmoys DB, Tardos E. Approximation algorithms for scheduling unrelated par-
allel machines. Mathematical Programming 1988; 46:259–271.
[19] Shmoys DB, Tardos E. An approximation algorithm for the generalized assignment problem.
Mathematical Programming 1993; 62:461–474.
[20] Shchepin EV, Vakhania N. An optimal rounding gives a better approximation for scheduling
unrelated machines. Operations Research Letters 2004; 33:127–133.
[21] Friesen DK. Tighter bounds for lpt scheduling on uniform processors. SIAM Journal on
Computing 1987; 16(3):554–560.
[22] Nélis V, Raravi G. A ptas for assigning sporadic tasks on two-type heterogeneous multipro-
cessors. RTSS 2012; .
[23] Imreh C. Scheduling problems on two sets of identical machines. Computing 2003; 70:277–
294.
[24] Seifu S. Scheduling on heterogeneous cluster environments. Master’s Thesis, Grenoble uni-
versity Jun 2012.
[25] Agullo E, Demmel J, Dongarra J, Hadri B, Kurzak J, Langou J, Ltaief H, Luszczek P, Tomov
S. Numerical linear algebra on emerging architectures: The plasma and magma projects.
Journal of Physics: Conference Series 2009; 180.
[26] Bolze R, Cappello F, Caron E, Daydé MJ, Desprez F, Jeannot E, Jégou Y, Lanteri S, Leduc J,
Melab N, et al. Grid’5000: A large scale and highly reconfigurable experimental grid testbed.
IJHPCA 2006; 20(4):481–494.
[27] Bleuse R, Gautier T, Lima JF, Mounié G, Trystram D. Scheduling data flow program in
xkaapi: A new affinity-based algorithm for heterogeneous architectures. 20th International
European Conference on Parallel Processing, ARCoSS/LNCS, Springer: Porto, Portugal,
2014. To appear.
