An Adaptive Performance-oriented Scheduler for Static and Dynamic
  Heterogeneity by Chen, Jing et al.
Jing Chen
Chalmers University of Technology
chjing@chalmers.se
Pirah Noor Soomro
Chalmers University of Technology
pirah@chalmers.se
Mustafa Abduljabbar
Chalmers University of Technology
musabdu@chalmers.se
Miquel Pericàs
Chalmers University of Technology
miquelp@chalmers.se
Abstract
With the emergence of heterogeneous hardware paving the way
for the post-Moore era, it is of high importance to adapt the
runtime scheduling to the platform’s heterogeneity. To enhance
adaptive and responsive scheduling, we introduce a Performance Trace
Table (PTT) into XiTAO, a framework for elastic scheduling of
mixed-mode parallelism. The PTT is an extensible and dynamic
lightweight manifest of the per-core latency that can be used to
guide the scheduling of both critical and non-critical tasks. By
understanding the per-task latency, the PTT can infer task
performance, intra-application interference as well as inter-application
interference. We run random Directed Acyclic Graphs (DAGs) of
different workload categories as a benchmark on an NVIDIA Jetson TX2
chip, achieving up to 3.25× speedup over a standard work stealing
scheduler. To exemplify scheduling adaptation to interference,
we run DAGs with high parallelism and analyze the scheduler’s
response to interference from a background process on an Intel
Haswell (2650v3) multicore workstation. We also showcase
XiTAO’s scheduling performance by porting the VGG-16 image
classification framework based on Convolutional Neural Networks
(CNN).
CCS Concepts • Software and its engineering → Runtime
environments;
Keywords Heterogeneous architectures, Interference, Performance,
Dynamic scheduling
1 Introduction
In order to deal with the stringent energy constraints of current IT
systems, modern HPC systems are increasingly being architected
as heterogeneous platforms. Systems may include both static and
dynamic sources of heterogeneity. Static heterogeneity sources
are those that are fixed at design time, for example, single-ISA
cores with different power efficiency (e.g. big.LITTLE), or
asymmetric ISA systems consisting of processor cores and accelerators
(e.g. CPU/GPU/FPGA platforms). Dynamic sources of heterogeneity
are those that arise from runtime reconfiguration. One example is
the usage of Dynamic Voltage-Frequency Scaling (DVFS) [11] to tune
the performance and efficiency of individual cores or clusters of
cores. Another example is
the usage of cache partitioning to tune cores to the working sets
of applications [9]. The landscape of heterogeneous configurations
creates a challenging scheduling problem.
To make matters worse, there are many sources of uncontrolled
heterogeneity, generally called interference. Interference refers to
performance degradation resulting from shared resources.
Interference can occur within a process, e.g. resulting from
oversubscription of caches and memory bandwidth, or it can occur across
processes, e.g. resulting from time-sharing of a single processing
unit in order to run multiple concurrent applications. Designing
scheduling strategies that address all these sources of
heterogeneity under an optimization target (performance, energy,
etc.) is a complex task. A solution necessarily requires online
monitoring that can identify and adapt to both static and dynamic
heterogeneity.
A scheduling solution for multithreaded computations requires
an overlying execution model that is flexible and effective for
performance portability. Prior research suggests that mixed-mode parallel
computations, in which the nodes of a computational task-DAG
are themselves parallel computations that can be assigned different
amounts of processing resources [3, 12, 19], are a promising
execution model for modern hierarchical and heterogeneous platforms.
An important result from our research [12] is that greedy
scheduling [2] is often an undesirable goal as it easily leads to resource
oversubscription, particularly with the bandwidth-constrained and
cache-constrained nature of modern compute systems¹. This occurs
because greedy schedulers will always try to schedule a ready task
as long as there are idle processing units, even if the resulting
interference results in a global deterioration of performance. One way
to avoid the problem of resource oversubscription is to aggregate
enough resources into a single execution place such that
potentially interfering processes are forced to wait. In the context of
mixed-mode parallelism this translates into providing the internally
parallel tasks with enough resources to avoid interference.
The XiTAO library² is an embodiment of this concept. In XiTAO,
parallel tasks are scheduled into resource partitions called Elastic
Places [12]. The method has been shown to perform efficiently on
homogeneous manycore and NUMA systems. To achieve higher
power efficiency, however, it is important to make XiTAO aware
of heterogeneous platforms. Furthermore, XiTAO traditionally
requires the programmer to statically determine the size of the
N-core-places. This property limits productivity and performance
portability.
In this paper, we explore and propose schemes to automatically
determine resource partitions at runtime. Furthermore, we research
how this knowledge can be used to leverage modern single-ISA
platforms with both static and dynamic sources of heterogeneity. To
this end, we present a scheduler inspired by Criticality-Aware Task
Scheduling (CATS) [6] and extend it with a performance trace table
(PTT) that monitors the system’s performance characteristics at
runtime. Despite its simplicity, the PTT provides enough
information to implement both heterogeneity-aware and interference-free
schedules at runtime at minimum cost. Most notably, these features
are achieved with no static knowledge about the features of the
application and no platform knowledge beyond what can be easily
obtained with a tool such as hwloc [8].
¹ https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
² https://sites.google.com/site/mpericas/xitao
arXiv:1905.00673v1 [cs.DC] 2 May 2019
To validate our proposal, we evaluate the scheme using irregular
DAGs and the VGG-16 neural network on both an NVIDIA Jetson
TX2 development board featuring a heterogeneous multicore
architecture, and on an Intel Xeon Haswell platform composed of two
NUMA nodes and a total of 20 cores. Our experiments focus on the
benefits of criticality-aware and interference-free scheduling on
both static and dynamic heterogeneity. Sources of heterogeneity
include static platform characteristics and dynamic interference
episodes, such as co-scheduling of conflicting processes. We
conclude by analyzing the scalability of a VGG-16 implementation on
the PTT-enabled XiTAO framework.
In summary, this paper contributes:
• A heterogeneous scheduler on top of XiTAO and a study of
the impact of its main tuning parameters.
• A tracing scheme called Performance Trace Table (PTT)
that makes it possible to infer the dynamic system heterogeneity.
• An in-depth evaluation of the adaptability of the PTT in the
context of interference and architectural heterogeneity.
• A showcase implementation of VGG-16 on XiTAO and a
scalability analysis.
The remainder of this paper is organized as follows. Section 2
describes the background of graph theory which is used to describe
computations. We present our approach in Section 3. Section 4
describes the experimental setup which is used to evaluate our
approach (Section 5). Section 6 describes related work, while Section 7
concludes the work.
2 Task-DAG Scheduling
This work focuses on computations that can be described as DAGs
(Directed Acyclic Graphs). An example of a task-DAG is shown in
Figure 1. We define the critical path of a task-DAG as its longest
path. In Figure 1, the critical path is marked with dotted lines and
has length 5. It is obtained by traversing the nodes A → C →
G → D → F. Tasks in the critical path are from now on called critical
tasks, while the others are non-critical tasks. Nodes B and E are
non-critical tasks in Figure 1. The task-DAG in Figure 1 also shows
the criticality value of each node. The criticality values are set
by traversing the DAG bottom-up until the start node(s) are reached.
Hence, this requires the full DAG to be available
before execution. The criticality value of a node is set to the maximum
criticality of its children plus 1. Effectively, this results in the first
node of the longest path having the highest criticality value. As we
can see, task A has the highest criticality. This strategy also shows
how criticality can be determined at runtime: the task’s current
criticality value is compared with the parent’s criticality value. If
the difference is 1, then the task is on the critical path. Another
important concept in this paper is the average DAG parallelism,
which we define as
$\text{Parallelism} = \frac{\text{Number of total tasks}}{\text{Number of critical tasks}}$. For
example, the parallelism in Figure 1 is 7/5 = 1.4.
Figure 1. An illustration of a DAG with a critical path in dotted
lines. Different colors represent different kernels. The number of
tasks is seven and the critical path has length five.
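As an illustration of the bottom-up criticality rule and the parallelism metric, the following Python sketch recomputes both for a seven-node DAG resembling Figure 1. The edges A → C → G → D → F, A → E and B → G follow the text; the edge E → D and all function names are assumptions made for this example and are not part of XiTAO.

```python
# Child lists of a seven-node task-DAG resembling Figure 1.
# The edge E -> D is an assumption; the other edges follow the text.
CHILDREN = {
    "A": ["C", "E"],
    "B": ["G"],
    "C": ["G"],
    "E": ["D"],
    "G": ["D"],
    "D": ["F"],
    "F": [],
}


def criticality(children):
    """Bottom-up rule: criticality = 1 + max criticality of the children."""
    values = {}

    def visit(node):
        if node not in values:
            values[node] = 1 + max((visit(c) for c in children[node]), default=0)
        return values[node]

    for node in children:
        visit(node)
    return values


def critical_path(children, values):
    """Start at the most critical node and follow the most critical child."""
    node = max(values, key=values.get)
    path = [node]
    while children[node]:
        node = max(children[node], key=values.get)
        path.append(node)
    return path


values = criticality(CHILDREN)
path = critical_path(CHILDREN, values)
parallelism = len(CHILDREN) / len(path)  # total tasks / critical tasks
print(values["A"], path, parallelism)
```

Under these assumed edges, task A receives criticality 5, the critical path is A, C, G, D, F, and the parallelism is 7/5 = 1.4, in line with the values quoted in the text.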
3 Scheduling for Heterogeneous Systems
We begin this section by introducing the main concepts of the
XiTAO runtime. We then present the performance trace table (PTT),
a data structure that implements an online model of the execution
time of each task type. Based on the PTT data structure, we then
describe our implementation of the performance-based scheduler.
3.1 The XiTAO library
XiTAO is a novel runtime for executing mixed-mode computations
in which the individual tasks of a task-DAG are themselves parallel
computations. These parallel computations are usually data-parallel
computations, but any sort of parallel structure is possible. In
XiTAO, the individual task is a Task Assembly Object (TAO) and the
task-DAG is called the TAO-DAG. For generality, we use the terms
task and TAO interchangeably in this paper. A TAO type contains
a concurrent computation, an internal scheduler, and a resource
width. At runtime, the TAO type is instantiated by providing input
arguments to its functionality. The resource width denotes how
many cores are used to execute a task. The resource width must
be a natural divisor of the number of available logical cores in a
particular core-cluster, e.g. a NUMA domain. The leader
core of a resource partition is its logical core with the smallest id.
Resource partitions assigned to TAOs are composed of consecutive
core ids. At execution
time, the runtime pins the logical threads to physical cores so that
consecutive thread IDs map to cores sharing the same last level
cache.
The XiTAO scheduler implements two queues for each core: a
work stealing queue (WSQ) and a FIFO assembly queue (AQ). The
WSQs store the ready tasks and use random work stealing as a
policy for load balancing. When a ready TAO is fetched from the
WSQ and its resource width is determined, pointers to the TAO
are inserted into all AQs representing the resource partition
of the TAO. Subsequently, each core asynchronously fetches these
pointers and executes the TAO. Note that the resource partitions
are irrevocable once assigned. Once the TAO pointers have been
inserted into the AQs, the TAO must necessarily execute in the
selected set of resources. In other words, all scheduling decisions
must happen before the TAO is inserted in the AQs. More details
about the theory and implementation of XiTAO can be found in [12].
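The hand-off from a work stealing queue to the assembly queues can be pictured with a small Python sketch. It is not XiTAO code: the function names are hypothetical, the queues are plain deques, and it assumes that partitions are aligned so that the leader is the smallest core id that is a multiple of the width.

```python
from collections import deque


def partition(core, width):
    """Cores of the aligned partition of size `width` that contains `core`."""
    leader = (core // width) * width  # smallest core id in the partition
    return list(range(leader, leader + width))


def dispatch(tao, core, width, assembly_queues):
    """Insert a reference to `tao` into every AQ of its resource partition."""
    cores = partition(core, width)
    for c in cores:
        assembly_queues[c].append(tao)
    return cores


# Four cores, each with a work stealing queue (WSQ) and a FIFO assembly queue (AQ).
wsq = [deque() for _ in range(4)]
aq = [deque() for _ in range(4)]

wsq[0].append("C")            # a ready task waits in the WSQ of core 0
tao = wsq[0].pop()            # core 0 fetches it; WSQ ordering is not modeled here
dispatch(tao, core=0, width=4, assembly_queues=aq)

# Each core later pops its own AQ in FIFO order and joins the execution.
executed = [q.popleft() for q in aq]
```

Once `dispatch` has run, the task sits in the assembly queues of all four cores and can no longer be moved, which mirrors the irrevocable assignment described above.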
3.2 Performance Trace Table
Typical scheduling implementations surveyed in Section 6 assume
prior knowledge of task loads. However, that is not applicable in
our case where the runtime has no prior knowledge of the core type.
In a heterogeneous platform, this work considers the case in which
no assumption is made on which type of cores are faster and which
kinds of tasks are more suitable for the different core types.
Therefore, to be able to intelligently distribute tasks to the corresponding
core type and to dynamically affect the scheduling decisions based
on the available resources, we introduce a performance tracer of
tasks at runtime and a table (PTT) to model task execution times.
The table provides an online model of the execution time for each
valid combination of leader core and resource width, (core id, resource
width). The left part of Figure 2 shows all the scenarios of resource
width when the total number of cores seen by the runtime is four.
In this case, the resource width can be 1, 2 or 4.
Figure 2. Example of a PTT with four cores. The resource width
can be 1, 2, or 4. [Left: cores 0 to 3 grouped into partitions of
width 1, 2 and 4. Right: table with rows Core 0 to Core 3 and
columns Width = 1, Width = 2, Width = 4.]
The PTT is implemented in the XiTAO runtime. It is
organized as shown in the right of Figure 2. The size of the table is
$\text{core number} \times \text{resource width number}$. The fields of the table are
initialized to 0, which models a zero execution time. This ensures that all
configuration pairs will eventually be visited and trained at runtime.
Due to the decentralized implementation of the scheduler, the table
is organized to fit into cache lines where each core only accesses
one cache line indexed with the core number, hence avoiding false
sharing. For each entry, the measured execution time of the pair is
temporarily stored. Then the entry is updated with a weighted average
in a ratio of 1:4, so that the old execution time of the entry
contributes 80% and the new time contributes 20%. That is,
$\text{updated value} = \frac{(4 \times \text{old value}) + \text{new value}}{5}$.
The performance trace table is always updated by the leader core
of a task. This simplifies the implementation and reduces cache
migrations. This also means that every core can have a model value
of the task with width = 1 but only every fourth core will have
a model value of the task with width = 4. As shown in Figure 2,
in the case of width=1 (purple field), each core is the leader for
its own partition. For the case width=2 (orange), the leading core
0 (2) handles the resource partition containing cores 0 (2) and 1
(3). Restricting PTT updates to the leader core may skew the model,
since the leader may not provide
the most accurate record of the execution time of that TAO, i.e.
it could have had the least or the most amount of work. This is
because the workers enter and exit the execution of the TAO
asynchronously [12]. However, the weighted average that is used
to update PTT values ensures that the impact of imbalance will
be limited. Although averaging results in an additional read of the
table, the table size is small and it is very important to be resilient to
divergent measurements as this table is the key input of scheduling
decisions. While the impact of this feature requires a detailed study,
our experiments so far do not provide evidence that the potential
imbalance is problematic.
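A minimal Python sketch of such a table, for a single TAO type, may help to make the update rule concrete. The class and method names are hypothetical, the cache-line layout is not modeled, and the leader check assumes aligned partitions (leader id is a multiple of the width), which matches the width=2 example above.

```python
class PerformanceTraceTable:
    """Toy PTT for one TAO type: one row per core, one column per width."""

    def __init__(self, n_cores):
        # Valid widths are the natural divisors of the number of cores.
        self.widths = [w for w in range(1, n_cores + 1) if n_cores % w == 0]
        # All fields start at 0, so untrained pairs look cheapest and get visited.
        self.table = [{w: 0.0 for w in self.widths} for _ in range(n_cores)]

    def update(self, leader, width, new_time):
        """Leader-only update: 80% old value, 20% new measurement."""
        if leader % width != 0:
            raise ValueError("only the leader core of a partition updates the PTT")
        old = self.table[leader][width]
        self.table[leader][width] = (4 * old + new_time) / 5
        return self.table[leader][width]


ptt = PerformanceTraceTable(n_cores=4)
first = ptt.update(leader=0, width=2, new_time=10.0)   # (4*0 + 10) / 5
second = ptt.update(leader=0, width=2, new_time=10.0)  # (4*2 + 10) / 5
```

Applying the formula literally from the initial value 0, two identical measurements of 10.0 give 2.0 and then 3.6, so an entry approaches the measured time gradually and a single outlier moves it by at most one fifth of the difference.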
This implementation of tracing execution history requires as
little information as the number of cores and their distribution
into core-clusters with shared caches. The cores simply update the
corresponding index, independent of their resource type, and thus
create a model of performance. In other words, no matter what
core types a platform has, be they big or little, the performance
of these cores will be reflected by the PTT values. This is beneficial
not only for portability and potentially functional heterogeneity,
but also for temporally added heterogeneity such as DVFS caused
by heat variations, or interference caused by other tasks and/or
uncontrollable system activities such as background processes or
interrupts.
Figure 3. The scheme of the performance-based scheduler. It includes
seven tasks, distributed into different work stealing queues with
the same resource width 1. [Panels (a) to (f) show the Work Stealing
Queues and Assembly Queues of cores 0 to 3 at successive scheduling
steps, together with the Performance Trace Table.]
3.3 Performance-based Scheduler
Based on the implementation of the performance trace table, we
develop a heterogeneity-unaware scheduler with the goal of
optimizing performance. To achieve this, we follow the basic strategy
of CATS [6] and extend it with our heterogeneity-unaware
methodology. This scheduler is named the performance-based scheduler. The
main feature of the performance-based scheduler is the ability to
find the optimal cores and resource width using the values
obtained by globally searching the PTT.
The operation of the performance-based scheduler follows four
steps. Figure 3 is an example of the scheduler implementation based
on the task-DAG shown in Figure 1. Firstly, we fetch the ready tasks
from our application task-DAG. These tasks are then inserted into
a particular work stealing queue according to the default policy or
to programmer annotations. When a task reaches the head of the
WSQ, it is permitted to be executed locally or randomly stolen. The
task’s priority is checked to detect whether the task belongs to the
critical path and, thereafter, to decide the execution policy. Note that
the criticality of all the tasks of a task-DAG decides the priority.
If a task is critical, e.g. task A, C, G, D or F, we globally search
the performance trace table to find the optimal pair of core and
resource width for such a task. Global search means that all the
entries of the PTT of the particular TAO type are checked to find the
value that globally minimizes $\text{exec time} \times \text{resource width}$. The goal
of this operation is to find the pair of core and resource width that
globally minimizes the system’s occupation of resources,
understood as the product of resources and execution time. Alternative
optimization strategies are also possible. For example, a system
trying to minimize the energy consumption would instead find the
best pair that minimizes energy per task.
The overhead of the global search operation is low since the
number of entries in the PTT is only $2 \times N - 1$ for each NUMA node
consisting of N cores. In this way, we guarantee that all the critical
tasks are executed on faster cores, obtaining better performance
and minimizing interference. For non-critical tasks, however, only
the appropriate resource width for the corresponding core is
determined from the PTT when the TAO is scheduled for execution.
The goal of this policy is to reduce interference across non-critical
tasks. In summary, critical tasks search the PTT globally to
improve performance and reduce interference, while non-critical tasks
just search the current core’s entries in the PTT with the goal of
avoiding interference.
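The two search policies can be sketched in Python as follows. This is a simplified model, not the XiTAO implementation: the PTT is a dictionary keyed by (leader, width), the execution times are made up, and two points are assumptions of this sketch, namely that the local search looks up the aligned partition containing the current core for each width, and that it uses the same cost metric as the global search.

```python
def valid_pairs(n_cores):
    """All (leader, width) pairs: width divides n_cores, leaders are aligned."""
    widths = [w for w in range(1, n_cores + 1) if n_cores % w == 0]
    return [(leader, w) for w in widths for leader in range(0, n_cores, w)]


def cost(ptt, pair):
    """Resource occupation: execution time x resource width."""
    _, width = pair
    return ptt[pair] * width


def global_search(ptt):
    """Critical tasks: best (leader, width) pair over the whole table."""
    return min(ptt, key=lambda pair: cost(ptt, pair))


def local_search(ptt, core):
    """Non-critical tasks: best width among the partitions containing `core`."""
    widths = sorted({w for _, w in ptt})
    candidates = [((core // w) * w, w) for w in widths]
    return min(candidates, key=lambda pair: cost(ptt, pair))


# Made-up execution times for one TAO type on a four-core node.
times = [4.0, 4.2, 9.0, 8.0, 1.5, 5.0, 1.0]
ptt = dict(zip(valid_pairs(4), times))

best_critical = global_search(ptt)          # scans all 2*4 - 1 = 7 entries
best_for_core3 = local_search(ptt, core=3)  # scans one entry per width
```

With these numbers the global search selects (0, 2), whose cost is 1.5 × 2 = 3.0, while a non-critical task on core 3 only compares (3, 1), (2, 2) and (0, 4) and selects (0, 4). Untrained entries hold 0 and therefore win any comparison, which is how every pair eventually gets visited.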
Initial tasks have no parents and therefore it is not possible to
determine their criticality. In the current implementation they are
treated as non-critical tasks: the PTT is not globally searched, but
we still try to find a good resource width. For example, we can see
from Figure 1 that tasks A and B are initial tasks running on
threads 0 and 1. They are scheduled to cores 0 and 1 with resource
width 1, respectively, according to the best PTT resource width.
At this point the two tasks are inserted into the corresponding
assembly queues, as shown in Figure 3 (b).
Aer a task nishes, the commit-and-wake-up stage will check if
any dependent tasks are critical. If the child task’s criticality is less
than the parent’s by 1, then the task is determined to be critical. For
example, in Figure 1, task C is the child task with criticality value 4
of task A with criticality value 5. To this end, aer completing the
execution of task A, core 0 could wake up another task which is
dependent on task A. In Figure 3 (c), task C and task E, which are
the children of task A, are woke up sequentially.
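The wake-up check can be sketched in Python as below. The helper names are hypothetical, and the criticality values and the edge E → D are assumptions chosen to be consistent with the text (A = 5, C = 4, critical path of length 5); the sketch only tracks pending parents and applies the "difference of 1" rule.

```python
# Assumed task-DAG and criticality values, consistent with the text's example.
CHILDREN = {"A": ["C", "E"], "B": ["G"], "C": ["G"], "E": ["D"],
            "G": ["D"], "D": ["F"], "F": []}
CRITICALITY = {"A": 5, "B": 4, "C": 4, "E": 3, "G": 3, "D": 2, "F": 1}


def count_pending_parents(children):
    """Number of unfinished parents per task."""
    pending = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            pending[kid] += 1
    return pending


def commit_and_wake_up(done, children, crit, pending):
    """Return (child, is_critical) for every child that became ready."""
    woken = []
    for child in children[done]:
        pending[child] -= 1
        if pending[child] == 0:
            # Critical if the child's criticality is the parent's minus 1.
            woken.append((child, crit[done] - crit[child] == 1))
    return woken


pending = count_pending_parents(CHILDREN)
after_a = commit_and_wake_up("A", CHILDREN, CRITICALITY, pending)
after_b = commit_and_wake_up("B", CHILDREN, CRITICALITY, pending)
after_c = commit_and_wake_up("C", CHILDREN, CRITICALITY, pending)
```

In this model, finishing A wakes C as critical and E as non-critical, finishing B wakes nothing because G still waits for C, and finishing C wakes G as critical.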
However, if there are no ready tasks in the work stealing queue,
the core will try to steal a task from other cores’ work stealing
queues. Since the work stealing queues of cores 1, 2 and 3 are
empty after completing task B, they can steal tasks from other
work stealing queues. For instance, cores 1 and 3 steal tasks C and E
from the work stealing queue of core 0, respectively. As a critical
task, task C globally searches the performance trace table and finds
the optimal pair (0,4), i.e. the leader core is 0 and the resource width
is 4, and then this task is distributed into the assembly queues of
cores 0 to 3. Task E is a non-critical task, so it searches only
the entries for core 3 in the performance trace table for an optimal
width. As resource width=1 is the best choice for core 3, task
E is distributed into the assembly queue of core 3, as Figure 3 (c)
shows. Since its parent tasks B and C have completed, task G is
woken up by core 1. By searching the performance trace table, it
finds that the pair (0,4) is the best one, and core 1 then distributes
the task into the assembly queues of cores 0 to 3 (see Figure 3 (d)). Then,
as shown in Figure 3 (e), the critical task D is woken up by core
1 but stolen by core 2, and the optimal pair for it is determined to
be (2,2). Core 2 completes task D and then wakes up its final
child, the critical task F, with the best configuration (0,4) in the
performance trace table, as shown in Figure 3 (f).
4 Experimental Setup
4.1 Evaluation Platforms
The benchmarks herein are evaluated on two platforms. From the
static heterogeneous family, we use an NVIDIA Jetson TX2
development board, featuring a dual-core NVIDIA Denver 2 64-bit CPU, a
quad-core ARM A57 Complex (each with 2 MB L2 cache) and an
NVIDIA Pascal Architecture GPU with 256 CUDA cores. Both the
Denver 2 and the A57 cores implement the ARMv8 64-bit
instruction set and are cache coherent. For the purpose of this work, we
consider only the two types of ARMv8 cores, and leave GPU scheduling
as future work. On the homogeneous side, an Intel 2650v3
(codenamed "Haswell") based platform is used to evaluate the effect
of interference while scheduling Random DAGs, and to evaluate
the behavior of XiTAO when executing the Image Classification
network (VGG-16) [16].
4.2 Random Directed Acyclic Graph
4.2.1 Kernels
We generate random DAGs to evaluate the properties of the
PTT-enhanced scheduler. The random DAGs are based on a mix of
different kernel types. When selecting the kernels, the priority is to
achieve different characteristics in terms of memory-intensiveness
(streaming), cache-intensiveness (i.e. data reuse) and
compute-intensiveness. The following three kernels are selected for this
purpose.
Matrix Multiplication. A matrix multiplication kernel is
created for the compute-intensive property. We implement a matrix
multiplication that achieves parallelism by ensuring that the
writing of output data is done to separate cache lines for each thread
while still sharing the input data.
Sort. For the data reuse property, a quick sort and merge sort
kernel combination is selected. This kernel first splits the input
array into chunks and performs in-place sorting with quick sort
before carrying out two levels of merge sort, effectively reusing the
data within the kernel. This kernel has a maximum parallelism of
four.
Copy. Finally, a copy kernel handling large inputs is
implemented for the streaming property. This kernel reads and writes
large portions of data to memory, effectively creating a
streaming behavior where the kernel has to access the main memory
continuously. Each core copies a subset of the data.
For each kernel, we select the appropriate working set size
corresponding to the desired behavior. For the matrix multiplication
kernel, we choose a 64 × 64 matrix. For the sort kernel, we choose
a 262KB input array, taking up a total space of 524KB due to double
buffering, effectively fitting it in the L2 caches. Finally, the copy
kernel uses a 16.8MB array, taking up a total space of 33.6MB, which
is much larger than the space of the L2 caches.
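The structure of the sort kernel (four chunks sorted independently, then two levels of merging) can be sketched sequentially in Python. This is only a structural illustration with hypothetical names: Python's built-in `sorted` stands in for the in-place quick sort, `heapq.merge` performs the merges, and no threads are used.

```python
import heapq


def sort_kernel(data, n_chunks=4):
    """Sort `data` by chunk-wise sorting followed by pairwise merge levels."""
    if not data:
        return []
    size = -(-len(data) // n_chunks)  # ceiling division
    # Step 1: sort each chunk independently (one chunk per core in a parallel run).
    chunks = [sorted(data[i:i + size]) for i in range(0, len(data), size)]
    # Step 2: merge neighbouring chunks; four chunks need two merge levels.
    while len(chunks) > 1:
        chunks = [list(heapq.merge(*chunks[i:i + 2]))
                  for i in range(0, len(chunks), 2)]
    return chunks[0]


result = sort_kernel([5, 3, 8, 1, 9, 2, 7, 4])
```

The first step offers up to four independent pieces of work and each merge level halves that number, which is consistent with the maximum parallelism of four stated for this kernel.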
4.2.2 DAG construction
To properly evaluate the performance of our scheduler, randomized
DAGs composed of random selections of these three kernels are
implemented. By tuning the parameters, it is possible to achieve
different degrees of average parallelism and thus generate different
scheduling scenarios.
To generate a suitable randomized DAG, a set of configuration
parameters is used, similar to the generation of DAGs by Topcuoglu
et al. [17]. The first parameter is the number of tasks of each kernel.
This is useful to choose which kernel should be most prominent in
the DAG. The second parameter is the average width of the DAG.
This is used to obtain the desired level of parallelism. The last
configuration parameter, the edge rate parameter, determines the
average number of connected edges a task has, which also affects
the parallelism of the DAG. A seed value controls
the randomization so that a given DAG can be recreated several times for
comparison.
The DAG generation algorithm produces a DAG in three steps.
The first step generates the shape (nodes and edges) of the DAG.
This step is separated from the TAO creation step in order to get
proper memory utilization and data reuse (see below). The second
step consists of allocating memory and deciding which tasks are
reusing data. To achieve data reuse between nodes, we maintain a
vector for every kernel where each index in the vector represents a
memory location. Initially, the size of the vector is zero. For every
node, we search its predecessors for a node number matching any
of the numbers in the vector. If a matching number is found, it is
replaced by our current node number and the index to the location
is saved in the node. If we cannot find a match, a new entry is
created in the vector with the unique node number and that index
is saved in the node instead. The size of the vectors is then used for
allocation of memory and each node will have a designated data
location. The memory is allocated this way to maximize data reuse
between tasks of the same kernel while guaranteeing isolated data
execution when tasks are run in parallel. The final step is to traverse
the nodes and spawn the corresponding tasks and edges between
them, thus effectively creating the DAG in the XiTAO format.
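The second step, the assignment of memory locations, can be sketched in Python as follows. The function and variable names are hypothetical, nodes are assumed to be visited in topological order, and the five-node example DAG is made up for illustration.

```python
def assign_memory_slots(nodes, preds, kernel_of):
    """Give each node a slot index in the per-kernel vector of its kernel.

    nodes     -- node numbers in topological order
    preds     -- dict: node -> list of predecessor nodes
    kernel_of -- dict: node -> kernel name
    """
    owners = {}   # kernel -> vector; owners[k][i] is the node holding slot i
    slot_of = {}
    for node in nodes:
        vector = owners.setdefault(kernel_of[node], [])
        for pred in preds[node]:
            if pred in vector:            # reuse the slot of a same-kernel parent
                index = vector.index(pred)
                vector[index] = node
                break
        else:                             # no match: create a new slot
            index = len(vector)
            vector.append(node)
        slot_of[node] = index
    sizes = {kernel: len(vector) for kernel, vector in owners.items()}
    return slot_of, sizes


# Made-up DAG: nodes 3 and 4 are both children of node 0 and may run in parallel.
kernel_of = {0: "matmul", 1: "matmul", 2: "sort", 3: "matmul", 4: "matmul"}
preds = {0: [], 1: [], 2: [0], 3: [0, 2], 4: [0]}
slot_of, sizes = assign_memory_slots([0, 1, 2, 3, 4], preds, kernel_of)
```

In this example node 3 inherits the slot of its parent 0, while node 4, which may run concurrently with node 3, receives a fresh slot. The resulting vector sizes (three slots for matmul, one for sort) give the amount of memory to allocate per kernel.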
4.3 Image Classification
To demonstrate the behavior of the performance-based scheduler in
XiTAO, we port the VGG-16 [16] image classification model from
the Darknet framework [14]. The application uses a 16-layer
deep convolutional neural network (CNN) to classify an image
using a pre-trained model. Each convolutional (CONV) and
fully-connected (FC) layer implements GEneral Matrix Multiply (GEMM),
which takes most of the computation time. Figure 4 shows the XiTAO
implementation of VGG-16. In VGG-16, the input size varies as the
network progresses. For example, the convolutional layers iterate
over a minimum of 64 channels to a maximum of 512 channels.
Therefore, in the XiTAO implementation we partition the work among
TAOs. The number of TAOs in each layer depends on the number
of channels and the block length. The parameter block length refers to
the number of channels assigned to each TAO, which is tuned at
runtime. Each TAO performs a parallel GEMM with the number of
threads equal to the width of the TAO. Note that the width is
dynamically determined by the XiTAO scheduler. Since there are no
loop-carried dependencies inside a layer, we benefit from two levels
of parallelism in the XiTAO implementation. However, each layer
depends on the previous layer; we therefore synchronize all
TAOs at the end of each layer.
Figure 4. Architecture of VGG-16. Here CONVX-Y represents the X-D
filter and Y channels of a convolutional layer, respectively. [The
figure lists the layers CONV3-64, CONV3-64, CONV3-128, CONV3-128,
FC-4096, FC-4096 and FC-1000, each implemented as a GEMM that XiTAO
splits into TAO 0 to TAO N.]
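The partitioning of a layer's channels into TAOs can be illustrated with a short Python sketch. The function name and the block lengths are hypothetical; the sketch assumes that the channels are split into contiguous blocks and that the last TAO takes the remainder when the block length does not divide the channel count.

```python
def partition_channels(channels, block_length):
    """Split `channels` into contiguous [start, end) blocks, one per TAO."""
    n_taos = -(-channels // block_length)  # ceiling division
    return [(i * block_length, min((i + 1) * block_length, channels))
            for i in range(n_taos)]


# Layers with 64 and 512 channels, using made-up block lengths.
small_layer = partition_channels(64, 16)    # 4 TAOs of 16 channels each
large_layer = partition_channels(512, 96)   # 6 TAOs, the last one is shorter
```

A smaller block length produces more TAOs per layer and thus more parallelism across TAOs, while the resource width chosen by the scheduler sets the parallelism inside each TAO. Since all TAOs of a layer are synchronized at its end, the number of TAOs per layer also bounds how much work can overlap.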
5 Performance Evaluation
The outline of the evaluation is as follows: we first analyze the
impact of using the performance-based scheduler versus the
homogeneous scheduler (i.e. the base random work stealing algorithm as
implemented in XiTAO [2]), which is unaware of both the hardware
and the ongoing performance state modeled by the PTT. We
study Random DAGs consisting of a mixture of kernels (MatMul,
Sort, Copy) to cover the spectrum of real DAGs as much as possible.
We also port the Darknet VGG-16 code to XiTAO and compare
it to the base CPU implementation to assess the enhancements
on Convolutional Neural Networks, which are of high relevance to
contemporary applications.
5.1 Comparison with Homogeneous Scheduler
Figure 5 shows a heatmap for each of the described schedulers executing
between 250 and 4000 tasks (X-axis) on random DAGs with a
parallelism between 1 and 16 (Y-axis). The underlying random DAG is a
combination of the aforementioned kernels in equal
proportions. In the most challenging case with low task count and
parallelism (tasks=250, par=1), we observe that the temperature
(i.e. the throughput) of
the performance-based scheduler (depicted in Figure 5(a)) is at
least twice as high. The additional ingredient of scheduling
critical tasks on the high performing cores (mainly Denver cores in
this case) and the ability to dynamically tune the resource width
render this scheduler superior even with no external task-DAG
parallelism. The throughput is higher across the table except for a
few cases of very high parallelism that pose almost no challenge for
scheduling decisions. Another interesting yet expected observation
is that the number of tasks plays a negligible role in the
performance of the homogeneous scheduler, whereas the throughput of
the performance-based scheduler depends on both axes. A twofold
increase in the
number of tasks provides twice the amount of PTT training data.
This directly reflects on the performance by improving the quality
of the dynamic, PTT-based choices. In addition, a higher degree
of parallelism (on the Y-axis) permits a better utilization of the
resources.
Figure 5. The performance impact of parallelism and number
of TAOs, and the performance comparison between the
performance-based scheduler and the homogeneous scheduler. [Panels: (a)
Performance-based Scheduler, (b) Homogeneous Scheduler. X-axis: Task
Number (250 to 4000). Y-axis: Parallelism (1 to 16). Color scale:
Throughput (TAOs/s), 500 to 1500.]
5.2 Performance Impact of Kernels
As highlighted before, we select three different kernels (matrix
multiplication, sort, and copy) with different characteristics. It is
therefore important to evaluate the performance impact of these
kernels and their mixture while varying parallelism. Figure 6
compares the throughput of the performance-based scheduler and the
homogeneous scheduler on the Jetson TX2 platform for various degrees
of parallelism. Besides the higher throughput achieved by the
performance-based scheduler, especially for lower parallelism, it
exhibits greater stability across the X-axis. This is an essential
attribute, which suggests that the hardware is being efficiently
utilized and that the scheduler is less sensitive to parallelism
constraints. For a concrete quantification of performance gains,
Figure 7 shows the speedup achieved using the performance-based
scheduler over the homogeneous scheduler. In this case, we use 4000
tasks for each kernel (i.e., matrix multiplication, sort and copy),
and a random DAG that contains a mixture with an even number of tasks
per kernel, summing up to 4000. We can see that our performance-based
scheduler generally runs faster than the homogeneous scheduler.
Specifically, it has a significant speedup when the parallelism is
low. For a parallelism of 1, it achieves 3.3× the throughput of the
homogeneous scheduler in matrix multiplication. The speedups for
sort, copy and the mixture of the three kernels are 2.5×, 2.2× and
2.7× for the same parallelism, respectively. When the parallelism
increases, the speedup of the performance-based scheduler over the
homogeneous scheduler decreases, but it still performs better than
the homogeneous scheduler. For scenarios of high parallelism,
criticality-aware scheduling has little impact on performance.
Instead, in such scenarios it is important to avoid interference due
to resource oversubscription. These results highlight how the PTT can
be used to select an appropriate resource width to avoid
oversubscription. This is particularly visible for the case of the
sort kernel, which relies on good usage of cache capacity. Scheduling
too many sort kernels in parallel leads to oversubscription and
performance degradation. Classical heterogeneity-aware schedulers
such as HEFT [17] or CATS [6] are not able to address such scenarios.
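To illustrate how a latency table can steer the choice of resource
width, the following minimal Python sketch picks, for a given leader
core, the width with the lowest latency × width product. The table
layout, the name pick_width and the cost metric are assumptions made
for this illustration; they are not taken from the XiTAO code base.

```python
from typing import Dict, Tuple

# Hypothetical PTT layout: (leader_core, width) -> last observed latency in ms.
PTT = Dict[Tuple[int, int], float]


def pick_width(ptt: PTT, leader_core: int, widths=(1, 2, 4)) -> int:
    """Return the width with the lowest occupancy cost (latency * width).

    Entries that have not been measured yet (missing or 0.0) are returned
    immediately so that the table gets bootstrapped. Both the cost metric
    and the bootstrapping rule are assumptions of this sketch.
    """
    best_width, best_cost = widths[0], float("inf")
    for w in widths:
        latency = ptt.get((leader_core, w), 0.0)
        if latency == 0.0:
            return w  # unexplored configuration: sample it first
        cost = latency * w
        if cost < best_cost:
            best_width, best_cost = w, cost
    return best_width


# Cache-sensitive kernel under contention: narrow instances are slow.
contended = {(0, 1): 30.0, (0, 2): 16.0, (0, 4): 6.0}
# Kernel that scales poorly: extra cores barely reduce the latency.
poor_scaling = {(0, 1): 10.0, (0, 2): 6.0, (0, 4): 5.0}

wide_choice = pick_width(contended, leader_core=0)
narrow_choice = pick_width(poor_scaling, leader_core=0)
```

With the hypothetical latencies above, the contended kernel is given
width 4 (cost 24 versus 30 and 32), so fewer instances run side by
side, while the poorly scaling kernel stays at width 1.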
5.3 Process Interference
One of the remarkable advantages of using the PTT is maximizing
performance by minimizing the side effects of interference. This
feature is especially important since it is highly anticipated that
user or kernel level resources are shared. Figure 8(a) depicts the
response of the XiTAO performance-based scheduler to running a
background parallel process, in this case a chain of MatMul DAGs,
alongside a highly parallel random DAG. The black dots represent
the time-stamps at which the threads start executing TAOs. A vertical
green line shows the resource partition used to execute the TAO.
While bootstrapping the PTT, a few width choices are attempted.
At the point of interference (i.e. in Figure 8(a)), we show the PTT
value at (width=1,core=1). Other relevant values are dropped for
brevity. Due to the jitter in PTT values, the scheduler automatically
selects cores from (2-9) for executing the critical tasks. Cores
(0-1) are still selected under typical circumstances according to
Figure 8(b). Shortly after the interference event, the scheduler
recovers to normal operation, yielding a marginal wall time difference
across the two experiments. Note that non-critical tasks continue to
be executed on cores with interference, as long as these cores succeed
in stealing tasks. This is important so that the PTT is continuously
updated to reflect the status of the system.
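The behaviour described above can be mimicked with a small Python
sketch: a latency entry is refreshed after every TAO, and critical
tasks are sent to the core with the lowest tracked latency. The 4:1
weighting, the latency values and the function names are assumptions
chosen for illustration, not the update rule of the XiTAO runtime.

```python
from typing import List


def update_ptt(old_ms: float, measured_ms: float, weight: int = 4) -> float:
    """Blend a new latency sample into a PTT entry.

    The 4:1 weighting of history versus the new sample is an assumption
    chosen for this sketch.
    """
    if old_ms == 0.0:  # untrained entry: adopt the first sample as-is
        return measured_ms
    return (weight * old_ms + measured_ms) / (weight + 1)


def fastest_core(ptt_width1: List[float]) -> int:
    """Core with the lowest tracked width-1 latency (lowest id wins ties)."""
    return min(range(len(ptt_width1)), key=lambda core: ptt_width1[core])


# Hypothetical steady state on 10 cores: cores 0-1 are slightly faster.
ptt = [8.0, 8.0] + [9.0] * 8
before = fastest_core(ptt)

# A background process slows cores 0-1; TAOs there now take 20 ms.
for core in (0, 1):
    ptt[core] = update_ptt(ptt[core], 20.0)
during = fastest_core(ptt)

# Interference ends; stolen non-critical TAOs keep refreshing cores 0-1.
for _ in range(4):
    for core in (0, 1):
        ptt[core] = update_ptt(ptt[core], 8.0)
after = fastest_core(ptt)
```

In this toy run the critical tasks start on core 0, move to core 2
once the entries of cores 0-1 rise above those of the other cores, and
return to core 0 after four undisturbed samples. The recovery only
happens because the disturbed cores keep executing (non-critical)
TAOs, which is the reason given above for letting them continue to
steal work.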
[Figure 6 plots: Parallelism (1, 2, 4, 8, 16) on the X-axis, Throughput [TAOs/s] (0 to 4000) on the Y-axis; series: MM, Sort, Copy, Mixture. Panels: (a) Performance-based Scheduler, (b) Homogeneous Scheduler.]
Figure 6. The performance impact of parallelism and kernels, and the
performance comparison between the performance-based scheduler and
the homogeneous scheduler.
[Figure 7 plot: Parallelism (1, 2, 4, 8, 16) on the X-axis, Speedup (0 to 3) on the Y-axis; series: MM, Sort, Copy, Mixture.]
Figure 7. The performance speedup of the performance-based scheduler
over the homogeneous scheduler for different degrees of parallelism.
The number of tasks is the same as in Figure 6.
5.4 ImageNET Classification
Figure 9 depicts a strong scalability study of the performance of
the XiTAO version of the VGG-16 code for predicting a predefined
image class by multiple convolutions of a crop layer (1024 x 1024)
converted to matrix (512 x 512 x 3). The study assesses the
scheduling performance of the conventional fork-join application
class with minimal effort. It is carried out on a dual-socket Intel
Haswell platform. It is worth noting that XiTAO reorders threads to
ensure data locality, since the core ids are not always laid out
contiguously, as is the case in both this platform and the Jetson TX2
platform. The scheduler still exhibits 0.69 parallel efficiency
compared to the serial performance, even though there is no notion of
criticality in this experiment, i.e., all tasks are marked
non-critical. Figure 10 shows the percentage of TAOs scheduled with
each width. During execution, the PTT is used to choose the best
width to schedule a TAO. For example, in the case of running VGG-16
with 8 threads, 67% of TAOs are scheduled with width = 1 and 30% of
TAOs are scheduled with width = 8, indicating that these widths lead
to the best speedup. The approximately linear speedup shown in
Figure 9, combined with the PTT-assisted choices of widths in
Figure 10, demonstrates how negligible the resource tuning overhead
is, especially when compared to the potential performance gains
reported in this paper.
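The 0.69 parallel efficiency quoted above follows directly from the
execution times printed on the bars of Figure 9. The short Python
sketch below recomputes speedup and efficiency from those values; it
assumes the bar labels use a decimal comma (40,23 read as 40.23 s) and
treats the serial run as the one-thread baseline.

```python
# Wall-clock times in seconds, read from the bar labels of Figure 9.
# Key 1 stands for the serial run used as the baseline.
times = {1: 40.23, 2: 21.24, 4: 11.21, 8: 6.63, 16: 3.62}


def speedup(threads: int) -> float:
    """Serial time divided by the time measured with `threads` threads."""
    return times[1] / times[threads]


def efficiency(threads: int) -> float:
    """Speedup normalised by the number of threads."""
    return speedup(threads) / threads


for p in sorted(times):
    print(f"{p:2d} threads: speedup {speedup(p):5.2f}, "
          f"efficiency {efficiency(p):.2f}")
```

For 16 threads this gives a speedup of about 11.1 and an efficiency of
0.69, matching the figure stated in the text.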
6 Related Work
6.1 Scheduling in Heterogeneous Environments
Task scheduling on a heterogeneous platform, contrary to a homo-
geneous platform, includes the problem of assigning the appropriate
tasks to the most suitable cores. Most multicore scheduling ap-
proaches today assume equal performance. For example, dynamic
scheduling techniques such as work-stealing or work-sharing do
not consider the individual performance of cores.
Scheduling DAGs on heterogeneous multicores is a well studied
problem in the context of single-threaded task-DAGs [5–7, 10, 17,
18]. These schemes either assign a ranking to each task based
on the critical path and then assign more critical tasks to faster
cores [5–7, 17], or they compute a best fit between tasks and cores
and then schedule appropriately [10, 18]. In this study we extend
ideas introduced in CATS [6]. In the following paragraphs we
describe CATS along with HEFT [17], a classical heterogeneous
scheduler, and Bias Scheduling [10].
HEFT is a static scheduling method for heterogeneous task scheduling
proposed by Topcuoglu et al. [17]. The HEFT algorithm consists
of ranking the tasks of a DAG in order of longest path to
finish and then assigning the highest-ranking tasks to the core that
will minimize the overall finish time. An analysis of the DAG is
done to calculate the execution time and communication cost of
each node and edge before the tasks can be ranked. The tasks are
then placed in a queue where the scheduler picks the top task and
calculates which core will be able to finish this task earliest using
insertion-based scheduling.
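As a minimal sketch of the ranking phase only, the Python code below
computes the upward rank commonly used for HEFT: the cost of a task
plus the largest (communication cost + rank) over its successors. The
example DAG, its costs and the function name are hypothetical; in HEFT
the per-task cost would be an average over all cores, and the
insertion-based core selection is not shown.

```python
from functools import lru_cache
from typing import Dict, Hashable, Sequence, Tuple


def upward_ranks(
    succ: Dict[Hashable, Sequence[Hashable]],
    cost: Dict[Hashable, float],
    comm: Dict[Tuple[Hashable, Hashable], float],
) -> Dict[Hashable, float]:
    """rank(t) = cost[t] + max over successors s of (comm[(t, s)] + rank(s))."""

    @lru_cache(maxsize=None)
    def rank(task):
        tail = max(
            (comm.get((task, s), 0.0) + rank(s) for s in succ.get(task, ())),
            default=0.0,  # exit tasks have no successors
        )
        return cost[task] + tail

    return {task: rank(task) for task in cost}


# Hypothetical diamond-shaped DAG: A -> {B, C} -> D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
cost = {"A": 2.0, "B": 3.0, "C": 1.0, "D": 2.0}
comm = {("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "D"): 1.0, ("C", "D"): 4.0}

ranks = upward_ranks(succ, cost, comm)
# Tasks are scheduled in order of decreasing rank.
order = sorted(ranks, key=ranks.get, reverse=True)
```

In this example C is ranked above B although it is cheaper to execute,
because the expensive edge from C to D puts it on the longest path to
the exit task.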
CATS is a dynamic scheduling approach where no prior knowledge
about the execution time of the tasks is assumed [6, 7]. Instead,
CATS solely uses the number of successors to find the critical path.
The tasks on the critical path are then put in a critical queue.
Tasks from the critical queue are scheduled on high-performance cores
and tasks from the non-critical queue are scheduled on
lower-performance cores. In [6], Chronaki et al. introduce the dynamic
Heterogeneous Earliest Finish Time (dHEFT) algorithm as a reference to
evaluate CATS. dHEFT uses the same principles as HEFT but, instead of
knowing the load of tasks prior to scheduling, discovers it at
runtime.
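A simplified reading of this idea can be sketched in Python as
follows: each task gets a bottom level (the number of edges on its
longest path to a sink), and the chain that always follows the deepest
successor is placed in the critical queue. The helper names and the
tie-breaking by task name are assumptions of this sketch, not details
of the CATS implementation.

```python
from typing import Dict, List, Sequence, Tuple


def bottom_levels(succ: Dict[str, Sequence[str]]) -> Dict[str, int]:
    """Longest path, counted in edges, from each task to a sink of the DAG."""
    levels: Dict[str, int] = {}

    def level(task: str) -> int:
        if task not in levels:
            levels[task] = max(
                (1 + level(s) for s in succ.get(task, ())), default=0
            )
        return levels[task]

    tasks = set(succ) | {s for ss in succ.values() for s in ss}
    for task in tasks:
        level(task)
    return levels


def split_queues(succ: Dict[str, Sequence[str]]) -> Tuple[List[str], List[str]]:
    """Split tasks into a critical chain and the remaining non-critical tasks."""
    levels = bottom_levels(succ)
    # Start at the deepest task, then keep following the deepest successor.
    current = max(sorted(levels), key=levels.get)
    critical = [current]
    while succ.get(current):
        current = max(sorted(succ[current]), key=levels.get)
        critical.append(current)
    non_critical = sorted(t for t in levels if t not in critical)
    return critical, non_critical


# Hypothetical DAG: A -> {B, C}, B -> D -> E; C is a short side branch.
dag = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"], "E": []}
critical_queue, non_critical_queue = split_queues(dag)
```

No execution times are needed: the split relies only on the shape of
the DAG, which is what makes the approach usable without prior
profiling.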
Finally, Bias Scheduling [10] is a proposed method for single-ISA
heterogeneous multicore processors that tracks how different kinds
of tasks perform on each core. The main idea is to categorize tasks
into two groups: tasks gaining a large speedup by running on a big
core compared to a LITTLE core, and tasks gaining a modest speedup
by running on a big core. The speedup is approximated by accessing
hardware counters for stall cycles. Tasks are then scheduled on
big cores if they provide a large speedup and on LITTLE cores if the
speedup would be modest.
[Figure 8 plots: Elapsed Time [s] (0 to 14) on the X-axis, Thread (0 to 9) on the Y-axis; panel (a) has an additional PTT Value [ms] axis (8 to 20).]
(a) Dynamic migration of processes in response to PTT spikes during
interference.
(b) The scheduler’s behavior when there is no interference.
Figure 8. The effect of interference on PTT scheduling of critical tasks.
[Figure 9 plot: Number of threads on the X-axis, Time [s] (0 to 50) on the Y-axis. Bar labels: Serial 40.23, 2 threads 21.24, 4 threads 11.21, 8 threads 6.63, 16 threads 3.62.]
Figure 9. Performance of CPU GEMM on XiTAO VGG-16 with a
variable number of threads.
[Figure 10 plot: Number of threads (2, 4, 8, 16) on the X-axis, Percentage of TAOs w.r.t. TAO width (0 to 100) on the Y-axis; series: widths 1, 2, 4, 8, 16. Bar labels (%): 2 threads: 69.06, 30.94 (widths 1, 2); 4 threads: 90.89, 5.83, 3.28 (widths 1, 2, 4); 8 threads: 66.67, 3.38, 0.74, 29.21 (widths 1, 2, 4, 8); 16 threads: 53.81, 1.68, 14.76, 29.31, 0.45 (widths 1, 2, 4, 8, 16).]
Figure 10. Percentage of TAOs scheduled with corresponding TAO
width by PTT.
While all these schedulers can improve the execution time of
task-DAGs in which tasks have diverse behaviors, they have a
few limitations. First, none of them is able to avoid resource over-
subscription and provide interference-free execution. And second,
all of them are based on the notion of only two static performance
classes, i.e. big and LITTLE. In practice, sources of heterogeneity
are diverse, hence performance needs to be tracked and modeled
on a per-core basis. Our proposal can model the performance of
all cores and, furthermore, thanks to its reliance on XiTAO it can
exploit this information to provide interference-free scheduling.
6.2 Management of shared resources
One of the main goals of XiTAO is to provide a good solution for
shared resource contention. The focus of XiTAO is on multithreaded
computations. Prior work on resource contention has mostly focused
on multiprogrammed workloads, i.e. multiple single-threaded
workloads running in parallel. Both software-based scheduling [20]
and hardware-based partitioning [9, 13] approaches have been proposed
to address issues related to cache and memory sharing.
Multiprogrammed workloads are less challenging in the sense that the
different tasks do not have any dependencies. Hence, XiTAO addresses
a more problematic case in which scheduling and resource
partitioning decisions can negatively impact the application's
execution time, particularly if bad decisions are taken concerning
the critical path of the application. In the context of structured
parallelism, such as divide-and-conquer, one scheduler that targets
constructive sharing is the parallel depth first (PDF) scheduler [1, 4].
Other runtime systems that support mixed-mode parallelism have
been proposed by Wimmer et al. [19] and by Sbirlea et al. [15].
However, they focus only on restricted forms of parallelism, such
as divide-and-conquer [19] or worksharing constructs [15].
7 Conclusion
We have introduced a performance-based scheduler for heterogeneous
architectures that leverages online monitoring on top of the
XiTAO runtime. The presented scheduler improves task execution
throughput and latency, and it provides interference-free execution
and task migration in the event of process interference. All these
features allow our proposed scheduler to adapt to next generation
heterogeneous systems with shared resources. We evaluated the
performance-based scheduler on random DAGs and compared it to
the homogeneous counterpart.
A future direction of this work is to include GPU kernels as part
of the scheduling strategy, fully utilizing the Jetson TX2 chip by
offloading GEMM kernels of the VGG-16/Random DAGs benchmark
to the underlying Pascal GPU. We also intend to study the
interaction between non-critical tasks and the PTT in terms of
updating resources and locality.
References
[1] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. 1999. Provably Efficient
Scheduling for Languages with Fine-grained Parallelism. J. ACM 46, 2 (March
1999), 281–321. https://doi.org/10.1145/301970.301974
[2] Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling Multithreaded
Computations by Work Stealing. J. ACM 46, 5 (Sept. 1999), 720–748.
[3] Soumen Chakrabarti, James Demmel, and Katherine Yelick. 1997. Models and
Scheduling Algorithms for Mixed Data and Task Parallel Programs. J. Parallel and
Distrib. Comput. 47, 2 (1997), 168–184. https://doi.org/10.1006/jpdc.1997.1413
[4] Shimin Chen, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis,
Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas,
Todd C. Mowry, and Chris Wilkerson. 2007. Scheduling Threads for Constructive
Cache Sharing on CMPs. Symposium on Parallel Algorithms and Architectures.
[5] Hui Cheng. 2010. A High Efficient Task Scheduling Algorithm Based on
Heterogeneous Multi-Core Processor. 2010 2nd International Workshop on Database
Technology and Applications 3 (2010), 1–4. https://doi.org/10.1109/DBTA.2010.5659041
[6] Kallia Chronaki, Alejandro Rico, Rosa M. Badia, Eduard Ayguadé, Jesús Labarta,
and Mateo Valero. 2015. Criticality-Aware Dynamic Task Scheduling for
Heterogeneous Architectures. In Proceedings of the 29th ACM on International
Conference on Supercomputing (ICS ’15). ACM, New York, NY, USA, 329–338.
https://doi.org/10.1145/2751205.2751235
[7] K. Chronaki, A. Rico, M. Casas, M. Moretó, R. M. Badia, E. Ayguadé, J. Labarta,
and M. Valero. 2017. Task Scheduling Techniques for Asymmetric Multi-Core
Systems. IEEE Transactions on Parallel and Distributed Systems 28, 7 (July 2017),
2074–2087. https://doi.org/10.1109/TPDS.2016.2633347
[8] Brice Goglin. 2016. Towards the Structural Modeling of the Topology of
next-generation heterogeneous cluster Nodes with hwloc. Research Report. Inria.
https://hal.inria.fr/hal-01400264
[9] Ravi Iyer, Li Zhao, Fei Guo, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan
Solihin, Lisa Hsu, and Steve Reinhardt. 2007. QoS Policies and Architecture for
Cache/Memory in CMP Platforms. In Proceedings of the 2007 ACM SIGMETRICS
International Conference on Measurement and Modeling of Computer Systems
(SIGMETRICS ’07). ACM, New York, NY, USA, 25–36.
https://doi.org/10.1145/1254882.1254886
[10] David Koufaty, Dheeraj Reddy, and Scott Hahn. 2010. Bias Scheduling in
Heterogeneous Multi-core Architectures. In
Proceedings of the 5th European conference on Computer systems. 125–138.
[11] Etienne Le Sueur and Gernot Heiser. 2010. Dynamic Voltage and Frequency
Scaling: The Laws of Diminishing Returns. In Proceedings of the 2010
International Conference on Power Aware Computing and Systems (HotPower’10). USENIX
Association, Berkeley, CA, USA, 1–8.
http://dl.acm.org/citation.cfm?id=1924920.1924921
[12] Miquel Pericas. 2018. Elastic Places: an adaptive resource manager for
scalable and portable performance. ACM Transactions on Architecture and Code
Optimization 15, 2 (June 2018). https://doi.org/10.1145/3185458
[13] Moinuddin K. Qureshi and Yale N. Patt. 2006. Utility-Based Cache Partitioning:
A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared
Caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO 39). IEEE Computer Society, Washington, DC, USA,
423–432. https://doi.org/10.1109/MICRO.2006.49
[14] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You
Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (Jun 2016).
https://doi.org/10.1109/cvpr.2016.91
[15] Alina Sbirlea, Kunal Agrawal, and Vivek Sarkar. 2015. Elastic Tasks: Unifying
Task Parallelism and SPMD Parallelism with an Adaptive Runtime. In Euro-Par
2015: Parallel Processing. Lecture Notes in Computer Science, Vol. 9233. Springer,
491–503. https://doi.org/10.1007/978-3-662-48096-0_38
[16] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Net-
works for Large-Scale Image Recognition. arXiv:cs.CV/1409.1556
[17] H. Topcuoglu, S. Hariri, and Min-You Wu. 2002. Performance-effective and
low-complexity task scheduling for heterogeneous computing. IEEE Transactions on
Parallel and Distributed Systems 13, 3 (March 2002), 260–274.
https://doi.org/10.1109/71.993206
[18] Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, and Joel
Emer. 2012. Scheduling heterogeneous multi-cores through performance impact
estimation (PIE). In Computer Architecture (ISCA), 2012 39th Annual International
Symposium. 213–224. https://doi.org/10.1109/ISCA.2012.6237019
[19] Martin Wimmer and Jesper Larsson Träff. 2011. Work-stealing for Mixed-mode
Parallelism by Deterministic Team-building. In Proceedings of the Twenty-third
Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA
’11). 105–116. https://doi.org/10.1145/1989493.1989507
[20] Sergey Zhuravlev, Sergey Blagodurov, and Alexandra Fedorova. 2010.
Addressing Shared Resource Contention in Multicore Processors via Scheduling. In
Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for
Programming Languages and Operating Systems (ASPLOS XV). ACM, New York, NY,
USA, 129–142. https://doi.org/10.1145/1736020.1736036
