Scheduling Task-parallel Applications in Dynamically Asymmetric
  Environments by Chen, Jing et al.
Scheduling Task-parallel Applications in
Dynamically Asymmetric Environments
Jing Chen
Chalmers University of Technology
chjing@chalmers.se
Pirah Noor Soomro
Chalmers University of Technology
pirah@chalmers.se
Mustafa Abduljabbar
Chalmers University of Technology
musabdu@chalmers.se
Madhavan Manivannan
Chalmers University of Technology
madhavan@chalmers.se
Miquel Pericàs
Chalmers University of Technology
miquelp@chalmers.se
Abstract
Shared resource interference is observed by applications as
dynamic performance asymmetry. Prior art has developed
approaches to reduce the impact of performance asymmetry
mainly at the operating system and architectural levels. In
this work, we study how application-level scheduling tech-
niques can leverage moldability (i.e. flexibility to work as
either single-threaded or multithreaded task) and explicit
knowledge on task criticality to handle scenarios in which
system performance is not only unknown but also chang-
ing over time. Our proposed task scheduler dynamically
learns the performance characteristics of the underlying
platform and uses this knowledge to devise better schedules
aware of dynamic performance asymmetry, hence reducing
the impact of interference. Our evaluation shows that both
criticality-aware scheduling and parallelism tuning are ef-
fective schemes to address interference in both shared and
distributed memory applications.
CCS Concepts • Computer systems organization →
Multicore architectures.
Keywords Interference awareness, Task scheduling, Asym-
metry
ACM Reference Format:
Jing Chen, Pirah Noor Soomro, Mustafa Abduljabbar, Madhavan
Manivannan, and Miquel Pericàs. 2020. Scheduling Task-parallel
Applications in Dynamically Asymmetric Environments. In 49th
International Conference on Parallel Processing - ICPP : Workshops
(ICPP Workshops ’20), August 17–20, 2020, Edmonton, AB, Canada.
ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3409390.3409408
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
ICPP Workshops ’20, August 17–20, 2020, Edmonton, AB, Canada
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8868-9/20/08...$15.00
https://doi.org/10.1145/3409390.3409408
1 Introduction
Modern computer systems are designed to handle a large va-
riety of dynamic events triggered by multiple sources, such
as I/O, user activity, scheduled jobs, O/S, power management,
etc. Applications sharing system resources will observe a
performance (i.e. execution time) variability during such
interfering events. Understanding the causes and proper-
ties of interference is thus of great importance. Prior work
has mainly focused on HPC environments, where execution
time delays are particularly costly [3, 14, 29]. In these sce-
narios, common strategies to reduce interference include
eliminating unnecessary activities such as process schedul-
ing or memory management [23], and carefully managing
interrupts to minimize application perturbation [27]. As the
scale and heterogeneity of systems keeps increasing, tack-
ling interference becomes an even larger concern [24]. While
the aforementioned techniques can generally mitigate inter-
ference, sources of performance variability are overall very
diverse in modern highly interconnected systems, and can
only be addressed effectively in cooperation with application
knowledge.
Applications are typically not aware of interference, but
rather observe such events as temporary episodes of perfor-
mance asymmetry. Performance asymmetry can be defined
as the case when individual cores have different progress
rates (e.g. MIPS) [4]. From an architectural perspective, sources
of asymmetry can be broadly categorized into fixed and dy-
namic. Fixed sources appear due to hardware features, for
example, same-ISA cores with different compute capabilities
(e.g. big.LITTLE [2]). Hence, fixed asymmetry is permanent
and does not evolve over time. Dynamic sources of perfor-
mance asymmetry arise from execution-time activities such
as DVFS for power management [20] or the sharing of re-
sources between applications [31].
Performance asymmetry is particularly problematic for
multithreaded applications with frequent synchronization. A
simple event slowing down the execution of a single thread
can potentially result in a global performance degradation
by delaying sibling threads waiting at a synchronization
point. Unlike related work that has focused on system-level
approaches, this paper explores runtime-level techniques
applicable to DAG-based parallel applications. Runtime sys-
tems usually operate at the user level and have little control
over the system. Hence, one of the scenarios that we consider
is DVFS activity that is beyond the control of the runtime system.
The second scenario that we consider is that of applications
that are co-scheduled for execution, as is generally the case
in HPC environments and in data centers. Scheduling such
applications in scenarios with fixed performance asymme-
try has been studied recently. For example, CATS [9] and
GCA [16] are schedulers that use algorithms to identify the
critical path to prioritize tasks and schedule them on the
faster cores. However, if faster cores suffer from interfer-
ence, these schedulers will continue placing critical tasks
on the perturbed partitions (a set of execution resources
such as cores or sockets), potentially leading to sub-optimal
performance. Continuous introspection is a step towards
handling more dynamic environments. The AllScale run-
time [1] collects runtime performance information and feeds
it into an optimiser module that can tune thread numbers
and use DVFS to optimize the execution. However, this run-
time targets homogeneous architectures, and the solution
is not applicable to systems exhibiting dynamic per-core
performance asymmetry (such as interference). The high
performance scheduling of DAGs on multicore systems in
which cores are not only performance asymmetric but have
performance that varies at runtime is, to the best of our
knowledge, an unsolved problem in the literature.
Our study indicates that statically mapping critical tasks
to a fixed core results in suboptimal performance in the pres-
ence of dynamic performance asymmetry. Our first observa-
tion is that a performance model that dynamically learns and
updates the platform’s performance characteristics needs to
be developed to quickly yet consistently identify interfer-
ence. In this paper, we leverage a simple tracing scheme
called Performance Trace Table (PTT) [28] that can be used
to predict the performance characteristics on a per-task level.
Next, we develop a task scheduler targeting systems with
unknown and variable performance characteristics. Starting
from a fixed-asymmetry scheduling baseline similar to CATS,
we explore two novel directions. In the first approach, we
leverage the PTT to steer critical tasks to the higher-performing
cores, in an attempt to speed up the critical tasks. In
the second approach, we use the performance model (PTT)
to choose an appropriate partition of cores for each task.
This is done to reduce interference resulting from resource
oversubscription. The aforementioned approaches are then
evaluated in two specific interference scenarios: dynamic
power management and job co-scheduling. These scenarios
occur frequently in HPC environments and in data centers
due to techniques employed for improving utilization and en-
ergy efficiency under power limits. While aggregating cores
to reduce oversubscription provides improved robustness to
interference, our findings indicate that steering critical tasks
provides higher gains in the presence of interfering activities
and should be thus prioritized, unless the workload cannot
differentiate between critical and non-critical tasks.
In summary, this paper proposes application-level schedul-
ing techniques to improve interference-robustness in parallel
applications. The main contributions of this work are as fol-
lows:
• We explore the performance of random work stealing
and modern fixed-asymmetry criticality-based sched-
ulers in the presence of interference and observe that
the achieved performance is largely suboptimal.
• We show how simple online trace-based models can be
used to model the platform’s performance characteris-
tics on a per-task basis, including the case in which the
per-core performance characteristics vary over time.
• We explore two techniques to reduce the impact of
interference. First, we analyze the impact of steering
critical tasks to cores dynamically detected as being
fastest according to the trace-based model. Second, we
study the impact of assigning multi-core partitions for
the execution of each individual task.
The remainder of this paper is organized as follows. In
Section 2 we describe the execution model that underlies
this research. We present our runtime approach in Section 3.
Section 4 describes the experimental setup, including the
implementation and the benchmarks which are used for the
evaluation of our approach in Section 5. Finally, Section 6
describes related work, while Section 8 concludes the work.
2 Background
We consider DAGs composed of tasks having either high
or low priority. DAGs are commonly used to describe the
computations of multithreaded applications [5, 6, 11] and of
workflows composed of multiple dependent programs [22].
The DAG model is a very general model for program execu-
tion that can support both regular and irregular computa-
tions. Regular computations can be described by creating all
nodes and edges ahead of program execution (static DAG).
Iterative programs with intra-loop parallelism can be de-
scribed by unrolling the external loop and generating a layer
of tasks for each iteration. In addition, irregular computa-
tions can be described by allowing tasks to conditionally
insert new tasks into the DAG at runtime (dynamic DAG).
User-specification of task criticality is common in task-parallel
models. It is supported, for example, by OpenMP task
priorities [6]. Criticality can also be inferred dynamically
by the runtime system [9]. Tasks identified as high priority
include tasks that release a large number of dependent tasks,
or tasks that lie on the DAG’s critical path. The remaining
tasks are then classified as low priority tasks. In the sample
DAG shown in Figure 1, tasks T0, T1 and T5, marked with
Figure 1. An illustration of a task DAG. DAG parallelism is
4. T0, T1, T5 and T9 are marked as high priority tasks.
darker color, are high priority tasks while the rest are low
priority tasks. Another important attribute in this model that
describes the DAG’s concurrency is the DAG parallelism [5].
We define it as the total number of tasks divided by the
length of the longest path. For instance, the partial DAG
shown in Figure 1 has a DAG parallelism of 4. The platform
model considers multiple execution resources, hereafter
referred to as either cores or threads. Cores share the same ISA,
but their performance is not necessarily aligned. Per-core
performance is determined by static factors such as core
asymmetry (big cores, little cores), and dynamic factors such
as dynamic voltage-frequency scaling or time-sharing with
other processes. We consider that task execution is moldable,
i.e. a single task can run in parallel on a variable number
of cores. The set of resources allocated for the execution of
a task is called its execution place. Formally, an execution
place is a tuple of two numbers (core, resource width) where
core identifies the starting thread number, and resource width
describes how many threads cooperate to execute the task.
Finally, a resource partition comprises sets of execution re-
sources. Meaningful resource partitions are those cores that
share, for example, cache levels, memory channels, NUMA
nodes, etc. For instance, the NVIDIA Jetson TX2 platform used
in our experiments consists of two resource partitions: a
dual-core NVIDIA Denver cluster and a quad-core ARM A57
cluster, each with its own shared L2 cache. In this case, the
supported resource widths for a task running on the Denver
partition are 1 and 2, whereas the supported resource widths
for a task on the A57 partition are 1, 2 and 4, as shown in
Figure 2(a).
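The execution-place definition above can be made concrete with a small sketch. It enumerates the (core, resource width) tuples for the two TX2 partitions described in the text. The core numbering (Denver as cores 0-1, A57 as cores 2-5), the restriction to power-of-two widths and the width-aligned leader cores are assumptions made for illustration; the function names are hypothetical and are not taken from XiTAO.

```python
from typing import Dict, List, Tuple

Place = Tuple[int, int]  # (leader core, resource width)

# Assumed layout of the Jetson TX2 partitions: name -> (first core id, core count).
PARTITIONS: Dict[str, Tuple[int, int]] = {
    "denver": (0, 2),  # dual-core Denver cluster, assumed to be cores 0-1
    "a57": (2, 4),     # quad-core A57 cluster, assumed to be cores 2-5
}


def supported_widths(num_cores: int) -> List[int]:
    """Powers of two up to the partition size (assumption matching 1,2 and 1,2,4)."""
    widths = []
    width = 1
    while width <= num_cores:
        widths.append(width)
        width *= 2
    return widths


def execution_places(first_core: int, num_cores: int) -> List[Place]:
    """Enumerate (core, width) places; leader cores are assumed width-aligned."""
    places = []
    for width in supported_widths(num_cores):
        for offset in range(0, num_cores, width):
            places.append((first_core + offset, width))
    return places


def cores_of(place: Place) -> List[int]:
    """Cores that cooperate on a task assigned to the given execution place."""
    core, width = place
    return list(range(core, core + width))
```

Under these assumptions the Denver partition yields three execution places and the A57 partition seven, and a task placed at (2, 4) occupies cores 2 to 5.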
3 Dynamic Asymmetry Scheduler
Work stealing schedulers randomly map tasks irrespective
of the capabilities of the underlying resources or changes
in the execution environment, and can suffer from resource
contention and from load imbalance especially when sched-
uling applications with low DAG parallelism. In this section,
we present our proposed scheduling solution that adapts to
various forms of dynamic asymmetry while making no prior
assumptions about the underlying architecture or nature of
the workload.
Algorithm 1 Dynamic Asymmetry Scheduler
Input: task type, core id, task priority, Trace Model (TM)
Output: execution place
if low priority task then
    Local search to minimize TM(core id, width) × width
end if
if high priority task then
    if scheduler == DAM-C then
        Global search to minimize TM(core id, width) × width
    end if
    if scheduler == DAM-P then
        Global search to minimize TM(core id, width)
    end if
end if
3.1 Overview
To mitigate the effects of variability, the scheduler should
detect and react to dynamic asymmetry as soon as it appears.
The proposed dynamic asymmetric scheduler builds on two
key ideas. Firstly, it detects dynamic asymmetry through
online performance (i.e. execution time) monitoring of tasks.
This leverages the observation that applications typically
experience the effects of dynamic asymmetry in the form
of performance variation over a period of time. Secondly, it
schedules to minimize the impact of interference. Two techniques
are explored to achieve this: 1) predicting the best
possible execution place for high priority tasks according to
the online performance model, and 2) molding tasks (i.e. choosing
the number of assigned cores) to reduce inter-task contention and
resource oversubscription, i.e. when a certain hardware
resource (e.g. cache level size) does not fit the task requirements.
These techniques allow the scheduler to enhance
resource usage and improve throughput by exploiting the
DAG parallelism. To facilitate detection and adaptation, the
scheduler requires an online trace model that predicts the
best execution place for a task across the different resource
partitions and also if there is potential benefit by enabling
moldable execution. We describe the model utilized in this
work in Section 4.1.1.
3.2 Scheduling Algorithm
In this section, we describe the proposed scheduling tech-
niques by introducing our scheduling algorithm targeting
dynamic asymmetry. Algorithm 1 describes how ready tasks
are assigned their execution places. This algorithm is in-
voked by the worker after dequeuing a ready task from the
local work queue but prior to execution. In the case of a low
priority task, the scheduler attempts to determine the best
resource width while keeping the mapping of the task to
its local resource partition and the core fixed. This policy
enhances data-reuse across dependent tasks. In order to de-
termine the best resource width, the scheduler leverages the
trace model, which returns the predicted execution time
on a given resource partition. This model is maintained for
each task type. Finally, the width that minimizes the paral-
lel cost is selected by minimizing the product of resource
width and predicted execution time. We refer to the process
of determining the best resource width as local search since
it involves keeping the resource partition and the core fixed
while molding only the resource width. In the case of a high
priority task, the scheduler attempts to determine the best
execution place for the task among the different resource
partitions in the system. As discussed previously, the trace
model is leveraged to obtain execution time prediction for
the possible places that a task can be mapped to. The ex-
ecution place that minimizes parallel cost by finding the
lowest product of resource width and predicted execution
time is selected. We refer to this process as global search since
it involves sweeping through all possible execution places.
Given that the number of high priority tasks is usually only a
small fraction, this strategy should, in principle, not result in
overcommitting the fast cores. We refer to this scheduler as
DAM-C (Dynamic Asymmetry scheduler with Moldability,
targeting parallel Cost) hereafter since the scheduler strives
to reduce parallel cost and minimize resource usage.
In scenarios where parallelism is limited, reducing the par-
allel cost of tasks can be ineffective as it can lead to increased
core idleness. Thus, we propose a variant that performs a
global search for critical tasks and selects the execution place
that minimizes the predicted execution time. This scheduler
is hereafter referred to as DAM-P (Dynamic Asymmetry
scheduler with Moldability, with critical tasks targeting best
parallel Performance) since the scheduler strives to improve
parallel performance of the critical tasks.
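The two search modes of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the XiTAO implementation: the trace model is assumed to be a plain dictionary from execution place to predicted execution time for one task type, and all names (`local_search`, `global_search`, `choose_place`, `EXAMPLE_TM`) and the example timings are hypothetical.

```python
from typing import Dict, List, Tuple

Place = Tuple[int, int]          # (leader core, resource width)
TraceModel = Dict[Place, float]  # predicted execution time per place (one task type)

# Made-up predictions, for illustration only.
EXAMPLE_TM: TraceModel = {
    (0, 1): 10.0, (1, 1): 4.0, (0, 2): 3.0,
    (2, 1): 12.0, (2, 2): 7.0,
}


def local_search(tm: TraceModel, core: int, widths: List[int]) -> Place:
    """Low priority: keep the core fixed and pick the width with lowest time x width."""
    candidates = [(core, w) for w in widths if (core, w) in tm]
    return min(candidates, key=lambda p: tm[p] * p[1])


def global_search(tm: TraceModel, minimize_cost: bool) -> Place:
    """High priority: sweep every execution place known to the trace model."""
    if minimize_cost:
        return min(tm, key=lambda p: tm[p] * p[1])  # DAM-C: parallel cost
    return min(tm, key=lambda p: tm[p])             # DAM-P: execution time


def choose_place(tm: TraceModel, core: int, high_priority: bool,
                 scheduler: str, widths: List[int]) -> Place:
    """Dispatch between local and global search as in Algorithm 1."""
    if not high_priority:
        return local_search(tm, core, widths)
    if scheduler == "DAM-C":
        return global_search(tm, minimize_cost=True)
    if scheduler == "DAM-P":
        return global_search(tm, minimize_cost=False)
    raise ValueError("unknown scheduler: " + scheduler)
```

With the example predictions, a high priority task goes to (1, 1) under DAM-C (cost 4.0) but to (0, 2) under DAM-P (time 3.0), which shows how the two objectives can diverge: DAM-P accepts a higher parallel cost of 6.0 to obtain the shortest predicted execution time.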
4 Implementation and Experimental
Methodology
We describe an implementation of the trace model referred
to in Section 3.2. This implementation is an online
performance model (called PTT) and is described in
Section 4.1.1. In addition, the implementation of the dynamic
asymmetry scheduler is described in Section 4.1.2. Both
components are implemented on top of XiTAO (see footnote 1),
a DAG runtime written in C++11 and designed to evaluate
scheduling policies [26]. Section 4.2 describes the experimental
methodology used to evaluate the schedulers.
4.1 Implementation Details
4.1.1 Performance Trace Table (PTT)
Figure 2. PTT organization for four cores: (a) Resource Width, (b) Performance Trace Table. Cx denotes the
core number, EP(x,y) represents the task’s execution place.
Footnote 1: https://github.com/mpericas/xitao.git
Figure 2(a) shows a representation of a core-cluster with four
cores and three possible resource widths of 1, 2 and 4. The
corresponding PTT organization is shown in Figure 2(b). The
goal of this table is to produce a performance estimate for
every possible resource partition that can be assigned to a task.
The number of entries in the table is thus the product of the
number of cores and the number of valid resource widths. The entries
are initialized to zero. This ensures that all possible execution
places are evaluated at least once during the first phases of
execution. Due to the decentralized implementation of the
scheduler, the table is organized such that individual rows
fit into cache lines. Each core mainly accesses a single cache
line indexed with its own core id, keeping access latency
low. Each entry of the PTT keeps track of the execution time
of the task, as observed by the leader core, with a specific
width, which significantly simplifies the implementation. To
avoid large fluctuations in the values of the PTT, which could
result from short isolated events, we update the PTT entries
by computing a weighted average.
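The effect of zero-initialized entries can be illustrated with a small sketch. It assumes a dictionary-backed table and a search that minimizes time × width. The class and function names are hypothetical, the measured times are made up, and `record` simply stores the measurement (a simplification of the weighted update).

```python
from typing import Callable, Dict, Iterable, List, Tuple

Place = Tuple[int, int]  # (leader core, resource width)


class PerformanceTraceTable:
    """Hypothetical per-task-type table of observed execution times."""

    def __init__(self, places: Iterable[Place]) -> None:
        # Zero means "never measured"; such entries win any minimum search.
        self.time: Dict[Place, float] = {place: 0.0 for place in places}

    def best_place(self) -> Place:
        """Place with the lowest predicted parallel cost (time x width)."""
        return min(self.time, key=lambda p: self.time[p] * p[1])

    def record(self, place: Place, measured: float) -> None:
        """Store a measurement directly (simplification of the weighted update)."""
        self.time[place] = measured


def explore(ptt: PerformanceTraceTable,
            measure: Callable[[Place], float], steps: int) -> List[Place]:
    """Repeatedly pick the best place, 'run' a task there and record its time."""
    visited = []
    for _ in range(steps):
        place = ptt.best_place()
        ptt.record(place, measure(place))
        visited.append(place)
    return visited
```

Because an unmeasured entry has a predicted cost of zero, the search keeps selecting unmeasured places until each one has been tried once; only then do the recorded times drive the decision.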
Our sensitivity analysis (see Section 5.3) suggests that
each entry be updated with a weighted ratio of 1:4, that
is, updated_value = [(4 × old_value) + (1 × new_value)] / 5.
This averaging ensures that, after a performance variation,
more than three measurements need to be taken before the PTT
value becomes closer to the new value: after three updates the old
value still carries a weight of (4/5)^3 ≈ 0.51. Although weighted
averaging results in an additional PTT read operation, it is
crucial to be resilient to divergent measurements as the PTT
is critical for making scheduling decisions. One such table
is instantiated for each task type as the performance varies
per type.
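The arithmetic of the 1:4 weighted update can be checked with a short sketch. The function names are hypothetical, and the scenario (a stable old execution time followed by a persistent new one) is an assumption chosen to show how quickly an entry adapts.

```python
def weighted_update(old_value: float, new_value: float,
                    old_weight: int = 4, new_weight: int = 1) -> float:
    """Weighted average with ratio new:old = 1:4, as in the formula above."""
    return (old_weight * old_value + new_weight * new_value) / (old_weight + new_weight)


def updates_until_closer(old_time: float, new_time: float) -> int:
    """Count updates until the entry is closer to new_time than to old_time."""
    if old_time == new_time:
        raise ValueError("old_time and new_time must differ")
    value = old_time
    count = 0
    while abs(value - new_time) >= abs(value - old_time):
        value = weighted_update(value, new_time)
        count += 1
    return count
```

After n consecutive measurements of the new value, the old value retains a weight of (4/5)^n, which is about 0.51 for n = 3 and about 0.41 for n = 4. In this sketch the entry therefore crosses the midpoint on the fourth update, while a single outlier (for example 100 against a stable 10) only moves the entry to 28.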
Setting up the PTT only requires information about the
number of cores and their organization into core-clusters
with shared caches. This information can readily be obtained
using tools like hwloc [7]. Upon completion of each task, the
workers simply update the corresponding index in the PTT
thus implementing a dynamic online model for performance
prediction for a particular task type. A task type refers to
each function implemented as a task. Within XiTAO it refers
to the C++ class describing the functionality. Note that there
is one PTT for each task type. Task execution times are
measured during normal execution, instead of a profiling
phase. The measured execution times are thus impacted by
co-scheduling of other tasks in the system and the shared re-
source interference they generate. As long as the application
behavior does not change quickly, estimates generated by
the performance model will be aware of co-scheduled tasks.
Although performance prediction using PTT is simple, our
evaluation, in Section 5 using different platforms, shows that
this approach is effective for improving performance in the
presence of dynamic asymmetry. The overhead of globally
searching the whole PTT is in the order of one microsecond
on our evaluation platform (NVIDIA Jetson TX2). We are
aware, however, that the design may result in non-negligible
overheads when scaling to platforms with a large number
of execution places and cores. The design and evaluation
of scalable performance prediction models is left for future
work.
4.1.2 Dynamic Asymmetry Scheduler
The scheduler is implemented as an extension on top of the
XiTAO runtime system. XiTAO relies on work stealing [5]
for assigning tasks to workers and supports moldable execu-
tion of tasks. This is achieved by implementing two queues
for each worker: a Work Stealing Queue (WSQ) and a FIFO
Assembly Queue (AQ) [26]. The WSQs hold the ready tasks
and use random work stealing for load balancing. The actual
execution place is selected only after a task becomes ready
(i.e. after all its input dependencies have been satisfied). At
this point, pointers to the tasks are inserted into all AQs
representing the execution place for the task, from where
they are finally executed by the cores. We disable the steal-
ing of high priority tasks in order to guarantee that all such
tasks are executed according to their scheduling decision
(core id and optimal resource width). Low-priority tasks,
on the other hand, are subject to random work stealing by
idle workers. Figure 3 illustrates the operation of the sched-
uler when executing the task DAG shown in Figure 1. The
steps denote the lifetime of a task from wake-up (release
by predecessor) to commit (finalization). We assume that
T0 has been executed on core 0 with resource width=1 and
has updated the PTT entry (c=0,w=1). Core 0 then wakes
up the child tasks of T0, namely T1 (high priority task), T2,
T3 and T4 (low-priority tasks). During steps 1 and 2 (shown
in Figure 3), ready tasks read the PTT to determine the best
execution place. For this example, we assume that the best
PTT configurations for T1, T2, T3 and T4 are (2,2), (0,2), (0,2)
and (0,2), respectively. For T1, we globally search the PTT to
determine the best execution place and insert the task in the
WSQ of core 2, while for the other tasks, i.e. T2, T3, T4, we
insert them into the local queue of their parent T0, i.e. the
WSQ of core 0. When the WSQs of cores (C1,C3) are empty,
the workers attempt to steal low-priority tasks from other
WSQs that have more tasks (as indicated in step 3). Cores 1
and 3 successfully manage to steal T2 and T4 from the WSQ
of core 0. After a successful steal, the PTT is visited again
(step 4) to determine the best execution place. For T2 and
T4, width=1 is chosen after performing a local search of the
PTT again (instead of width=2 from the earlier search) as
indicated in step 5. After this step, these tasks are distributed
to the corresponding AQs as indicated in step 6. In step 7, the
cores fetch task partitions from their own AQs for execution.
Figure 3. Implementation overview of Dynamic Asymmetry
Scheduler on top of the XiTAO runtime system.
After the leader core finishes execution, it updates the
corresponding PTT entries using the weighted update described
in Section 4.1.1, which is necessary for training the table
(step 8). As cores 2 and 3 can execute T1 asynchronously,
the runtime takes no further action until both cores finish
executing the priority task. The last core to complete the
execution of the task then wakes up the dependent tasks
from the task pool (T5 to T8).
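The queue handling described in this section can be sketched as follows. This is a simplified, single-threaded illustration rather than the XiTAO implementation: all class and function names are hypothetical, and victims are scanned in a fixed order instead of being chosen at random so that the behavior is reproducible.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional, Tuple


@dataclass
class Task:
    name: str
    high_priority: bool = False


@dataclass
class Worker:
    wsq: Deque[Task] = field(default_factory=deque)  # Work Stealing Queue
    aq: Deque[Task] = field(default_factory=deque)   # FIFO Assembly Queue


def make_workers(count: int) -> List[Worker]:
    return [Worker() for _ in range(count)]


def enqueue_ready(workers: List[Worker], task: Task, core: int) -> None:
    """Insert a ready task into the WSQ of the chosen core."""
    workers[core].wsq.append(task)


def steal(workers: List[Worker], thief: int) -> Optional[Task]:
    """Take a low priority task from another WSQ; high priority tasks stay put."""
    for victim_id, victim in enumerate(workers):
        if victim_id == thief:
            continue
        for task in list(victim.wsq):
            if not task.high_priority:
                victim.wsq.remove(task)
                return task
    return None


def dispatch(workers: List[Worker], task: Task, place: Tuple[int, int]) -> None:
    """Insert the task into the AQ of every core in its execution place."""
    core, width = place
    for c in range(core, core + width):
        workers[c].aq.append(task)
```

For example, with T1 (high priority) queued on core 2 and T2, T3 and T4 queued on core 0, idle workers can drain the low priority tasks from core 0 but never obtain T1; dispatching T1 to place (2, 2) then puts it in the assembly queues of cores 2 and 3. Which low priority task a given thief obtains depends on the stealing policy, which this sketch does not try to reproduce.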
4.2 Experimental Methodology
4.2.1 Evaluation Platforms
We evaluate the proposed scheduler on two platforms. The
first is an NVIDIA Jetson TX2 development board, featuring
a dual-core NVIDIA Denver 2 64-bit CPU and a quad-core
ARM A57 cluster (each with 2 MB L2 cache). The platform is
asymmetric since the Denver cores are generally faster than
the A57 cores. We use this platform to evaluate interference
due to co-running applications and scenarios of DVFS
interference on a fixed asymmetric platform. The code is compiled
using gcc 5.4.0 on Linux version 4.4.38-tegra. The
second platform consists of four dual-socket 10-core Intel
2650v3 (code-named "Haswell") nodes connected via Mellanox
ConnectX-3 FDR InfiniBand and running Linux version
3.10.0. We use the Haswell cores as a symmetric platform
to evaluate interference due to co-running applications. The
code is compiled using the icpc (ICC) 19.0.5.281 compiler. For MPI
code, we leverage the Intel MPI Library v2018.
4.2.2 Benchmarks
We evaluate the described scheduling techniques using
synthetic benchmarks and two applications: K-means and Heat.
Synthetic Directed Acyclic Graph: We construct
synthetic DAGs with the notion of priority tasks and with a
specific DAG parallelism. In the DAG, each layer consists of the
same number of tasks P, equal to the DAG parallelism, all of the
same task type. One of the tasks in each layer is marked as critical.
Upon the execution of the critical task, another set of P tasks
with the same characteristics is released. The constructed
DAG comprises nodes that represent one of three kernels:
Matrix Multiplication, Copy and Stencil. Each DAG node
(task) is moldable and can execute on a variable number of
processors. The node types are described below:
Matrix Multiplication represents the compute-intensive
class of workloads. Matrices A, B and C are pre-allocated and
partitioned in tiles of N × N. This kernel executes general
matrix multiplication calls within the graph nodes.
Copy represents the memory-intensive class of workloads.
This kernel reads and writes large portions of data to memory,
effectively creating a streaming behavior whereby the main
memory is accessed continuously.
Stencil represents the cache-intensive class of workloads.
It performs repeated updates of values associated with points
on a multi-dimensional grid using the values at a set of
neighboring points.
Unless specified otherwise, the matrix tile size (per task)
for the MatMul kernel is 64×64, and the number of tasks
in the DAG is 32000. For the Copy and Stencil kernels,
the matrix tile size (per task) is 1024×1024, and the number of
tasks in the DAG is 10000 and 20000, respectively.
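The layered structure of the synthetic DAGs, and the DAG parallelism that results from it, can be sketched as follows. The function names are hypothetical, and two details are assumptions: the first task of each layer is taken to be the critical one, and the longest path is measured in number of tasks.

```python
from typing import Dict, List, Set, Tuple


def build_layered_dag(num_tasks: int, parallelism: int) -> Tuple[Dict[int, List[int]], Set[int]]:
    """Build layers of `parallelism` tasks; each layer's critical task releases the next layer.

    Tasks that do not fill a complete layer are dropped in this sketch.
    """
    layers = num_tasks // parallelism
    successors: Dict[int, List[int]] = {}
    critical: Set[int] = set()
    for layer in range(layers):
        first = layer * parallelism
        for task in range(first, first + parallelism):
            successors[task] = []
        critical.add(first)  # assumption: the first task of a layer is critical
        if layer + 1 < layers:
            next_first = first + parallelism
            successors[first] = list(range(next_first, next_first + parallelism))
    return successors, critical


def longest_path(successors: Dict[int, List[int]]) -> int:
    """Longest path length in tasks; relies on successors having larger ids."""
    depth: Dict[int, int] = {}
    for task in sorted(successors, reverse=True):
        depth[task] = 1 + max((depth[s] for s in successors[task]), default=0)
    return max(depth.values())


def dag_parallelism(successors: Dict[int, List[int]]) -> float:
    """Total number of tasks divided by the length of the longest path."""
    return len(successors) / longest_path(successors)
```

For a DAG of 12 tasks with P = 4, the critical tasks are 0, 4 and 8, the longest path contains three tasks, and the DAG parallelism is 12 / 3 = 4, equal to P as intended by the construction.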
K-means Clustering application is selected from the
Rodinia Benchmark suite [8]. It is a representative of the data-
parallel class of applications. The XiTAO runtime interface
supports loop-parallel constructs, and provides the ability to
tune the granularity of the loop task partitions and to nest
the loop in a graph node. These features are utilized in our
implementation to describe K-means as a dynamic DAG.
Distributed 2D Heat is implemented as an iterative dis-
tributed 2D stencil in which MPI calls are encapsulated into
specific TAOs that are responsible for exchanging the bound-
aries (ghost cells). There is one such exchange per iteration.
Due to the criticality of such communication, these MPI tasks
are marked as high priority tasks.
4.2.3 Scheduling Policies
To evaluate the various scheduling techniques described in
this paper in the context of both fixed and dynamic perfor-
mance asymmetry, we evaluate a set of scheduler config-
urations on the TX2 platform. The scheduling techniques
considered during the evaluation are summarized in Table 1.
The random work-stealing scheduler (RWS) behaves as a
decentralized greedy scheduler where each thread owns a
work-stealing queue. Irrespective of their priority, child tasks
are pushed to the local queues and allowed to be stolen in
order to mitigate load imbalance. We then extend this scheduler
with the option of moldability (RWSM-C), which allows
resources to be aggregated and tasks to be executed over a set of cores
with the goal of improving parallel cost in a similar fashion
to DAM-C. The implementation of RWSM-C requires the
integration of a performance model to select a number of
cores. To support this option, RWSM-C includes an imple-
mentation of the performance trace table (PTT), as described
in Section 4.1.1. The third scheduler is a criticality-based
scheduler designed for fixed (static) asymmetric architec-
tures, called Fixed Asymmetry Scheduler (FA). In this scheme,
priority information is processed by the runtime and high-
priority tasks are strictly mapped to statically faster cores
(the Denver cores in the context of the TX2 platform). The
scheduler is inspired by prior works like the Critical-Path-
on-a-Processor algorithm [30] and the Criticality-Aware Dy-
namic Task-Scheduling (CATS) [9], both of which are based
on the assumption that performance asymmetry remains
unchanged over time. Unlike CATS, our work does not ad-
dress the problem of determining task criticality dynami-
cally. Hence, FA and FAM-C (described next) rely on the
static scheme described in Section 2. Similar to RWS, we ex-
tend the FA scheduler with the option of moldability target-
ing parallel cost. This yields the fourth configuration called
FA+Moldability (FAM-C). Finally, we also implement a vari-
ant of the DAM schedulers but without moldability, simply
called Dynamic Asymmetry Scheduler (DA). This scheduler
only searches for the fastest core on which to execute criti-
cal tasks and then executes these tasks on a single core. All
these different scheduler configurations help us isolate and
evaluate the additional gains introduced by incorporating
dynamic asymmetry awareness, criticality-aware scheduling
and task moldability. We evaluate their impact in the next
section.
Table 1. Features summary of all evaluated schedulers
Name      [A]symmetry awareness   [M]oldability   Priority placement
RWS       N/A                     N/A             N/A
RWSM-C    N/A                     Yes             Resource [C]ost
FA        [F]ixed                 No              N/A
FAM-C     [F]ixed                 Yes             Resource [C]ost
DA        [D]ynamic               No              N/A
DAM-C     [D]ynamic               Yes             Resource [C]ost
DAM-P     [D]ynamic               Yes             [P]erformance
5 Performance Evaluation
The goal of the evaluation is to understand the performance
impact of the schedulers depicted in Table 1 during episodes
of interference. We evaluate two common scenarios that in-
troduce dynamic asymmetry during application execution.
The first is interference due to co-running applications. The
second is DVFS-based interference due to power management.
We evaluate the impact of dynamic asymmetry on both
asymmetric and symmetric hardware platforms.
We begin in Section 5.1 by evaluating and analyzing the
performance impact of interference arising from co-running
applications. Following this, in Section 5.2, we study the
impact of DVFS, while in Section 5.3 we evaluate the sensi-
tivity of the scheduler parameters to the task granularity and
the weighted update strategy of the PTT. These evaluations
are all conducted on the asymmetric NVIDIA Jetson TX2
platform. We then evaluate the impact of co-running applica-
tions together with the three applications on the symmetric
Intel Haswell platform in Section 5.4.
[Figure 4, panels (a) Matrix Multiplication, (b) Copy, (c) Stencil. X-axis: DAG Parallelism (2 to 6). Y-axis: Throughput [Tasks/s]. Series: RWS, RWSM-C, FA, FAM-C, DA, DAM-C, DAM-P.]
Figure 4. The performance impact of co-running application interference and comparison between different schedulers with
DAG parallelism ranging from 2 to 6.
[Figure 5, pie charts. Each label is an execution place (starting core, resource width) followed by its share of priority tasks.]
(a) RWS: (C0,1) 14%, (C1,1) 24%, (C2,1) 16%, (C3,1) 15%, (C4,1) 15%, (C5,1) 16%
(b) RWSM-C: (C0,1) 3.7%, (C1,1) 9%, (C0,2) 4.5%, (C2,1) 16%, (C3,1) 16.6%, (C2,2) 4.4%, (C4,1) 15.5%, (C5,1) 16%, (C4,2) 5.9%, (C2,4) 8.6%
(c) FA: (C0,1) 50%, (C1,1) 50%
(d) FAM-C: (C0,1) 35%, (C1,1) 48%, (C0,2) 17%
(e) DA: (C0,1) 2%, (C1,1) 98%
(f) DAM-C: (C0,1) 1.8%, (C1,1) 96.7%, (C0,2) 1.3%
(g) DAM-P: (C0,1) 2%, (C1,1) 92%, (C0,2) 2%, (C2,4) 4%
Figure 5. Distribution of priority tasks on each core.
5.1 Dynamic Asymmetry Awareness
Figure 4 presents the throughput of different schedulers in
the presence of a co-running application on core 0 that per-
sists throughout the whole execution time. The throughput
numbers are computed by dividing the total number of tasks
by the total execution time. For the case of matrix multipli-
cation and stencil synthetic DAG, the co-running application
consists of a single chain of tasks composed of matrix multi-
plication kernels. This results in CPU interference. For the
case of the copy synthetic DAG, the co-running application is
a single task chain of copy kernels, which results in memory
interference. The results show that the Dynamic Asymmetry
Schedulers, including DA, DAM-C and DAM-P, provide the
highest throughput for the different levels of DAG paral-
lelism and across different kernels, thanks to the ability to
adapt to interference. Although the fixed asymmetry sched-
ulers (FA and FAM-C) provide higher throughput than the
random work stealing variants by exploiting knowledge of
task criticality, they still leave considerable room for im-
provement. For the compute-bound matrix multiplication
kernel, DAM-C achieves up to a 3.5× speedup over RWS.
DAM-C also improves performance by up to 90% and 85%
over FA and FAM-C, respectively, across
different levels of DAG parallelism. Although DAM-P for
matrix multiplication is slightly worse than DA and DAM-C on
parallelisms 2 and 3, it still achieves much higher throughput
than other schedulers. In Figures 4(a) and 4(c), we can observe
that the performance of RWS, FA and FAM-C is somewhat
linearly proportional to the DAG parallelism, while DAM-C
and DAM-P already achieve close to the maximum through-
put when parallelism is low. We conduct additional analysis
to understand the gains from dynamically steering the ex-
ecution of priority tasks using our scheduler in dynamic
environments. For this, we consider matrix multiplication
with a DAG parallelism of 2 (in the presence of a co-running
application on core 0). In this configuration, high-priority tasks
constitute 50% of the entire DAG. Cores 0 and 1
refer to the two Denver cores, while cores 2 to 5 refer to the
four A57 cores. Figure 5 presents the distribution of priority
tasks on cores for each scheduler. The labels on the pie chart
denote the execution place. For example, (C2,4) means that
the starting core is 2 and the task resource width is 4, i.e. it
runs on cores {2,3,4,5}. Figure 6 shows the cumulative work
time for each thread. Note that it shows the accumulation
of kernels’ work time on each core excluding the runtime’s
activity and idleness.
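The execution place notation and the throughput metric used above can be sketched in a few lines of Python. The helper names are hypothetical; the percentages are the ones read off Figure 5(b) for RWSM-C, and the sketch simply sums the shares of all places that lie entirely on the Denver cores.

```python
def cores_of(place):
    """Expand an execution place (starting core, resource width) into core ids."""
    start, width = place
    return list(range(start, start + width))


def throughput(total_tasks, total_seconds):
    """Tasks per second: total number of tasks divided by total execution time."""
    return total_tasks / total_seconds


DENVER = {0, 1}  # cores 0 and 1 are the Denver cores, cores 2 to 5 are A57

# Share (in %) of priority tasks per execution place for RWSM-C, Figure 5(b).
rwsm_c = {
    (0, 1): 3.7, (1, 1): 9.0, (0, 2): 4.5, (2, 1): 16.0, (3, 1): 16.6,
    (2, 2): 4.4, (4, 1): 15.5, (5, 1): 16.0, (4, 2): 5.9, (2, 4): 8.6,
}

# Sum the shares of all places whose cores are all Denver cores.
denver_share = sum(
    share for place, share in rwsm_c.items() if set(cores_of(place)) <= DENVER
)

print(cores_of((2, 4)))        # [2, 3, 4, 5]
print(round(denver_share, 1))  # 17.2
```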
Figure 5(a) shows that with RWS, the priority tasks end
up being almost uniformly distributed across the 4 A57 cores
(around 15% each). The faster Denver core ends up executing
the largest fraction (24%) of the high priority tasks while the
Denver core that experiences interference executes the least
amount (14%) of high priority tasks. Since the background
activity is running on Denver core 0, core 1 is subsequently
faster, which results in 10% more total tasks (with a lower
work time as shown in Figure 6) running on core 1. Although
the RWS scheduler does not show a dramatic jump in the
work time of core 0, it still shows the worst performance
as it does not consider task priority. Figure 5(b) shows that,
when moldability is enabled in RWSM-C, only 17.2%
(3.7%+9%+4.5%) of priority tasks are executed on the Denver
cores because of the interference, which leads to less task stealing
from A57 to Denver cores compared with RWS. In this case,
RWSM-C with interference on Denver core 0 is better load
balanced than RWS when the parallelism is low, since more
priority tasks complete their execution on A57 cores and can
release new tasks faster. FA has a fixed notion of system
asymmetry and strictly assigns the priority tasks to the faster
cores, i.e. Denver cores 0 and 1. Figure 5(c) reflects this fact
by showing all priority tasks equally executed on the two
Denver cores. However, it is obvious that FA is not adaptive
to interference, which results in the highest execution time
on core 0 as shown in Figure 6. When moldability is enabled
(FAM-C), 17% of the critical tasks execute on both cores 0
and 1 (resource width = 2). Core 1 runs 13% more tasks than
core 0 due to interference on core 0. However, FAM-C is
still restrictive in the treatment of priority, because it has
no means for migration of priority tasks. This appears as a
loss in throughput compared to DA, DAM-C and DAM-P,
due to the delays in releasing the DAG parallelism. Finally,
the dynamic schedulers show more interference awareness
than the other schedulers and exhibit a similar trend across DA,
DAM-C and DAM-P, as shown in Figures 5 and 6. With DA, only
2% of all priority tasks run on the core that experiences
interference (core 0). The majority of tasks are migrated to the faster
core of the platform. This ensures that both the background
application and the foreground DAG are not affected signifi-
cantly. When moldability is used (DAM-C), 1.3% of priority
tasks are executed on (C0,2). DAM-P magnifies the impact
of priority by selecting the fastest possible execution place.
Figure 5(g) shows that with DAM-P most tasks (92%) execute
on the faster Denver core 1 with resource width 1, while 4%
of tasks execute on the four A57 cores (width = 4), since at some
points DAM-P predicts that spanning the whole A57 cluster
is faster during interference.
[Figure 6. X-axis: schedulers (RWS, RWSM-C, FA, FAM-C, DA, DAM-C, DAM-P). Y-axis: Execution Time [s] (0 to 50). Series: Core 0, Core 1, Core 2, Core 3, Core 4, Core 5, Total.]
Figure 6. Execution time of each thread while co-running
application runs on Denver core 0.
5.2 DVFS Awareness
To analyze the response to DVFS, we induce periodic fre-
quency changes in the TX2 Denver cluster alternating be-
tween the highest and the lowest frequency (2035 MHz and
345 MHz, respectively) with a 10 s period for a full cycle
(i.e. 5 s + 5 s). The performance impact of these events on the
different schedulers is shown in Figure 7. We note that DA,
DAM-C and DAM-P are all more resilient to DVFS than the other
schedulers. For instance, with the copy benchmark (shown
in Figure 7(b)), DAM-C achieves roughly 2.2× and 1.9× av-
erage performance speedup relative to RWS and RWSM-C
across different degrees of DAG parallelism. Additionally,
DAM-C provides average throughput improvements of 17% and 12%
over FA and FAM-C, respectively. Nevertheless,
DAM-P performs better than DA and DAM-C when parallelism
is low, since it tries to minimize the execution time of
priority tasks by selecting the fastest execution place irrespective
of parallel cost. Consequently, it is more likely to
exploit multiple cores per task. In contrast, the DAM-C variant
conservatively chooses lower widths during interference
events, because it reduces the parallel cost by excluding
the cores with perturbed performance. At low parallelism,
reducing parallel cost leads to a sub-optimal schedule due to
increased idleness. In such cases, DAM-P performs better as
it targets the highest parallel performance for priority tasks to
compensate for the low parallelism.
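The difference between the two selection criteria can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not the scheduler's code: the PTT snapshot is hypothetical, and parallel cost is assumed here to be the predicted execution time multiplied by the resource width.

```python
# Hypothetical PTT snapshot while the Denver cluster runs at low frequency:
# (starting core, resource width) -> predicted execution time in seconds.
ptt = {
    (0, 1): 0.80,  # Denver core 0
    (1, 1): 0.78,  # Denver core 1
    (0, 2): 0.45,  # both Denver cores
    (2, 1): 0.50,  # one A57 core
    (2, 2): 0.30,  # two A57 cores
    (2, 4): 0.20,  # whole A57 cluster
}


def parallel_cost(place, time):
    """Assumed definition: predicted time multiplied by the cores occupied."""
    _, width = place
    return time * width


def select_dam_c(table):
    """DAM-C-style choice: the place with the lowest parallel cost."""
    return min(table, key=lambda place: parallel_cost(place, table[place]))


def select_dam_p(table):
    """DAM-P-style choice: the place with the lowest predicted time."""
    return min(table, key=table.get)


print(select_dam_c(ptt))  # (2, 1)
print(select_dam_p(ptt))  # (2, 4)
```

In this made-up snapshot the cost-based choice stays on a single A57 core, while the performance-based choice spans the whole A57 cluster, which matches the qualitative behavior described above.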
5.3 Sensitivity Analysis
Figure 8 presents the sensitivity analysis of the weighted
update strategy of the PTT discussed in Section 4.1.1 and the
performance impact of different tile sizes for matrix
multiplication. The legend in Figure 8 denotes the weight
ratio of the new execution time. For instance, 2/5 means that
the new execution time is assigned a weight of 2 and the old
execution time in the PTT is assigned a weight of 5−2 = 3. Since
the L1 data cache size is different for A57 (32KB) and Denver
(64KB) cores, we test different tile sizes to understand the
impact of tile size on throughput. Note that with a tile size
of 32 the working set of a task fits in both the A57 and the
Denver L1 data caches, while with tile sizes of 64 and 80
it only fits in the Denver L1 data cache. The tile size of 96
represents the case in which the working set only fits in the
L2 data cache (2MB) of each cluster. It can be seen that the
performance is only affected by the weight ratio of the update
strategy when the tile size is 32, in which case 1/5 performs
best. The throughput gap between the
best and worst ratios is around 36%. When the tile size increases,
the weight ratio has much less impact on performance and
the throughput remains stable. Therefore, we select 1/5, i.e. 1:4,
for updating the PTT in this paper.
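A minimal Python sketch of such a weighted update is given below. It assumes the PTT entry is replaced by a normalized weighted average of the old and new times, which is one natural reading of the weight ratios above; the function names and the step-change scenario are hypothetical.

```python
def update_ptt(old_time, new_time, new_weight=1, total_weight=5):
    """Blend a new measurement into a PTT entry with ratio new_weight/total_weight."""
    old_weight = total_weight - new_weight
    return (old_weight * old_time + new_weight * new_time) / total_weight


def updates_to_adapt(new_weight, old=1.0, new=2.0, tolerance=0.05, total_weight=5):
    """Count updates until the entry is within `tolerance` (relative) of `new`."""
    estimate, steps = old, 0
    while abs(estimate - new) > tolerance * new:
        estimate = update_ptt(estimate, new, new_weight, total_weight)
        steps += 1
    return steps


print(update_ptt(10.0, 20.0, new_weight=1))  # 12.0
print(update_ptt(10.0, 20.0, new_weight=2))  # 14.0
print(updates_to_adapt(1), updates_to_adapt(2), updates_to_adapt(5))  # 11 5 1
```

The last line shows the general trade-off of this kind of average: a small ratio such as 1/5 needs more samples to follow a sudden change in execution time, but it is also less sensitive to a single noisy measurement.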
[Figure 7, panels (a) MatMul, (b) Copy, (c) Stencil. X-axis: DAG Parallelism (2 to 6). Y-axis: Throughput [Tasks/s]. Series: RWS, RWSM-C, FA, FAM-C, DA, DAM-C, DAM-P.]
Figure 7. The performance impact of DVFS and comparison between different schedulers.
[Figure 8. X-axis: Tile Size (32, 64, 80, 96). Y-axis: Throughput [Tasks/s] x1000. Series (weight ratio): 1/5, 2/5, 3/5, 4/5, 1.]
Figure 8. Sensitivity analysis of data tile size and weight
ratio of updating PTT.
5.4 Interference on Applications
We now discuss the performance of K-means and 2D Heat
on a 10-core dual-socket Haswell platform. In these experiments,
we run the interfering application on one of the two
sockets. One of the challenges that needs to be addressed
is that a simple model like the PTT may not have enough
training data within a single iteration to detect interference.
In other words, for the 20 cores of this configuration, there
are many resource partition choices to exhaust. Hence, the
co-running application starts a few iterations after the start
ensuring a reasonable window for training, since both codes
are iterative. In K-Means, we map the loop partitions to dy-
namically scheduled tasks and assign the high priority to
the task containing the largest work unit. In the case of 2D
Heat, the critical tasks are those that perform boundary point
communication. Figure 9(a) plots the execution time (Y-axis)
across different iterations (X-axis) for K-means. We drop the
FA and FAM-C schedulers because the Intel Haswell platform
is statically symmetric. The background interference inter-
val is marked by the blue dotted lines. On average, DAM-P
exhibits the best performance during interference. However,
due to the limited accuracy of system clocks and chang-
ing conditions during time measurement, variability arises
across iterations. The analysis in Figure 9 (b) and (c) provides
insight into the differences in the context of K-means. Each
curve here represents the number of tasks scheduled using
the corresponding execution place in the K-means application.
Figure 9(b) shows that RWS schedules tasks
on the socket experiencing interference (marked by dotted lines), and
that it also exhibits load imbalance, mostly due to intra-task interference
and resource over-subscription (140 tasks on core 15,
130 on core 8 and around 120 on the rest). Figure 9(c) shows
that DAM-P prefers to mold the tasks across the 8 cores of socket
1 during the event, thus minimizing inter- and intra-process
interference for high-priority K-means tasks.
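The priority assignment used for K-means, where the task holding the largest work unit is marked as high priority, can be sketched as follows. The work unit sizes and the task descriptor layout are hypothetical and only serve to illustrate the rule.

```python
def mark_priority(work_units):
    """Create one task per work unit; the largest unit gets high priority."""
    largest = max(range(len(work_units)), key=lambda i: work_units[i])
    return [
        {"task": i, "size": size, "high_priority": i == largest}
        for i, size in enumerate(work_units)
    ]


# Hypothetical loop partition sizes (number of points per task).
tasks = mark_priority([1200, 1850, 1400, 1100])
print([t["task"] for t in tasks if t["high_priority"]])  # [1]
```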
Finally, we show the performance of a distributed-memory
DAG-based 2D Heat stencil application on 4 Haswell nodes
(totaling 80 cores) in Figure 10. The interfering
matrix multiplication kernel is executed on 5 cores of a single
socket of node 0. It can be seen that the performance
of the Dynamic Asymmetry Schedulers is generally better
than that of the other schedulers. Even though communication tasks
utilize a single core at a time, which is inherent to
message passing, moldability reduces resource contention
and provides a benefit, since sharing CPU caches can have a
significant impact on MPI communication [25]. This explains
the higher throughput of DAM-C and DAM-P compared to
DA. Additionally, DAM-C achieves 76% and 17% throughput
increases compared to RWS and RWSM-C, respectively.
6 Related Work
Performance variability due to interference is highly evident
as it may arise from on-chip variations [17, 21], network ac-
tivity [18, 24], I/O traffic [12, 13], etc. To study performance
variability, Ates et al. [3] introduce a performance anomaly
generator for the major HPC subsystems that assesses the
performance resilience of applications to different variability
sources. Works that try to mitigate interference can broadly
be classified into two groups: those that deal with interfer-
ence at the cluster level and those that deal with interference
at the node level.
System noise at the cluster level has been thoroughly ana-
lyzed on large-scale architectures. For example, Hoefler et
al. use LogGPS simulations to get insights into the scaling
[Figure 9, panel (a) K-means Clustering. X-axis: Iteration Number (0 to 100). Y-axis: Time (s). Series: DAM-P, DAM-C, RWS.]
[Figure 9, panel (b) RWS. X-axis: Iteration Number (0 to 100). Y-axis: Task Count. Series (starting core, width): 0,1 through 15,1.]
[Figure 9, panel (c) DAM-P. X-axis: Iteration Number (0 to 100). Y-axis: Task Count. Series (starting core, width): 0,1; 0,4; 0,8; 2,1; 8,4; 8,8.]
Figure 9. Performance of K-means clustering on 16-core Haswell (a), and high priority resource selection during interference
on socket 0 across K-means clustering iterations (20-70) for DAM-P and RWS ((b) & (c)). The dotted lines represent the socket
0 partitions.
[Figure 10. X-axis: schedulers (RWS, RWSM-C, DA, DAM-C, DAM-P). Y-axis: Throughput [Tasks/s] (0 to 450).]
Figure 10. Performance comparison of distributed 2D Heat
using different schedulers.
of applications in noisy environments [15]. In [29], both
inter- and intra-application resource contention are classified as
sources of performance variability. On the runtime systems’
side, a real-time monitoring component that assists the run-
time scheduler in its decision-making process and adapta-
tion is developed as part of the AllScale toolchain [1]. By
collecting information across a cluster, the AllScale runtime
can tune parameters such as thread counts or DVFS. How-
ever, the collected information is too coarse to address either
asymmetry or interference by influencing task scheduling.
At the node level, the idea of using system behavior ob-
servations has been proposed to influence the design of OS
scheduling [19] to reduce the interference caused by shared
resources in chip multiprocessors. Zhuravlev et al. [31] propose a
new scheduling algorithm that mitigates the effects of shared
resource contention. This scheduler, called Distributed In-
tensity Online (DIO), collects miss rates online for all appli-
cations and schedules applications to minimize performance
impact. At the runtime level, Chronaki et al. [10] introduce
various schedulers, including CATS, based on the idea of
steering critical tasks to statically faster cores. To evaluate
CATS, the authors introduce the dynamic Heterogeneous
Earliest Finish Time (dHEFT) algorithm as a
reference. HEFT (Heterogeneous Earliest Finish Time)
is a static scheduling algorithm that assigns each task to the
processor that will finish its execution at the earliest possi-
ble time [30]. dHEFT uses the same principles as HEFT but
instead of knowing the load of tasks prior to scheduling, dis-
covers them at runtime. While these schedulers can improve
the execution time of task-DAGs in which tasks have diverse
behaviors, they have a few limitations. First, none of them
is able to avoid resource over-subscription or adapt to
dynamic asymmetry. Second, all of them are based on
the notion of only two static performance classes, i.e. big and
LITTLE. Not only are our Dynamic Asymmetry schedulers
able to model the performance of all cores without prior
assumptions on the hardware, but they can also exploit this
information to mitigate the impact of interference.
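The earliest-finish-time rule behind HEFT and dHEFT mentioned above can be illustrated with a short Python sketch. It only shows the processor selection step for a single task; the task ranking phase and the insertion-based slot search of HEFT [30] are left out, and the processor names and timings are hypothetical.

```python
def earliest_finish_processor(ready_time, available_at, exec_time):
    """Pick the processor on which the task would finish earliest.

    ready_time:   time at which the task's inputs are available
    available_at: processor -> time at which it becomes free
    exec_time:    processor -> execution time of this task on it
    """
    best_proc, best_finish = None, float("inf")
    for proc in available_at:
        start = max(ready_time, available_at[proc])
        finish = start + exec_time[proc]
        if finish < best_finish:
            best_proc, best_finish = proc, finish
    return best_proc, best_finish


# Hypothetical state: the big core is faster but busy until t = 4.0.
choice = earliest_finish_processor(
    ready_time=1.0,
    available_at={"big": 4.0, "LITTLE": 0.0},
    exec_time={"big": 1.0, "LITTLE": 2.5},
)
print(choice)  # ('LITTLE', 3.5)
```

Note that the execution times in this rule are fixed per processor class. A scheduler that refreshes them from runtime observations, as the PTT does, can follow changes in core speed that a static table cannot.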
7 Conclusion
In this paper we have explored techniques for effective sched-
uling of parallel applications in the presence of dynamic
performance asymmetry. Our findings indicate that random
work stealing schedulers are not effective because they do
not leverage information about task priority or about the
capability or state of the underlying resources. Schedulers that
have a fixed notion of platform asymmetry perform better
than random work stealing schedulers but still leave a
lot of room for improvement because they do not possess
knowledge about dynamic changes in the execution environment.
Our proposed dynamic asymmetry aware schedulers
schedule high-priority tasks around interference and enable
moldable execution of tasks, resulting in an improvement over
traditional scheduling approaches, not only during user-level
background activity but also during DVFS. The proposed
schedulers can detect and adapt to interference by using
an online performance model based on a performance trace
table. All in all, we believe that combining user-level
approaches such as the Dynamic Asymmetry Scheduler (DAS)
with system-level approaches is a significant step towards achieving
tolerance to the increasingly challenging problem of unpre-
dictable system interference.
8 Acknowledgment
The computations/data handling were enabled by resources
provided by the Swedish National Infrastructure for Comput-
ing (SNIC) at C3SE partially funded by the Swedish Research
Council through grant agreement no. 2016-07213. The re-
search leading to these results has received funding from the
European Union's Horizon 2020 Programme under the
LEGaTO Project (www.legato-project.eu), grant agreement
no. 780681.
References
[1] X. Aguilar, H. Jordan, T. Heller, A. Hirsch, T. Fahringer, and E. Laure. An
on-line performance introspection framework for task-based runtime
systems. In Computational Science – ICCS 2019, 2019.
[2] ARM. Arm big.little. https://www.arm.com/why-arm/technologies/
big-little, 2020.
[3] E. Ates, Y. Zhang, B. Aksar, J. Brandt, V. J. Leung, M. Egele, and A. K.
Coskun. Hpas: An hpc performance anomaly suite for reproducing per-
formance variations. In Proceedings of the 48th International Conference
on Parallel Processing, ICPP 2019, 2019.
[4] S. Balakrishnan, Ravi Rajwar, M. Upton, and K. Lai. The impact of
performance asymmetry in emerging multicore architectures. In 32nd
International Symposium on Computer Architecture (ISCA’05), 2005.
[5] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded compu-
tations by work stealing. Journal of the ACM, 46(5), 1999.
[6] O. A. R. Board. Openmp application program interface. version 4.5,
2015.
[7] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin,
G. Mercier, S. Thibault, and R. Namyst. hwloc: A generic frame-
work for managing hardware affinities in hpc applications. In 2010
18th Euromicro Conference on Parallel, Distributed and Network-based
Processing, 2010.
[8] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and
K. Skadron. Rodinia: A benchmark suite for heterogeneous computing.
In 2009 IEEE International Symposium on Workload Characterization
(IISWC), 2009.
[9] K. Chronaki, A. Rico, R. M. Badia, E. Ayguadé, J. Labarta, and M. Valero.
Criticality-aware dynamic task scheduling for heterogeneous archi-
tectures. In Proceedings of the 29th ACM on International Conference
on Supercomputing, ICS ’15, 2015.
[10] K. Chronaki, A. Rico, M. Casas, M. Moretó, R. M. Badia, E. Ayguadé,
J. Labarta, and M. Valero. Task scheduling techniques for asymmetric
multi-core systems. IEEE Transactions on Parallel and Distributed
Systems, 28(7), 2017.
[11] A. Duran, E. Ayguade, R. M. Badia, J. Labarta, L. Martinell, X. Martorell,
and J. Planas. Ompss: A proposal for programming heterogeneous
multi-core architectures. Parallel Processing Letters, 21(02), 2011.
[12] A. Gainaru, G. Aupy, A. Benoit, F. Cappello, Y. Robert, and M. Snir.
Scheduling the i/o of hpc applications under congestion. In 2015 IEEE
International Parallel and Distributed Processing Symposium, 2015.
[13] L. F. Góes, P. Guerra, B. Coutinho, L. Rocha, W. Meira, R. Ferreira,
D. Guedes, and W. Cirne. Anthillsched: A scheduling strategy for irregular
and iterative i/o-intensive parallel jobs. In D. Feitelson, E. Frachtenberg,
L. Rudolph, and U. Schwiegelshohn, editors, Job Scheduling
Strategies for Parallel Processing, 2005.
[14] T. Hoefler, T. Schneider, and A. Lumsdaine. Characterizing the influ-
ence of system noise on large-scale applications by simulation. In
Proceedings of the 2010 ACM/IEEE International Conference for High
Performance Computing, Networking, Storage and Analysis, SC ’10, 2010.
[15] T. Hoefler, T. Schneider, and A. Lumsdaine. Characterizing the influ-
ence of system noise on large-scale applications by simulation. In
Proceedings of the 2010 ACM/IEEE International Conference for High
Performance Computing, Networking, Storage and Analysis, SC ’10, 2010.
[16] C.-H. Hsu, C.-W. Hsieh, and C.-T. Yang. A generalized critical task
anticipation technique for dag scheduling. In H. Jin, O. F. Rana, Y. Pan,
and V. K. Prasanna, editors, Algorithms and Architectures for Parallel
Processing, 2007.
[17] Y. Inadomi, T. Patki, K. Inoue, M. Aoyagi, B. Rountree, M. Schulz,
D. Lowenthal, Y. Wada, K. Fukazawa, M. Ueda, M. Kondo, and
I. Miyoshi. Analyzing and mitigating the impact of manufacturing
variability in power-constrained supercomputing. In SC ’15: Proceed-
ings of the International Conference for High Performance Computing,
Networking, Storage and Analysis, 2015.
[18] N. Jain, A. Bhatele, X. Ni, T. Gamblin, and L. V. Kale. Partitioning
low-diameter networks to eliminate inter-job interference. In IEEE
Intl Parallel and Distributed Processing Symposium (IPDPS), 2017.
[19] R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. Using os observa-
tions to improve performance in multicore systems. IEEE Micro, 28(3),
2008.
[20] E. Le Sueur and G. Heiser. Dynamic voltage and frequency scaling: The
laws of diminishing returns. In Proceedings of the 2010 Intl Conference
on Power Aware Computing and Systems, HotPower’10, 2010.
[21] X. Liang and D. Brooks. Mitigating the impact of process variations
on processor register files and execution units. In 2006 39th Annual
IEEE/ACM Intl Symposium on Microarchitecture (MICRO’06), 2006.
[22] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones,
E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the
kepler system. Concurrency and Computation: Practice and Experience,
18(10), 2006.
[23] J. Moreira, M. Brutman, J. Castaños, T. Engelsiepen, M. Giampapa,
T. Gooding, R. Haskin, T. Inglett, D. Lieber, P. McCarthy, M. Mundy,
J. Parker, and B. Wallenfelt. Designing a highly-scalable operating
system: The blue gene/l story. In Proceedings of the 2006 ACM/IEEE
Conference on Supercomputing, SC ’06, 2006.
[24] T. Patki, J. J. Thiagarajan, A. Ayala, and T. Z. Islam. Performance
optimality or reproducibility: That is the question. In Proceedings of the
International Conference for High Performance Computing, Networking,
Storage and Analysis, SC ’19, 2019.
[25] S. Pellegrini, T. Hoefler, and T. Fahringer. On the effects of cpu caches
on mpi point-to-point communications. In 2012 IEEE International
Conference on Cluster Computing, 2012.
[26] M. Pericàs. Elastic places: An adaptive resource manager for scalable
and portable performance. ACM Trans. Archit. Code Optim., 15(2), 2018.
[27] R. Riesen, R. Brightwell, P. G. Bridges, T. Hudson, A. B. Maccabe, P. M.
Widener, and K. Ferreira. Designing and implementing lightweight ker-
nels for capability computing. Concurrency and Computation: Practice
and Experience, 21(6), 2009.
[28] A. Rohlin, H. Fahlgren, and M. Pericas. High performance scheduling of
mixed-mode dags on heterogeneous multicores. In Workshop on High
Performance Energy Efficient Embedded Systems 7th Edition (HIP3ES),
2019. arXiv:1901.05907.
[29] D. Skinner and W. Kramer. Understanding the causes of performance
variability in hpc workloads. In Proceedings of
the IEEE Workload Characterization Symposium, 2005.
[30] H. Topcuoglu, S. Hariri, and Min-You Wu. Performance-effective and
low-complexity task scheduling for heterogeneous computing. IEEE
Transactions on Parallel and Distributed Systems, 13(3), 2002.
[31] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared
resource contention in multicore processors via scheduling. In Proceedings of the
Fifteenth Edition of ASPLOS on Architectural Support for Programming
Languages and Operating Systems, ASPLOS XV, 2010.
