SimAS: A Simulation-assisted Approach for the Scheduling Algorithm
  Selection under Perturbations by Mohammed, Ali & Ciorba, Florina M.
Department of Mathematics and Computer Science
University of Basel, Switzerland
December 5, 2019
arXiv:1912.02050v1 [cs.DC] 4 Dec 2019
Contents
1 Introduction
2 Background and Related Work
3 Simulator-Assisted Scheduling Approach (SimAS)
4 Experimental Design and Setup
4.1 Applications
4.2 Loop scheduling
4.3 Simulation-assisted Algorithm Selection
4.4 Computing system
4.5 Simulation
4.6 Perturbations
5 Evaluation and Analysis
5.1 Load imbalance in PSIA and Mandelbrot
5.2 Performance of Scientific Applications under Perturbations
5.3 Discussion
6 Conclusion and Future Work
Abstract
Many scientific applications, such as N-body, Monte Carlo, and computational fluid dynamics
codes, consist of large and computationally-intensive loops. These loops contain
computationally-intensive operations, resulting in heavy loop bodies. Dynamic loop
self-scheduling (DLS) techniques are used to parallelize and to balance the load during the
execution of such applications. Load imbalance arises from variations in the loop iteration
(or tasks) execution times, caused by problem, algorithmic, or systemic characteristics. The
variations in systemic characteristics are referred to as perturbations. They can be caused by
other applications or processes that share the same resources, or by a temporary system fault or
malfunction, and include decreased delivered computational speed, reduced available network
bandwidth, and larger network latencies. DLS achieves a balanced load execution of scientific
applications on high-performance computing (HPC) systems. Therefore, the selection
of the most efficient DLS technique is critical to achieve the best application performance.
The following question motivates this work: “Given an application, an HPC system, and their
characteristics and interplay, which DLS technique will achieve improved performance under
unpredictable perturbations?” Existing studies focus only on variations in the delivered
computational speed as the source of perturbations in the system. However, perturbations in
available network bandwidth or latency are inevitable on production HPC systems. Also,
scheduling solutions based on optimization techniques, e.g., evolutionary algorithms, cannot
adapt to perturbations during execution. The alternative of using machine learning for DLS
selection requires training and learning either prior to execution or during previous time-steps in
time-stepping applications. Simulator-assisted scheduling (SimAS) is introduced as a new
control-theoretic-inspired approach to dynamically select DLS techniques that improve the
performance of applications executing on heterogeneous HPC systems under perturbations.
The present work examines the performance of seven applications on a heterogeneous system
under all the above system perturbations. SimAS is evaluated as a proof of concept using
native and simulative experiments. The performance results confirm the original hypothesis that
no single DLS technique can deliver the absolute best performance in all scenarios, whereas
the SimAS-based DLS selection resulted in improved application performance in most
experiments.
Keywords. Performance, loop scheduling, load balancing, heterogeneous computing systems,
perturbations, simulation
1 Introduction
Scientific applications are often characterized by large and computationally-intensive parallel
loops. The performance of these applications on high-performance computing (HPC) systems
may degrade due to load imbalance caused by problem, algorithmic, or systemic characteristics.
Application (problem or algorithmic) characteristics include the irregularity of the number of
computations per loop iteration due to conditional statements, whereas systemic characteristics
include variations in the delivered computational speed of processing elements (PEs) and in the
available network bandwidth or latency. Such variations are referred to as perturbations, and can also be
caused by other applications or processes that share the same resources, or a temporary system
fault or malfunction. Dynamic loop self-scheduling (DLS) is a widely-used approach for im-
proving the execution of parallel applications using self-scheduling, that is, dynamic assignment
of the loop iterations to free and requesting processing elements. A wide range of DLS
techniques exists, which can be divided into nonadaptive and adaptive techniques. The nonadaptive
DLS techniques account for the variability in loop iteration execution times due to application
characteristics through their modeling assumptions. The nonadaptive DLS techniques include
self-scheduling (SS), fixed size chunking (FSC) [1], modified fixed size chunking (mFSC) [2],
guided self-scheduling (GSS) [3], trapezoid self-scheduling (TSS) [4], factoring (FAC) [5], and
weighted factoring (WF) [6], among others. The adaptive DLS techniques account for irregular
system characteristics that are only known during execution by adapting the amount of work
assigned (chunk size) per PE request according to the application performance measured dur-
ing execution. Adaptive DLS techniques include adaptive weighted factoring (AWF) [7], its
variants batch (AWF-B), chunk (AWF-C), batch-like (AWF-D), and chunk-like (AWF-E) [8],
as well as adaptive factoring (AF) [9], among others.
An a priori selection of the most appropriate DLS technique for a given application and sys-
tem is not trivial, given the various sources of load imbalance and the different load balancing
properties of the DLS techniques. This observation raises the following question and moti-
vates the present work: “Given an application, an HPC system, and their characteristics and
interplay, which DLS technique will achieve improved performance under unpredictable pertur-
bations?” Earlier work studied the flexibility of DLS (taken as robustness to variable delivered
computational speed) and the selection of the most robust DLS using machine learning [10]
with the SimGrid (SG) [11] simulation toolkit. The selection of DLS techniques for synthetic
time-stepping scientific workloads using reinforcement learning was also studied using SG [12].
The aforementioned work focuses on one source of perturbations, namely the variation in the
delivered computing speed, in time-stepping applications to learn from previous time steps.
That approach may not be applicable to non-iterative applications. Scheduling solutions using
static optimizations, e.g., evolutionary and genetic algorithms, cannot dynamically
adapt to the perturbations encountered during execution. Modern HPC systems are often
heterogeneous production systems typically shared by many users. Therefore, perturbations
in the available network bandwidth and latency are unavoidable in such systems.
The study of the performance of scientific applications with DLS under perturbations re-
vealed that the most robust DLS technique, identified as the DLS technique that results in the
least variation of the application execution time under various perturbations, does not always
achieve the best performance in all execution scenarios [13]. Figure 1 shows the simulative
performance of PSIA (cf. Section 4.1) on 696 cores of miniHPC (cf. Section 4.4) under
perturbations (cf. Section 4.6). According to these results, GSS is the most robust DLS technique
due to the minimal variation of its performance under perturbations (Figure 1a). However, it
results in poor application performance under perturbations. Even the next most robust DLS
technique, WF, is outperformed by SS and AWF-C in certain perturbation scenarios, as can be
seen in Figure 1b. These results suggest that even if the most robust DLS technique is known a
priori, which could be challenging, the application performance degrades in different execution
scenarios due to perturbations. Therefore, a methodology for the dynamic selection of DLS
techniques is needed to achieve the highest possible performance in all execution scenarios.

[Figure 1, panel (a): Variation of PSIA performance with various perturbations per scheduling
technique; annotation: GSS is the most robust DLS. Panel (b): Application performance under
various perturbations using different scheduling techniques, with the most efficient DLS technique
marked for each scenario; annotations: GSS is not the most efficient DLS; SS and AWF-C are the
most efficient in pea-es.]
Figure 1: Simulative performance of PSIA (cf. Section 4.1) under perturbations in computing
availability, network bandwidth, and latency (cf. Section 4.6). The most robust DLS technique
(GSS in Figure 1a) delivers consistent performance under various system perturbations. However,
GSS does not achieve the best performance under all perturbations, as shown in Figure 1b. As shown
in the figure, no single technique delivers the best performance in all execution scenarios [13].
In the present work, in an effort to select the most appropriate DLS technique dynamically for
a given application and system under perturbations, the Simulation-assisted scheduling
Algorithm Selection (SimAS) approach is proposed. The performance of two scientific
applications (PSIA [14] and Mandelbrot [15]) executed in single-sweep and time-stepping modes,
and five synthetic workloads is studied on a heterogeneous HPC system using nonadaptive
and adaptive DLS techniques, in the presence of perturbations in computing speed, network
bandwidth, and network latency. The number of operations in each loop iteration of the
five synthetic workloads is assumed to follow five different probability distributions, namely:
constant, uniform, normal, exponential, and gamma probability distributions. The synthetic
workloads are used to cover a broader spectrum of application load imbalance profiles beyond
what is encountered in practice.
The present work makes the following contributions: (1) Proposes a novel simulator-assisted
scheduling (SimAS¹) approach for dynamically selecting a DLS technique during execution
based on the application characteristics and the present (monitored or predicted) state of the
computing system; (2) Extends a dynamic load balancing tool (DLB tool) from the
literature [16] for parallelizing scientific applications into DLS4LB² with four more DLS techniques,
namely SS, FSC, WF, and TSS. In addition, DLS4LB is extended to support SimAS as
the fourteenth option to select DLS techniques dynamically during execution; (3) Evaluates
the performance of two real applications (PSIA and Mandelbrot) and five synthetic workloads
using DLS techniques under perturbations via native and simulative experiments; (4) Confirms
the original hypothesis that no single DLS ensures the best performance in all execution
scenarios considered.
¹ https://github.com/unibas-dmi-hpc/SimAS
² https://github.com/unibas-dmi-hpc/DLS4LB
This work is structured as follows. Section 2 contains a brief review of the selected DLS
techniques, the SG simulation toolkit, as well as the work related to the performance of schedul-
ing scientific applications with DLS in the presence of perturbations. The proposed SimAS
approach is discussed in Section 3. The factorial design of experiments, together with details
about the DLS and SimAS implementation in DLS4LB, the HPC system characteristics,
and the perturbations injected in native and simulative experiments are presented in
Section 4. The analysis of application load imbalance and the evaluation of the performance
of the applications under perturbations are discussed in Section 5. The work concludes and
outlines potential future work in Section 6.
2 Background and Related Work
Loop scheduling. The aim of loop scheduling is to achieve a balanced load execution among
parallel PEs with minimum scheduling overhead. Loop scheduling can be divided into static
and dynamic. In static loop scheduling, the loop iterations are divided and assigned to PEs
before execution; both division and assignment remain fixed during execution. This work
considers static (block) scheduling, denoted STATIC, in which each PE is assigned a chunk of size
equal to the number of iterations N divided by the number of PEs P. STATIC incurs minimum
scheduling overhead compared to dynamic loop scheduling, but may lead to load imbalance
for non-uniformly distributed tasks and/or on perturbed systems.
In dynamic loop scheduling (DLS), free and requesting PEs are assigned, via self-scheduling,
loop iterations during execution. The DLS techniques can be categorized into nonadaptive and
adaptive techniques. The nonadaptive DLS techniques considered in this work are: SS [17],
FSC [1], mFSC [2], GSS [3], TSS [4], FAC [5], and WF [6]. While STATIC represents one
scheduling extreme, SS represents the other scheduling extreme. In SS, the size of each chunk
is one loop iteration. This yields a high load balance with potentially very large scheduling
overhead. FSC assigns loop iterations in chunks of fixed sizes, where the chunk size depends
on the standard deviation of loop iteration execution times σ as an indication of its variation
and the incurred scheduling overhead h. FSC requires this information (h and σ) to be known
before the execution to calculate the chunk size. mFSC alleviates the requirement of pre-
calculating h and σ, and calculates a fixed chunk size that results in a number of chunks
equal to that produced by FAC (described below). GSS assigns loop iterations in chunks of
decreasing sizes, where the size of a chunk is equal to the number of remaining unscheduled
loop iterations R divided by the number of PEs P . Similar to GSS, TSS assigns chunks of loop
iterations in decreasing sizes. Unlike GSS, the chunk sizes decrease linearly, to ease the chunk
calculation operation and to minimize the scheduling overhead. FAC employs a probabilistic
modeling of loop characteristics that takes into account the mean of the iteration execution times
(µ) and their standard deviation (σ) to calculate batch sizes that maximize the probability of
achieving a load balanced execution. A PE’s chunk size is equal to the batch size divided by
P. When µ and σ are unavailable, a practical implementation of FAC assigns half of the
remaining loop iterations R to a batch. WF divides a batch of iterations into unequally-sized
chunks, proportional to the relative PE speeds (called weights). The PE weights need to be
determined prior to the execution and are assumed not to change during execution. This work
considers the practical implementations of FAC and WF. All nonadaptive DLS techniques
account for variations in the iteration execution times due to application characteristics.
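The chunk-size rules described above can be illustrated with a minimal sketch. It covers STATIC, GSS, the practical FAC, and the practical WF only (SS is the trivial case of one iteration per chunk). The function names are hypothetical and are not the DLS4LB API; rounding chunk sizes up and handing leftover iterations to the heaviest PEs are assumptions made here so that the sizes are integers that sum to the total.

```python
import math

def static_chunks(n, p):
    """STATIC: split N iterations into P near-equal chunks, one per PE."""
    base, extra = divmod(n, p)
    return [base + (1 if pe < extra else 0) for pe in range(p)]

def gss_chunks(n, p):
    """GSS: every chunk is R / P, with R the remaining iterations."""
    chunks, remaining = [], n
    while remaining > 0:
        size = math.ceil(remaining / p)  # rounding up is an assumption
        chunks.append(size)
        remaining -= size
    return chunks

def fac_chunks(n, p):
    """Practical FAC: each batch is half of R, split into P equal chunks."""
    chunks, remaining = [], n
    while remaining > 0:
        size = math.ceil(remaining / (2 * p))  # batch = R / 2, chunk = batch / P
        for _ in range(p):
            if remaining == 0:
                break
            taken = min(size, remaining)
            chunks.append(taken)
            remaining -= taken
    return chunks

def wf_chunks(batch, weights):
    """WF: split one batch into chunks proportional to the PE weights."""
    total = sum(weights)
    sizes = [int(batch * w / total) for w in weights]
    leftover = batch - sum(sizes)
    # Hand leftover iterations to the heaviest PEs first (an assumption).
    by_weight = sorted(range(len(weights)), key=lambda pe: weights[pe], reverse=True)
    for pe in by_weight[:leftover]:
        sizes[pe] += 1
    return sizes
```

For N = 100 and P = 4, gss_chunks starts with a chunk of 25 iterations and decreases from there, while fac_chunks starts with a batch of four chunks of 13 iterations each.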
The adaptive DLS techniques monitor the performance of the application during execution
and adapt the chunk calculation accordingly. Adaptive DLS techniques include AWF [7],
its variants [8]: AWF-B, AWF-C, AWF-D, AWF-E, and AF [9], among others. AWF is
designed for time-stepping applications. It improves WF by adapting the relative weights of
PEs during execution by monitoring their performance in each time-step. AWF-B relieves the
time-stepping requirement in AWF, and measures the performance after each batch to update
the PE weights. AWF-C is similar to AWF-B, except that weight updates are performed upon the
completion of each chunk instead of each batch. AWF-D is similar to AWF-B, but considers
the total chunk time (equal to the sum of the iteration execution times in the chunk plus the
associated overhead of the PE to acquire the chunk) and all the bookkeeping operations to
calculate and update the PE weights, whereas AWF-B and AWF-C only consider the execution
times of the chunk iterations. AWF-E combines the two: like AWF-C, it updates the PE weights
on every chunk, and like AWF-D, it considers the total chunk time. Unlike FAC,
AF dynamically estimates the values of µ and σ during execution and updates them based on
the measured performance of the PEs on the executed loop iterations.
Loop scheduling in simulation. SimGrid [11] (SG) is a versatile event-based simulation
toolkit designed for the study of the behavior of large-scale distributed systems. It provides
ready-to-use application programming interfaces (APIs) to represent applications and
computing systems through different interfaces: MSG (SG-MSG), SimDag (SG-SD), and
SMPI (SG-SMPI). SG uses a simple and fast CPU computation model together with verified, more
complex network models [18], which render it well suited for the study of
computationally-intensive parallel and distributed scientific applications.
Various studies have used SG to evaluate the performance of applications with DLS tech-
niques in different scenarios [12, 10]. To attain high trustworthiness in the performance results
obtained with SG, the implementation of the nonadaptive DLS techniques in SG-SD has been
verified [19] by reproducing the results presented in the work that introduced factoring [5]. The
accuracy of the performance results obtained by simulative experiments against native experi-
ments has recently also been quantified [20]. The present work employs the SG-SD interface to
study the performance of scientific applications on a heterogeneous computing platform under
perturbations.
Related work. Scheduling of applications on large HPC systems involves many sources of un-
certainties, e.g., task execution times and perturbations in the computing system. Therefore,
many studies have focused on robust schedules that maintain certain performance require-
ments despite fluctuations in the behavior of the system [21]. Robust scheduling of tasks with
uncertain execution and communication times was studied [22] [23] using a multi-objective
evolutionary algorithm and using dynamic scheduling, respectively. Moreover, the flexibil-
ity of dynamic loop scheduling techniques was examined [24] in an effort to select the most
flexible technique using machine learning. However, a robust scheduling technique may not al-
ways guarantee the best performance in all possible execution scenarios and for all application
parameters (e.g. problem size and data distribution). Thus, dynamically selecting the best
performing DLS technique is of paramount importance, given the broad spectrum of available
DLS techniques, each with unique properties. Selecting the best performing DLS technique
for time-stepping applications, using reinforcement learning was introduced [12] by adapting
to the variations in the delivered computational speed during previous time-steps. In addition,
machine learning and decision trees were used to select the best performing DLS technique
dynamically from a portfolio of DLS techniques [10] and for multi-threaded applications
parallelized with OpenMP [25] or with Charm++ [26]. A knowledge-based rule was designed [] to
select a scheduling technique that improves the performance of the application. Application
and system characteristics need to be fed to the rule to select an appropriate scheduling
technique from a portfolio of scheduling techniques. However, perturbations during execution were
not considered, and the selected scheduling technique is not changed during execution.
Scheduling solutions based on optimization techniques, such as genetic and evolutionary
algorithms, cannot adapt to perturbations during execution. None of the aforementioned
efforts considered perturbations in network bandwidth or latency. This work complements
the previous efforts by studying the performance of scientific applications using nonadaptive
and adaptive DLS techniques under different perturbation scenarios (variations in delivered
computational speed, network bandwidth, and network latency) on a heterogeneous computing
system. A new approach, namely simulator-assisted scheduling (SimAS), is introduced to
dynamically select DLS techniques that improve the performance of applications on heterogeneous
systems under multiple sources of perturbations known mostly during execution.
3 Simulator-Assisted Scheduling Approach (SimAS)
SimAS is inspired by control theory, where a controller (scheduler) is used to achieve and
maintain a desired state (load balance) of the system (parallel loop execution), as illustrated
in Figures 2a and 2b. The SimAS concept is motivated by the well-known control strategy
model predictive control (MPC) [27]. The MPC controller predicts the performance of the
system with different control signals to optimize system performance. As shown in Figure 2b,
a call to SimAS is inserted inside a typical scheduling loop. SimAS leverages state-of-the-art
simulation toolkits to estimate the performance of an application in a given execution scenario.
The system monitor and estimator components read the system state during the execution and
update the computing system representation accordingly. The above steps may be repeated
several times during the execution of the loop, and the SimAS call frequency can be aligned
with the perturbation frequency or intensity.
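The control loop described above can be sketched as follows. This is an illustration only: the two closed-form predictors are toy stand-ins for a call to a real simulator, the portfolio is reduced to STATIC and SS, and all names, the per-chunk overhead value, and the idea of passing monitored PE speeds as "snapshots" are assumptions introduced here.

```python
def predict_static(remaining, speeds, overhead):
    """Toy model: equal shares, finish time set by the slowest PE."""
    share = remaining / len(speeds)
    return max(share / s for s in speeds)

def predict_ss(remaining, speeds, overhead):
    """Toy model: perfect balance plus one overhead per iteration and PE."""
    return remaining / sum(speeds) + overhead * remaining / len(speeds)

# Hypothetical two-entry portfolio; a real portfolio would hold all DLS techniques.
PORTFOLIO = {"STATIC": predict_static, "SS": predict_ss}

def simas_select(remaining, speeds, overhead=0.05, portfolio=PORTFOLIO):
    """Predict the finish time of every technique and return the fastest one."""
    predictions = {name: model(remaining, speeds, overhead)
                   for name, model in portfolio.items()}
    best = min(predictions, key=predictions.get)
    return best, predictions

def scheduling_loop(total, snapshots, step):
    """Re-select the technique each time the monitor reports new PE speeds."""
    remaining, choices = total, []
    for speeds in snapshots:  # one snapshot per SimAS call
        if remaining <= 0:
            break
        technique, _ = simas_select(remaining, speeds)
        choices.append(technique)
        remaining -= step  # iterations executed until the next SimAS call
    return choices
```

With four equally fast PEs, the toy model prefers STATIC; once one PE drops to a quarter of its speed, it switches to SS, and it switches back when the perturbation ends.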
The advantage of SimAS is that it leverages existing state-of-the-art
simulators to predict the performance dynamically during execution. The prediction accuracy
of a simulator is strongly influenced by the representation of both the applications and the
systems in simulation as well as by the subsystem models it comprises [20]. Given that the
main concern of this work is load imbalanced computationally-intensive applications with
replicated data, the influence of the memory subsystem (e.g. complex memory hierarchy) on
their performance is minimal. Therefore, application performance can accurately be predicted
via simulation. For instance, the percent error between native and simulative executions for a
given application (PSIA [14]) using the SG-SD interface was found to be between 0.95% and
2.99% [20]. The percent error is calculated as
$$\%E = \left(1 - \frac{T_{sim}}{T_{native}}\right) \times 100 \qquad (1)$$
where $T_{native}$ and $T_{sim}$ are the native and simulative performance, respectively. Moreover,
it was found that the performance simulations with SimGrid capture the performance features
of the native applications and identify the most efficient DLS technique for the PSIA and
Mandelbrot applications [28]. It is expected that the accuracy and speed of the simulators employed
by SimAS will improve as they are continuously being developed and refined. The cost of
frequent calls to SimAS can be amortized by launching parallel SimAS instances to concurrently
derive predictions for various DLS techniques. Alternatively, this cost can be entirely mitigated by
calling SimAS asynchronously, concurrently with the application execution. Upon completion,
SimAS returns the best-suited DLS technique as a recommendation to the calling application, which
can then directly use the recommended DLS technique to improve its performance.
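Deriving predictions for several DLS techniques concurrently can be sketched as below. The simulate function is a toy list-scheduling model written for this illustration, not LoopSim or SimGrid; the three-technique portfolio, the per-chunk overhead, and all names are assumptions. Python threads are used only to show the submit-and-collect pattern; a real deployment would more plausibly launch separate simulator processes.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor

def simulate(technique, iter_times, speeds, overhead):
    """Toy self-scheduling simulation; returns the predicted finish time."""
    n, p = len(iter_times), len(speeds)
    free_at = [(0.0, pe) for pe in range(p)]  # (time the PE becomes free, PE id)
    heapq.heapify(free_at)
    start = 0
    while start < n:
        if technique == "STATIC":
            size = math.ceil(n / p)
        elif technique == "SS":
            size = 1
        else:  # "GSS"
            size = math.ceil((n - start) / p)
        size = min(size, n - start)
        now, pe = heapq.heappop(free_at)  # the earliest free PE requests work
        work = sum(iter_times[start:start + size])
        heapq.heappush(free_at, (now + overhead + work / speeds[pe], pe))
        start += size
    return max(t for t, _ in free_at)

def predict_portfolio(portfolio, iter_times, speeds, overhead=0.1):
    """Run one simulation per technique concurrently and gather the results."""
    with ThreadPoolExecutor(max_workers=len(portfolio)) as pool:
        futures = {name: pool.submit(simulate, name, iter_times, speeds, overhead)
                   for name in portfolio}
        # The caller could keep executing chunks here and collect results later.
        return {name: future.result() for name, future in futures.items()}

def recommend(predictions):
    """Return the technique with the smallest predicted finish time."""
    return min(predictions, key=predictions.get)
```

For 100 unit-cost iterations on four PEs, this toy model recommends STATIC under nominal conditions and SS when one PE runs at a quarter of its speed.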
[Figure 2, panel (a): A generic control system, with a set point, controller, target system,
system model, state estimator, and system monitor. Panel (b): Proposed SimAS approach for loop
scheduling, in which the application calls the DLS4LB scheduling library, and the SimAS interface
passes the measured perturbations and the last scheduled iteration index to the LoopSim scheduling
simulator (HPC system representation, loop representation, and loop scheduling portfolio SS, FSC,
mFSC, …, AF), which returns the predicted performance.]
Figure 2: The proposed Simulation-assisted scheduling Algorithm Selection (SimAS) approach for
the selection of DLS techniques. SimAS (b) is analogous to a typical control system (a). The
components highlighted in mint color represent the SimAS additions to a typical loop scheduling
system. The DLS4LB library (cf. Section 4.2) is used for the parallel task scheduling and
execution, LoopSim (cf. Section 4.5) is used to predict the application performance with different
DLS techniques under perturbations. SimAS (cf. Section 4.3) is integrated with DLS4LB to
communicate with LoopSim and to select DLS techniques dynamically during execution.

The system monitor and estimator components can be implemented with a number of
system monitoring tools [29], such as collectl. Such tools can periodically be instantiated
to measure PE and network loads and to update the system representation in the simulator
for the next call to SimAS. The measured chunk execution times can also be used to estimate
the current PE computational speeds. The PE loads can be estimated and predicted using
an autoregressive integrated moving average (ARIMA) model [30].
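Estimating the current PE speed from measured chunk execution times can be illustrated with a short sketch. Exponential smoothing is used here as a deliberately simpler stand-in for the ARIMA model mentioned above; it is not ARIMA. The function names, the smoothing factor, and the nominal speed in the usage note are assumptions.

```python
def measured_speed(chunk_flop, chunk_seconds):
    """Delivered speed of a PE on one chunk, in FLOP per second."""
    return chunk_flop / chunk_seconds

def smoothed_speed(previous, measured, alpha=0.5):
    """Blend the previous estimate with the newest measurement."""
    return alpha * measured + (1.0 - alpha) * previous

def availability(estimated_speed, nominal_speed):
    """Fraction of the nominal PE speed that is currently delivered."""
    return estimated_speed / nominal_speed
```

As a usage example, if a PE with an assumed nominal speed of 1e9 FLOP/s needs 4 s for a chunk of 2e9 FLOP, the measured speed is 5e8 FLOP/s, the smoothed estimate is 7.5e8 FLOP/s, and the estimated availability is 0.75.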
4 Experimental Design and Setup
In this work, we employ a factorial design of experiments due to the numerous parameters and
values to explore. The design of the factorial experiments is presented in the following (cf.
Table 1), along with details of the implementation of the DLS techniques and SimAS, the computing
system under test, and the perturbations injected in native and simulative experiments.
Table 1: Design of factorial experiments

| Factors | Values | Properties |
|---|---|---|
| Applications | PSIA | [5.9 · 10^7 .. 6.6 · 10^7] FLOP per iteration |
| | Mandelbrot | [5.9 · 10^1 .. 2.6 · 10^8] FLOP per iteration |
| | PSIA TS (time-stepping) | [5.9 · 10^7 .. 6.5 · 10^7] FLOP per iteration |
| | Mandelbrot TS (time-stepping) | [5.9 · 10^1 .. 2.6 · 10^8] FLOP per iteration |
| | Constant | 2.3 · 10^8 FLOP per iteration |
| | Uniform | [10^3 .. 7 · 10^8] FLOP per iteration |
| | Normal | µ = 9.5 · 10^8 FLOP, σ = 7 · 10^7 FLOP, [6 · 10^8 .. 1.3 · 10^9] FLOP per iteration |
| | Exponential | λ = 1/3 · 10^8 FLOP, [9.48 · 10^2 .. 4.5 · 10^9] FLOP per iteration |
| | Gamma | k = 2, θ = 10^8 FLOP, [4.1 · 10^6 .. 2.7 · 10^9] FLOP per iteration |
| Problem size | All applications except those listed below | N = 400,000 iterations |
| | PSIA TS | N = 4,000 iterations per time-step × 10 time-steps |
| | Mandelbrot | N = 262,144 iterations |
| | Mandelbrot TS | N = 16,384 iterations per time-step × 10 time-steps |
| Loop scheduling | STATIC | Static |
| | SS, FSC, mFSC, GSS, TSS, FAC, WF | Nonadaptive dynamic |
| | AWF-B, -C, -D, -E, AF | Adaptive dynamic |
| Computing system | miniHPC (heterogeneous HPC cluster) | 22 Intel Broadwell nodes (22 · 20 cores), relative core weight = 0.817 |
| | | 4 Intel Xeon Phi KNL nodes (4 · 64 cores), relative core weight = 0.183 |
| | | P = 128 heterogeneous (64 Broadwell + 64 KNL) cores |
| | | P = 416 heterogeneous (352 Broadwell + 64 KNL) cores |
| Perturbations | Nominal conditions | np (no perturbations) |
| | PE availability | pea-cm (constant mild): µ = 75%, σ = 0% |
| | | pea-cs (constant severe): µ = 25%, σ = 0% |
| | | pea-em (exponential mild): µ = 78%, σ = 24 · 10^-3 % |
| | | pea-es (exponential severe): µ = 31%, σ = 89 · 10^-3 % |
| | Bandwidth | bw-cm (constant mild): µ = 1 · 10^-5 %, σ = 0% |
| | | bw-cs (constant severe): µ = 1 · 10^-7 %, σ = 0% |
| | | bw-em (exponential mild): µ = 1.1 · 10^-1 %, σ = 9 · 10^-2 % |
| | | bw-es (exponential severe): µ = 23 · 10^-2 %, σ = 19 · 10^-2 % |
| | Latency | lat-cm (constant mild): µ = 1 · 10^-5 %, σ = 0% |
| | | lat-cs (constant severe): µ = 1 · 10^-7 %, σ = 0% |
| | | lat-em (exponential mild): µ = 1.2 · 10^-5 %, σ = 1.5 · 10^-5 % |
| | | lat-es (exponential severe): µ = 2.9 · 10^-7 %, σ = 1.8 · 10^-7 % |
| | Combined | all-cm (constant mild): pea-cm, bw-cm, and lat-cm |
| | | all-cs (constant severe): pea-cs, bw-cs, and lat-cs |
| | | all-em (exponential mild): pea-em, bw-em, and lat-em |
| | | all-es (exponential severe): pea-es, bw-es, and lat-es |
| Experimentation | Native* | PSIA and Mandelbrot on 128 and 416 cores under targeted perturbations |
| | Simulative | PSIA and Mandelbrot on 128 and 416 cores under all perturbations |
| | | Synthetic applications on 128 and 416 cores under all perturbations |

* Direct experiments on real HPC systems.
4.1 Applications
This work considers two real-world applications and five synthetic workloads.
Real applications.
1. PSIA. The parallel spin-image algorithm (PSIA) [14] is a computationally-intensive
application from computer vision. PSIA is an embarrassingly parallel and algorithmically
load imbalanced application, in which the computational effort of a loop iteration depends on the input
data. The performance of PSIA has been studied in prior work [14] and was enhanced for
execution on a heterogeneous cluster by using nonadaptive DLS techniques. The total number
of parallel loop iterations in PSIA is 400,000.
2. Mandelbrot. This application computes the Mandelbrot set [15] and generates its image.
The program is based on one of the codes available online³. The application is parallelized such
that the calculation of the value at every single pixel of a 2D image is a loop iteration that
is performed in parallel. The application is modified to compute the function $f_c(z) = z^4 + c$
instead of $f_c(z) = z^2 + c$ to increase the number of computations per task. The size of the
generated image is 512 × 512 pixels, resulting in $2^{18}$ parallel loop iterations.
3. PSIA TS. This application is similar to PSIA, but it is executed in
time-steps. It simulates applying spin-image transformations to an object in motion (a video),
where at each time-step a certain number of spin-images (4,000) is created. PSIA TS is
executed for 10 time-steps.
4. Mandelbrot TS. This is the time-stepping version of the Mandelbrot application. At each
time-step, the generated Mandelbrot set image at time t is zoomed in by 5% on the center
of the image to generate the image at t + 1. Mandelbrot TS is executed for 10 time-steps.
The workload per time-step is reduced compared to Mandelbrot (single-sweep) such that the
execution time of 10 time-steps of Mandelbrot TS is comparable to the execution time of the
single-sweep version. This is desirable for the purpose of native experimentation given the
large set of experiments performed (see Table 1), to avoid extremely long execution times.
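The per-pixel loop iteration of the modified Mandelbrot application can be sketched as follows. Only the function z^4 + c and the 512 × 512 image size come from the text; the iteration limit, the escape radius, the viewing window, and the function names are assumptions made for illustration.

```python
def escape_count(c, power=4, max_iter=1000, radius=2.0):
    """Number of applications of z -> z**power + c before |z| exceeds radius."""
    z = 0j
    for count in range(max_iter):
        if abs(z) > radius:
            return count
        z = z ** power + c
    return max_iter

def pixel_to_point(index, width=512, height=512, span=2.0):
    """Map a loop iteration index to a point in [-span, span] x [-span, span]."""
    row, col = divmod(index, width)
    real = -span + 2.0 * span * col / (width - 1)
    imag = -span + 2.0 * span * row / (height - 1)
    return complex(real, imag)
```

The sketch also shows where the load imbalance comes from: a pixel such as c = 0 runs for the full iteration limit, whereas a pixel far from the origin escapes after a single step, so the cost of a loop iteration depends strongly on its index.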
Synthetic workloads.
Five synthetic workloads are examined in this work. Each of the five synthetic workloads con-
tains 400,000 parallel loop iterations. The number of floating point operations (FLOP count)
in each loop iteration is assumed to follow five different probability distributions, namely:
constant, uniform, normal, exponential, and gamma probability distributions. This assump-
tion captures the characteristics of a wide range of applications. The probability distribution
parameters used to generate these FLOP counts are also given in Table 1.
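Generating the per-iteration FLOP counts of the five synthetic workloads can be sketched with the Python standard library, using the distribution parameters listed in Table 1. Two points are assumptions: the exponential parameter is read as a rate of 1/(3 · 10^8), i.e., a mean of 3 · 10^8 FLOP, and no truncation to the ranges reported in Table 1 is applied. The function name is hypothetical.

```python
import random

def synthetic_flops(distribution, n, rng):
    """Draw n per-iteration FLOP counts from the named distribution."""
    samplers = {
        "constant": lambda: 2.3e8,
        "uniform": lambda: rng.uniform(1e3, 7e8),
        "normal": lambda: rng.gauss(9.5e8, 7e7),             # mean, std. deviation
        "exponential": lambda: rng.expovariate(1.0 / 3e8),   # rate = 1 / mean (assumed)
        "gamma": lambda: rng.gammavariate(2.0, 1e8),         # shape k, scale theta
    }
    draw = samplers[distribution]
    return [draw() for _ in range(n)]
```

A call such as synthetic_flops("gamma", 400000, random.Random(1)) would produce one FLOP count per loop iteration, which could then be written to the FLOP file read by the simulator.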
4.2 Loop scheduling
Thirteen loop scheduling techniques are used to assess the performance of the above seven ap-
plications under various execution scenarios. These techniques represent a wide range of static
and dynamic loop scheduling approaches. The dynamic loop scheduling (DLS) techniques can
further be distinguished into five adaptive and seven nonadaptive techniques.
In general, the DLS techniques can be implemented using a centralized or a decentralized
execution and control approach. The decentralized control approach was found to scale better by
eliminating a centralized master, and hence, the master-level contention [19]. The
decentralized control approach was previously implemented [31] using Intel MPI one-sided
communications. The Intel implementation uses extra threads that run in the background to handle
the one-sided communications. These threads introduce additional overhead during execution, and could
prevent the application from progressing if they could not find enough computational power
to execute. Therefore, it was found that the centralized two-sided communication
implementation of DLS is more suitable for this work.
³ https://github.com/CaptGreg/SenecaOOP345-attic/blob/master/parallel-pgm/mpi/mandelbrot-mpi-dynamic.c
The dynamic load balancing tool (DLB tool [16]) is extended in this work into the DLS4LB,
which is used to parallelize the applications with dynamic loop self-scheduling and employs
MPI two-sided communications for work distribution among processes. The DLS4LB implements a
master-worker execution model, where the master is responsible for handling work requests
from workers. In addition, the master also acts as a worker and checks for outstanding work
requests with a certain adjustable frequency. The DLS4LB is designed to parallelize an
application with minimal changes. Algorithm 1 shows, in blue font color, the lines that need
to be added to the application code to parallelize it. The DLB tool originally contained the
implementation of nine loop scheduling techniques, namely STATIC, mFSC, GSS, FAC, AWF-B,
AWF-C, AWF-D, AWF-E, and AF. In this work, the tool is extended to support four more dynamic
loop scheduling techniques, namely SS, FSC, TSS, and WF.
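To make the difference between the nonadaptive techniques more concrete, the following Python sketch computes the sequence of chunk sizes for three of them using their commonly cited textbook rules: SS hands out one iteration at a time, GSS hands out the remaining iterations divided by the number of PEs, and FAC2 (the practical variant of FAC with a fixed factor of two) hands out batches of P equal chunks that together cover half of the remaining iterations. This is a sketch of the textbook rules, not code taken from the DLS4LB; the function name and the "FAC2" label are assumptions made for the example.

```python
import math


def chunk_sizes(technique, n_iterations, n_pes):
    """Return the sequence of chunk sizes a technique hands out for one loop."""
    remaining = n_iterations
    chunks = []
    while remaining > 0:
        if technique == "SS":
            size = 1
        elif technique == "GSS":
            size = math.ceil(remaining / n_pes)
        elif technique == "FAC2":
            # One batch: each PE gets an equal share of half the remaining work.
            size = math.ceil(remaining / (2 * n_pes))
            for _ in range(n_pes):
                take = min(size, remaining)
                if take == 0:
                    break
                chunks.append(take)
                remaining -= take
            continue
        else:
            raise ValueError(f"unknown technique: {technique}")
        size = min(size, remaining)
        chunks.append(size)
        remaining -= size
    return chunks


gss = chunk_sizes("GSS", 100, 4)    # starts with a chunk of 25 iterations
fac = chunk_sizes("FAC2", 100, 4)   # starts with four chunks of 13 iterations
```

Larger early chunks reduce the number of work requests sent to the master, whereas smaller late chunks leave room to even out finishing times; SS sits at the extreme of maximal balancing potential and maximal scheduling overhead.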
4.3 Simulation-assisted Algorithm Selection
In this work, the DLS4LB is extended to support SimAS as its fourteenth scheduling option.
Following the DLS4LB approach of minimal application code changes, an application can use
SimAS by inserting only two function calls, shown in green font color in Algorithm 1.
The SimAS_setup function sets up the main data structure SimAS_info that holds important
information, such as the number of PEs, the number of loop iterations, the path to the
simulator, the FLOP file that contains the FLOP count per loop iteration, and the platform
file that describes the computing system. In addition, SimAS_setup immediately and
asynchronously starts the simulation of the application performance with a portfolio of DLS
techniques in parallel. SimAS_setup also sets the scheduling technique to a default DLS
(AWF-B in this work), so that the application can start without waiting for the simulation
results.
The SimAS_update function checks (every 5 seconds in this work) whether the simulation has
finished. If so, it selects the DLS technique that allows the application to finish the
largest number of tasks in the shortest time; otherwise, it keeps the selected DLS technique
unchanged. SimAS_update reruns the simulation if 50 seconds (the SimAS calling frequency)
have passed since the simulator was previously called. It does not start a new instance of
the simulator while an earlier instance is still running, or when the number of remaining
unscheduled iterations is less than or equal to the number of PEs.
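The timing rules of the update step can be summarized in a short Python sketch. It only models the decision logic described above (5-second polling, 50-second relaunch interval, no relaunch while a simulation is running or when few iterations remain). The class name, the explicit `now` argument, and the way a finished simulation reports its result are assumptions made for this example and do not reflect the DLS4LB interface.

```python
class SimASUpdateSketch:
    CHECK_INTERVAL = 5.0    # seconds between polls of the simulator (from the text)
    RERUN_INTERVAL = 50.0   # seconds between simulator launches (from the text)

    def __init__(self, n_pes, default_dls="AWF-B"):
        self.n_pes = n_pes
        self.selected = default_dls   # default DLS until a simulation finishes
        self.sim_running = True       # the setup step launches the first simulation
        self.last_launch = 0.0
        self.last_check = 0.0

    def update(self, now, remaining_iterations, finished_result=None):
        """Return the DLS to use for the next chunk.

        finished_result is the DLS recommended by a completed simulation,
        or None while the simulation is still running (hypothetical interface).
        """
        if now - self.last_check < self.CHECK_INTERVAL:
            return self.selected
        self.last_check = now
        if self.sim_running and finished_result is not None:
            self.selected = finished_result
            self.sim_running = False
        may_launch = (
            not self.sim_running
            and remaining_iterations > self.n_pes
            and now - self.last_launch >= self.RERUN_INTERVAL
        )
        if may_launch:
            self.sim_running = True
            self.last_launch = now
        return self.selected
```

Passing the current time explicitly keeps the sketch deterministic; a real implementation would read a wall-clock timer and query the simulator process instead.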
4.4 Computing system
The native experiments were conducted on miniHPC 4, a research and teaching cluster at
the Department of Mathematics and Computer Science at the University of Basel, Switzer-
land. It consists of 26 compute nodes: 22 nodes each with one dual socket Intel Xeon E5-
2640 v4 (20 cores) configuration and 4 nodes each with one Intel Xeon Phi Knights Landing
7210 processor (64 cores). All nodes are interconnected with Intel Omni-Path fabrics in a
nonblocking two-level fat-tree topology.
4miniHPC is a fully controlled non-production HPC cluster at the Department of Mathematics and Computer
Science at the University of Basel, Switzerland.
Algorithm 1: Dynamic load balancing with SimAS support using the extended DLB tool
Data: SimAS_info, DLS_info, h, σ, N, P
#include <mpi.h>
#include "DLB_SimAS"
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
// first SimAS call: set up SimAS_info and start the simulations
scheduling_method = SimAS_setup(SimAS_info, P, N, h, sigma, sim_path, FLOP_file,
                                platform_file);
DLS_setup(MPI_COMM_WORLD, DLS_info);
DLS_startLoop(DLS_info, N, scheduling_method);
while not DLS_terminated(DLS_info) do
    // second SimAS call: check the simulations and update the selected DLS
    SimAS_update(DLS_info, SimAS_info);
    DLS_startChunk(DLS_info, Cstart, Csize);
    Compute_iterations(Cstart, Csize);
    DLS_endChunk(DLS_info);
end
DLS_endLoop(DLS_info);
4.5 Simulation
Applications.
LoopSim5, an SG-SD-based simulator, is used to simulate the applications of interest,
where the loop iterations in the application code are represented as tasks [20]. To represent
the computational effort associated with an application’s loop iterations, the number of floating
point operations (FLOP) of each loop iteration is counted using PAPI counters [32]. The FLOP
count per iteration is then read by LoopSim during execution to simulate the computation per
iteration. All DLS techniques supported by the DLS4LB are also implemented in LoopSim
and tasks are assigned to free and requesting simulated cores, similar to the native execution.
The pseudocode of LoopSim is presented in Listing 1. LoopSim reads in the number of
iterations (tasks), the start task ID, the path to the file that contains the FLOP count per
loop iteration, the path to the computing system representation (see below), the selected
scheduling technique, and the maximum simulated time. The simulator then simulates the loop
execution using the selected DLS technique and outputs the simulated time and the number of
tasks executed in this simulated time. This information is read by SimAS, which compares
the DLS techniques and selects the one that results in the shortest execution time and the
largest number of finished tasks.
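The comparison step can be sketched in Python as follows. Because each simulation is bounded by a maximum simulated time, one plausible way to combine "shortest time" and "largest number of finished tasks" into a single criterion is the simulated throughput (finished tasks per simulated second); this choice, the function names, the assumed two-line output format, and the numbers in the example are assumptions made for illustration and are not taken from the SimAS implementation.

```python
def parse_simulator_output(text):
    """Extract (simulated_time, finished_tasks) from the two printed lines."""
    values = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            values[key.strip()] = float(value)
    return values["simulated time"], int(values["finished tasks"])


def select_dls(results):
    """Pick the DLS with the highest simulated throughput (tasks per second)."""
    def throughput(name):
        sim_time, finished = results[name]
        return finished / sim_time if sim_time > 0 else 0.0
    return max(results, key=throughput)


# Made-up simulator outputs for three techniques, for illustration only.
outputs = {
    "SS": "simulated time: 50.0\nfinished tasks: 9000",
    "FSC": "simulated time: 50.0\nfinished tasks: 12000",
    "AWF-B": "simulated time: 42.0\nfinished tasks: 12000",
}
results = {name: parse_simulator_output(text) for name, text in outputs.items()}
best = select_dls(results)
```

With these made-up numbers, FSC and AWF-B finish the same number of tasks, and AWF-B is preferred because it does so in less simulated time.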
Listing 1: SG-SD loop simulator
#include <simdag.h>
#include <DLS_scheduling.h>
// read input
read_input(num_tasks, FLOP_file, start_task_ID,
           platform_file, DLS_t, max_sim_t);
// create tasks that represent loop iterations
Task_array = create_tasks(num_tasks, FLOP_file);
scheduled_tasks = 0;
executed_tasks = 0;
while ((executed_tasks < num_tasks) && (get_sim_time() < max_sim_t))
{
    idle_processes = get_idle_processes();
    foreach (idle_process in idle_processes)
    {
        // read and update finished tasks
        executed_tasks += get_finished_tasks(idle_process);
        // send work request to master
        send_work_request(idle_process, master);
        chunk = calculate_chunk(Task_array, num_tasks,
                                scheduled_tasks, DLS_t);
        // assign work to worker
        send_work(master, idle_process);
        scheduled_tasks += chunk;
    }
    // resume simulation until a task is finished, i.e., a process is idle
    simulate_execution(platform_file);
}
print("simulated time: " + get_sim_time());
print("finished tasks: " + executed_tasks);
5https://github.com/unibas-dmi-hpc/LoopSim
Computing system. A computing system is represented in SG via an XML file denoted as the
platform file. SG registers each processor core described in the platform file as a host.
The computational speed of a processor core is estimated by measuring the execution time of
a loop and dividing the total number of floating point operations included in the loop by
this time [20]. A Xeon core was found to be four times faster than a Xeon Phi core, as
indicated by the relative core weights (cf. Table 1). The network bandwidth and latency
represented in the platform file are calibrated with the SG calibration procedure6.
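Expressed in FLOP per second, the core speed estimate and the resulting relative core weight amount to the following small calculation. The FLOP count and the two execution times are invented for this example and were merely chosen to reproduce the reported ratio of four; the function name is likewise an assumption.

```python
def core_speed(total_flop, loop_time_s):
    """Estimated core speed in FLOP/s from one measured loop execution."""
    return total_flop / loop_time_s


# Hypothetical measurements of the same loop on the two core types.
xeon_speed = core_speed(8.0e9, 2.0)   # 4.0e9 FLOP/s
phi_speed = core_speed(8.0e9, 8.0)    # 1.0e9 FLOP/s

# Relative weight of a Xeon core with respect to a Xeon Phi core.
relative_weight = xeon_speed / phi_speed
```

The estimated speeds are the values that would be entered per host in the platform file, while the relative weight expresses how much faster one core type processes the same loop than the other.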
4.6 Perturbations
Three different categories of perturbations are considered in this work, namely perturbations
in the delivered computational speed, in the available network bandwidth, and in the network
latency. Two intensities, mild and severe, are considered for each category. Two scenarios
are considered for each intensity, where the perturbed value is either constant or
exponentially distributed.
All perturbations (cf. Table 1) are considered to occur periodically, with a period of
100 seconds, and affect the system only during 50% of the perturbation period. The network
(bandwidth and latency) perturbations commence with the application execution, while the
delivered computational speed perturbations begin 50 seconds after the start of the
application. Another perturbation scenario is created by combining all perturbations from
the other individual categories.
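The periodic pattern can be written down as a small predicate. The sketch below assumes that the active 50% falls at the beginning of each 100-second period, which the text does not specify; the function name and parameter names are also assumptions made for the example.

```python
def perturbation_active(t, start=0.0, period=100.0, active_fraction=0.5):
    """Return True if a periodic perturbation is active at time t (seconds).

    Assumes the perturbation is active during the first part of each period.
    """
    if t < start:
        return False
    phase = (t - start) % period
    return phase < active_fraction * period


# Network perturbations start with the application (start=0.0), whereas the
# delivered computational speed perturbations start 50 seconds later.
network_at_120 = perturbation_active(120.0)             # second period, active
speed_at_120 = perturbation_active(120.0, start=50.0)   # phase 70 s, inactive
```

Under this assumption, the network and computational speed perturbations alternate rather than coincide, because their periods are shifted by exactly half a period.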
Perturbations in simulative experiments. All perturbations are enacted in SG during
simulation via the availability, bandwidth, latency, and platform files to represent pertur-
bations in delivered computational speed, network available bandwidth, and network latency,
respectively.
Perturbations in native experiments. A program (CPU burner) is launched in parallel
and pinned on the same processor cores as the application to induce perturbations on the PE
availability in native execution. The program is executed periodically every 100 seconds and
is only active during a fraction of this period that corresponds to the required PE availability
perturbation (75% or 25%).
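A CPU burner of this kind can be approximated by a loop that spins for part of each period and sleeps for the rest. The Python sketch below is only meant to show the duty-cycle idea: the function name and its parameters are assumptions, the mapping from a PE availability value to the active fraction is left to the caller, and pinning the process to the application's cores (done in the native experiments) is not shown.

```python
import time


def cpu_burner(period_s, active_fraction, n_periods):
    """Busy-wait for active_fraction of every period, then sleep for the rest.

    Returns the total time spent spinning, in seconds.
    """
    busy_total = 0.0
    for _ in range(n_periods):
        period_start = time.perf_counter()
        busy_until = period_start + active_fraction * period_s
        while time.perf_counter() < busy_until:
            pass  # spin to keep the core busy
        busy_total += time.perf_counter() - period_start
        idle = period_start + period_s - time.perf_counter()
        if idle > 0:
            time.sleep(idle)
    return busy_total
```

For example, `cpu_burner(100.0, 0.5, 10)` would keep one core busy during half of each of ten 100-second periods.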
To inject perturbations in the link latency, the MPI communication functions are intercepted
using the MPI profiling interface (PMPI), and certain delays are inserted to simulate longer
communication latencies. Given that the applications of interest are computationally
intensive and the size of the data communicated between the application's processes is
minimal, perturbations in the network bandwidth do not have a significant effect on the
application performance, as can be seen from the simulative experiments below. Therefore,
perturbations in the network bandwidth are excluded from native experimentation.
A combined perturbations scenario is created for the native execution by combining PE
availability perturbations and network latency perturbations. As both perturbation distri-
butions (constant and exponential) have a comparable effect on the performance, where the
impact of constantly distributed perturbations is more evident, only the constant distribution
of perturbations is considered in the native experiments.
5 Evaluation and Analysis
An analysis of the load imbalance of the real applications considered in this work is presented
in this section. The performance results of the execution of the applications with different loop
scheduling techniques under different execution scenarios are illustrated and discussed.
6http://simgrid.gforge.inria.fr/contrib/smpi-calibration-doc/
[Figure 3: four panels, each plotted per loop scheduling technique (STATIC, SS, FSC, mFSC,
GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF).
(a) PSIA performance on 416 cores: execution time (s), T_loop^par.
(b) Load imbalance of PSIA on 416 cores: c.o.v. and mean/max of process finishing times.
(c) Mandelbrot performance on 416 cores: execution time (s), T_loop^par.
(d) Load imbalance of Mandelbrot on 416 cores: c.o.v. and mean/max of process finishing times.]
Figure 3: Execution load imbalance of the native execution of PSIA and Mandelbrot on 416
heterogeneous cores of the miniHPC system under no perturbations. The parallel loop execution
time T_loop^par and the median of the load imbalance metrics over five executions are reported.
5.1 Load imbalance in PSIA and Mandelbrot
Both applications, PSIA and Mandelbrot, suffer from load imbalance that stems from the
variation in the number of computational operations per loop iteration. The number of
computations varies in both applications due to a conditional statement in their code that
can increase or decrease the number of computations per loop iteration based on the input
data. As a measure of the variation of the loop iteration execution times, the standard
deviation σ of the loop iteration execution times is derived for both applications by means
of their sequential execution on a single processor core (to avoid any parallelization
overheads). The median of 20 measurements of σ for PSIA was found to be 0.00327, whereas it
was one order of magnitude higher for Mandelbrot, namely 0.06056.
[Figure 4: four panels, each plotted per loop scheduling technique (STATIC, SS, FSC, mFSC,
GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF).
(a) PSIA performance on 128 cores: execution time (s), T_loop^par.
(b) Load imbalance of PSIA on 128 cores: c.o.v. and mean/max of process finishing times.
(c) Mandelbrot performance on 128 cores: execution time (s), T_loop^par.
(d) Load imbalance of Mandelbrot on 128 cores: c.o.v. and mean/max of process finishing times.]
Figure 4: Execution load imbalance of the native execution of PSIA and Mandelbrot on 128
heterogeneous cores of the miniHPC system under no perturbations. The parallel loop execution
time T_loop^par and the median of the load imbalance metrics over five executions are reported.
Two metrics are considered to measure the load imbalance of the parallel execution of
the applications on the miniHPC, namely the coefficient of variation (c.o.v.) of the parallel
processes finishing times [5] and the ratio of the mean process finishing time to the maximum
process finishing time (mean/max). The c.o.v. is calculated as the ratio of the standard
deviation of the process finishing times to their mean value. A severe execution load
imbalance corresponds to a high value of c.o.v. and a low value of mean/max. Figure 3b and
Figure 3d show the median of the two load imbalance metrics over five executions. Both metrics
indicate high load imbalance for the applications executed with STATIC, FSC, mFSC, GSS, TSS,
and FAC, which correspond to longer parallel loop execution times in Figure 3a and Figure 3c.
A similar performance can also be observed in Figure 4 for PSIA and Mandelbrot executed on
128 heterogeneous cores of the miniHPC system. An inspection of the Mandelbrot execution
times with AF, for both system sizes, reveals that the high variation in the execution time
with 416 cores is due to the small number of loop iterations per PE. The small number of loop
iterations per PE and the high variation of the loop iteration execution times of Mandelbrot
did not offer AF sufficient opportunity to accurately learn the PE weights. In the execution
of Mandelbrot on 128 cores, the number of loop iterations per core is higher than on 416
cores, as the problem size is fixed. This allowed the improved and more stable performance of
Mandelbrot with AF on 128 cores.
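Both load imbalance metrics follow directly from the list of process finishing times, as the following Python sketch shows. The use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not state which one is used, and the finishing times in the example are invented.

```python
import statistics


def load_imbalance(finish_times):
    """Return (c.o.v., mean/max) of the parallel process finishing times."""
    mean = statistics.fmean(finish_times)
    cov = statistics.pstdev(finish_times) / mean
    mean_over_max = mean / max(finish_times)
    return cov, mean_over_max


# Invented finishing times (seconds) of four processes.
balanced = load_imbalance([10.0, 10.0, 10.0, 10.0])    # (0.0, 1.0)
imbalanced = load_imbalance([2.0, 2.0, 2.0, 10.0])     # high c.o.v., mean/max of 0.4
```

In the imbalanced case, three processes idle for most of the run while waiting for the slowest one, which shows up as a c.o.v. far above zero and a mean/max far below one.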
5.2 Performance of Scientific Applications under Perturbations
Simulative experiments. The simulative performance results of the two real applications,
PSIA and Mandelbrot, under perturbations are shown in Figures 5-8. One can note that
STATIC, GSS, TSS, and FAC perform poorly on heterogeneous systems. Also, WF cannot
accommodate the variability in the system due to perturbations, especially perturbations
in the delivered computational speed of the PEs. The performance of FSC and mFSC is, in
general, higher than that of STATIC, GSS, TSS, FAC, and WF. However, FSC and mFSC are
highly affected by the perturbations in the PE availability. SS is resilient to perturbations
in the delivered computational speed of the PEs. However, it is significantly influenced by
the network latency variations, as can be seen in the lat-cs and lat-es scenarios in
Figures 5-8.
Perturbations in the network bandwidth show a minimal influence on performance, as the
PEs only communicate loop iteration indices to calculate the start index of the next chunk.
Therefore, the communicated messages are small. The bandwidth perturbations are, thus, not
selected for subsequent more targeted native experiments under perturbations.
The adaptive techniques perform comparably except for AWF-E in Mandelbrot on 128
cores (Figure 7a), with a slight advantage for AWF-B as can be seen in Figure 8a all-cs
and all-es. However, in certain cases, other techniques outperform the adaptive techniques.
Specifically, SS outperforms AWF-B in Figure 7a and Figure 8a pea-cs and pea-es.
These results suggest that no single DLS outperforms all other techniques in all execution
scenarios. Therefore, the best strategy is to dynamically select a DLS based on the current
application and system states. SimAS is called every 50 seconds, when there is a work
request, to select the best performing DLS. The DLS techniques with poor performance on
heterogeneous systems, i.e., GSS, TSS, and FAC, are excluded from the DLS portfolio provided
to SimAS to speed up the simulation. A closer analysis of the SimAS-based results reveals
that SimAS resulted in the shortest execution time in most execution scenarios, especially
for Mandelbrot, as shown in Figure 8a lat-cs and lat-es, and for PSIA pea-cm in Figure 5a
and Figure 6a. In other cases, the application performance with SimAS was slightly poorer
than the best execution time achieved by other DLS techniques. This is because loop
scheduling is, by definition, non-preemptive: the execution of already scheduled loop
iterations cannot be preempted and resumed with the newly selected (and expected to be more
suitable) DLS. Inspecting the simulation results of the synthetic workloads in Figures 9-18,
one can see that the observations made for the real applications are confirmed by the results
of the synthetic workloads.
Native experiments.
[Figure 5(a): heat map with one cell per loop scheduling technique (STATIC, SS, FSC, mFSC,
GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS) and execution scenario (np, pea-cm,
pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm,
all-cs, all-em, all-es). Color bar: Percent improvement (%), 40 to 180.]
(a) PSIA simulative performance on 128 cores
STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 44.4% 0.0% 22.2% 0.0% 0.0% 0.0% 11.1% 22.2% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 25.0% 0.0% 25.0% 0.0% 0.0% 0.0% 0.0% 25.0% 12.5% 0.0% 12.5% 0.0%
all-cs 0.0% 14.3% 7.1% 14.3% 0.0% 0.0% 0.0% 42.9% 21.4% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 37.5% 0.0% 12.5% 0.0% 0.0% 0.0% 12.5% 37.5% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 0.0% 0.0% 28.6% 0.0% 0.0% 0.0% 28.6% 28.6% 0.0% 0.0% 0.0% 14.3%
lat-em 0.0% 0.0% 12.5% 25.0% 0.0% 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 12.5%
lat-cs 0.0% 0.0% 0.0% 28.6% 0.0% 0.0% 0.0% 28.6% 28.6% 14.3% 0.0% 0.0% 0.0%
lat-cm 0.0% 0.0% 0.0% 25.0% 0.0% 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 25.0%
bw-es 0.0% 0.0% 0.0% 25.0% 0.0% 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 25.0%
bw-em 0.0% 0.0% 0.0% 25.0% 0.0% 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 25.0%
bw-cs 0.0% 0.0% 10.0% 40.0% 0.0% 0.0% 0.0% 30.0% 10.0% 10.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 0.0% 20.0% 20.0% 0.0% 0.0% 0.0% 40.0% 20.0% 0.0% 0.0% 0.0% 0.0%
pea-es 0.0% 55.6% 0.0% 11.1% 0.0% 0.0% 0.0% 11.1% 22.2% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 12.5% 0.0% 37.5% 0.0% 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 44.4% 0.0% 33.3% 0.0% 0.0% 0.0% 11.1% 11.1% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 12.5% 0.0% 37.5% 0.0% 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 0.0%
np 0.0% 0.0% 0.0% 25.0% 0.0% 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 25.0%
(b) Percentage of counts DLS techniques are selected by SimAS
Figure 5: Simulative performance results of PSIA without (denoted with np) and with (the rest)
perturbations using SimAS and thirteen other loop scheduling techniques on 128 cores of miniHPC.
Percent performance improvement normalized to STATIC in the np scenario (baseline case without
any perturbations and baseline load balancing method). White, red, and blue denote baseline
(= 100%), degraded (> 100%), and improved performance (< 100%), respectively. The table shows
the DLS techniques dynamically selected by SimAS during execution.
[Figure 6(a): heat map with one cell per loop scheduling technique (STATIC, SS, FSC, mFSC,
GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS) and execution scenario (np, pea-cm,
pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm,
all-cs, all-em, all-es). Color bar: Percent improvement (%), 40 to 180.]
(a) PSIA simulative performance on 416 cores
STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 33.3% 33.3% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0%
all-cs 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 0.0% 33.3% 33.3% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 50.0%
lat-cs 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0%
lat-cm 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-es 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 50.0%
bw-em 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 50.0%
bw-cs 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0%
bw-cm 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0%
pea-es 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0%
pea-cs 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0%
np 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0%
(b) Percentage of counts DLS techniques are selected by SimAS
Figure 6: Simulative performance results of PSIA without (denoted with np) and with (the rest)
perturbations using SimAS and thirteen other loop scheduling techniques on 416 cores of miniHPC.
Percent performance improvement normalized to STATIC in the np scenario (baseline case without
any perturbations and baseline load balancing method). White, red, and blue denote baseline
(= 100%), degraded (> 100%), and improved performance (< 100%), respectively. The table shows
the DLS techniques dynamically selected by SimAS during execution.
[Figure 7(a): heat map with one cell per loop scheduling technique (STATIC, SS, FSC, mFSC,
GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS) and execution scenario (np, pea-cm,
pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm,
all-cs, all-em, all-es). Color bar: Percent improvement (%), 40 to 180.]
(a) Mandelbrot simulative performance on 128 cores
STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 11.1% 55.6% 0.0% 0.0% 0.0% 0.0% 11.1% 22.2% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 12.5% 37.5% 37.5% 0.0% 0.0% 0.0% 12.5% 0.0% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 10.0% 40.0% 20.0% 0.0% 0.0% 0.0% 20.0% 10.0% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 25.0% 25.0% 12.5% 0.0% 0.0% 0.0% 12.5% 25.0% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 14.3% 28.6% 14.3% 0.0% 0.0% 0.0% 14.3% 28.6% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 28.6% 42.9% 0.0% 0.0% 0.0% 0.0% 0.0% 28.6% 0.0% 0.0% 0.0% 0.0%
lat-cs 0.0% 12.5% 25.0% 25.0% 0.0% 0.0% 0.0% 25.0% 12.5% 0.0% 0.0% 0.0% 0.0%
lat-cm 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 25.0% 12.5% 0.0% 0.0% 12.5% 0.0%
bw-es 0.0% 14.3% 42.9% 14.3% 0.0% 0.0% 0.0% 14.3% 14.3% 0.0% 0.0% 0.0% 0.0%
bw-em 0.0% 14.3% 42.9% 14.3% 0.0% 0.0% 0.0% 14.3% 14.3% 0.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 14.3% 42.9% 28.6% 0.0% 0.0% 0.0% 14.3% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 14.3% 42.9% 14.3% 0.0% 0.0% 0.0% 14.3% 14.3% 0.0% 0.0% 0.0% 0.0%
pea-es 0.0% 22.2% 33.3% 33.3% 0.0% 0.0% 0.0% 11.1% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 25.0% 25.0% 25.0% 0.0% 0.0% 0.0% 0.0% 25.0% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 22.2% 44.4% 11.1% 0.0% 0.0% 0.0% 11.1% 11.1% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 11.1% 22.2% 0.0% 0.0% 0.0% 0.0% 33.3% 22.2% 0.0% 11.1% 0.0% 0.0%
np 0.0% 14.3% 42.9% 14.3% 0.0% 0.0% 0.0% 14.3% 14.3% 0.0% 0.0% 0.0% 0.0%
(b) Percentage of counts DLS techniques are selected by SimAS
Figure 7: Simulative performance results of Mandelbrot without (denoted with np) and with
(the rest) perturbations using SimAS and thirteen other loop scheduling techniques on 128 cores
of miniHPC. Percent performance improvement normalized to STATIC in the np scenario (baseline
case without any perturbations and baseline load balancing method). White, red, and blue denote
baseline (= 100%), degraded (> 100%), and improved performance (< 100%), respectively. The
table shows the DLS techniques dynamically selected by SimAS during execution.
[Figure 8(a): heat map with one cell per loop scheduling technique (STATIC, SS, FSC, mFSC,
GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS) and execution scenario (np, pea-cm,
pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm,
all-cs, all-em, all-es). Color bar: Percent improvement (%), 40 to 180.]
(a) Mandelbrot simulative performance on 416 cores
STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 50.0% 0.0% 0.0% 0.0%
all-em 0.0% 50.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 0.0% 33.3% 33.3% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 50.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
lat-cs 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 75.0% 25.0% 0.0% 0.0% 0.0% 0.0%
lat-cm 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0%
bw-es 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0%
bw-em 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 50.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0%
pea-es 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0%
pea-em 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 50.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0%
np 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 8: Simulative performance results of Mandelbrot without perturbations (denoted np) and
with perturbations (all other scenarios), using SimAS and thirteen other loop scheduling techniques
on 416 cores of miniHPC. The percent performance improvement is normalized to STATIC in the np
scenario (the baseline case, without perturbations and with the baseline load balancing method).
White, red, and blue denote baseline (= 100%), degraded (> 100%), and improved (< 100%)
performance, respectively. The table shows how often each DLS technique is dynamically selected
by SimAS during execution.
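The normalization and color coding described in the caption can be illustrated with a short sketch. It assumes that each heatmap cell is a time-like metric expressed as a percentage of STATIC in the np scenario, so that values below 100 indicate improvement. The function names and the sample timings are hypothetical and are not taken from the experiments.

```python
def normalize_to_baseline(value, baseline):
    """Express a measurement as a percentage of the baseline
    (assumed here to be STATIC in the np scenario)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * value / baseline


def classify(percent, tol=1e-9):
    """Map a normalized value to the color coding used in the heatmaps."""
    if abs(percent - 100.0) <= tol:
        return "baseline"  # white
    # Above 100% is degraded (red), below 100% is improved (blue).
    return "degraded" if percent > 100.0 else "improved"


# Hypothetical timings in seconds, for illustration only.
baseline_time = 200.0  # STATIC, np scenario
sample_time = 90.0     # some technique in some scenario
pct = normalize_to_baseline(sample_time, baseline_time)
print(round(pct, 1), classify(pct))
```

With these made-up inputs, the sample is reported as 45.0% of the baseline and is therefore classified as improved.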
[Heatmap: percent improvement (%) for each execution scenario (np, pea-cm, pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es) and each technique (STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS); color scale from 40 to 180.]
(a) Constant workload simulative performance on 128 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 36.4% 54.5% 3.0% 0.0% 0.0% 0.0% 3.0% 0.0% 3.0% 0.0% 0.0% 0.0%
all-em 0.0% 46.4% 39.3% 3.6% 0.0% 0.0% 0.0% 10.7% 0.0% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 28.8% 57.5% 2.7% 0.0% 0.0% 0.0% 9.6% 1.4% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 58.6% 24.1% 6.9% 0.0% 0.0% 0.0% 10.3% 0.0% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 42.9% 42.9% 7.1% 0.0% 0.0% 0.0% 7.1% 0.0% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 38.5% 38.5% 11.5% 0.0% 0.0% 0.0% 7.7% 0.0% 0.0% 0.0% 0.0% 3.8%
lat-cs 0.0% 53.7% 31.7% 4.9% 0.0% 0.0% 0.0% 7.3% 0.0% 0.0% 0.0% 0.0% 2.4%
lat-cm 0.0% 46.2% 30.8% 7.7% 0.0% 0.0% 0.0% 11.5% 0.0% 0.0% 0.0% 3.8% 0.0%
bw-es 0.0% 0.0% 80.0% 4.0% 0.0% 0.0% 0.0% 12.0% 4.0% 0.0% 0.0% 0.0% 0.0%
bw-em 0.0% 0.0% 80.0% 4.0% 0.0% 0.0% 0.0% 12.0% 4.0% 0.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 39.3% 42.9% 7.1% 0.0% 0.0% 0.0% 10.7% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 4.0% 80.0% 8.0% 0.0% 0.0% 0.0% 4.0% 0.0% 0.0% 0.0% 4.0% 0.0%
pea-es 0.0% 29.0% 51.6% 0.0% 0.0% 0.0% 0.0% 12.9% 3.2% 3.2% 0.0% 0.0% 0.0%
pea-em 0.0% 77.8% 11.1% 3.7% 0.0% 0.0% 0.0% 7.4% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 35.0% 40.0% 0.0% 0.0% 0.0% 0.0% 20.0% 2.5% 2.5% 0.0% 0.0% 0.0%
pea-cm 0.0% 72.4% 6.9% 0.0% 0.0% 0.0% 0.0% 10.3% 3.4% 6.9% 0.0% 0.0% 0.0%
np 0.0% 0.0% 80.0% 4.0% 0.0% 0.0% 0.0% 12.0% 4.0% 0.0% 0.0% 0.0% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 9: Simulative performance results of the Constant synthetic workload without perturbations
(denoted np) and with perturbations (all other scenarios), using SimAS and thirteen other loop
scheduling techniques on 128 cores of miniHPC. The percent performance improvement is normalized
to STATIC in the np scenario (the baseline case, without perturbations and with the baseline load
balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%), and improved
(< 100%) performance, respectively. The table shows how often each DLS technique is dynamically
selected by SimAS during execution.
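The per-scenario rows of the selection tables can be understood as selection counts divided by the total number of selections made during one execution. The following is a minimal sketch of that calculation, assuming SimAS records one selected technique per selection point. The `selection_percentages` helper and the sample log are hypothetical and do not reproduce any experiment from this work.

```python
from collections import Counter

# Column order of the selection tables.
TECHNIQUES = ["STATIC", "SS", "FSC", "mFSC", "GSS", "TSS", "FAC",
              "WF", "AWF-B", "AWF-C", "AWF-D", "AWF-E", "AF"]


def selection_percentages(selections):
    """Return, per technique, the share of selection points (in %)
    at which that technique was chosen, rounded to one decimal."""
    total = len(selections)
    if total == 0:
        raise ValueError("at least one selection is required")
    counts = Counter(selections)  # missing techniques count as 0
    return {t: round(100.0 * counts[t] / total, 1) for t in TECHNIQUES}


# Hypothetical log of techniques chosen during one execution.
log = ["SS", "FSC", "FSC", "WF"]
row = selection_percentages(log)
print(" ".join(f"{row[t]}%" for t in TECHNIQUES))
```

A row produced this way sums to 100% up to rounding, which matches the layout of the tables, where techniques that are never selected appear as 0.0%.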
[Heatmap: percent improvement (%) for each execution scenario (np, pea-cm, pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es) and each technique (STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS); color scale from 40 to 180.]
(a) Constant workload simulative performance on 416 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 25.0% 12.5% 12.5% 0.0% 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 0.0% 33.3% 33.3% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 12.5% 12.5% 25.0% 0.0% 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 0.0% 33.3% 33.3% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 16.7% 0.0% 16.7% 0.0% 0.0% 0.0% 33.3% 33.3% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 20.0% 0.0% 20.0% 0.0% 0.0% 0.0% 20.0% 20.0% 0.0% 0.0% 0.0% 20.0%
lat-cs 0.0% 42.9% 0.0% 14.3% 0.0% 0.0% 0.0% 28.6% 14.3% 0.0% 0.0% 0.0% 0.0%
lat-cm 0.0% 0.0% 16.7% 16.7% 0.0% 0.0% 0.0% 33.3% 16.7% 16.7% 0.0% 0.0% 0.0%
bw-es 0.0% 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 16.7% 33.3% 16.7% 0.0% 0.0% 0.0%
bw-em 0.0% 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 16.7% 33.3% 16.7% 0.0% 0.0% 0.0%
bw-cs 0.0% 0.0% 20.0% 40.0% 0.0% 0.0% 0.0% 20.0% 20.0% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 0.0% 20.0% 40.0% 0.0% 0.0% 0.0% 0.0% 20.0% 20.0% 0.0% 0.0% 0.0%
pea-es 0.0% 28.6% 14.3% 14.3% 0.0% 0.0% 0.0% 14.3% 14.3% 14.3% 0.0% 0.0% 0.0%
pea-em 0.0% 33.3% 16.7% 16.7% 0.0% 0.0% 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 28.6% 0.0% 28.6% 0.0% 0.0% 0.0% 14.3% 14.3% 14.3% 0.0% 0.0% 0.0%
pea-cm 0.0% 50.0% 0.0% 16.7% 0.0% 0.0% 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 0.0%
np 0.0% 0.0% 33.3% 0.0% 0.0% 0.0% 0.0% 16.7% 33.3% 16.7% 0.0% 0.0% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 10: Simulative performance results of the Constant synthetic workload without perturbations
(denoted np) and with perturbations (all other scenarios), using SimAS and thirteen other loop
scheduling techniques on 416 cores of miniHPC. The percent performance improvement is normalized
to STATIC in the np scenario (the baseline case, without perturbations and with the baseline load
balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%), and improved
(< 100%) performance, respectively. The table shows how often each DLS technique is dynamically
selected by SimAS during execution.
[Heatmap: percent improvement (%) for each execution scenario (np, pea-cm, pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es) and each technique (STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS); color scale from 40 to 180.]
(a) Uniform workload simulative performance on 128 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 20.8% 56.2% 0.0% 0.0% 0.0% 0.0% 20.8% 2.1% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 2.3% 77.3% 2.3% 0.0% 0.0% 0.0% 18.2% 0.0% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 19.4% 54.8% 0.0% 0.0% 0.0% 0.0% 24.2% 1.6% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 2.0% 73.5% 0.0% 0.0% 0.0% 0.0% 22.4% 2.0% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 0.0% 73.9% 0.0% 0.0% 0.0% 0.0% 19.6% 2.2% 4.3% 0.0% 0.0% 0.0%
lat-em 0.0% 4.5% 72.7% 0.0% 0.0% 0.0% 0.0% 18.2% 4.5% 0.0% 0.0% 0.0% 0.0%
lat-cs 0.0% 2.1% 74.5% 2.1% 0.0% 0.0% 0.0% 19.1% 0.0% 2.1% 0.0% 0.0% 0.0%
lat-cm 0.0% 0.0% 71.7% 2.2% 0.0% 0.0% 0.0% 23.9% 0.0% 0.0% 0.0% 2.2% 0.0%
bw-es 0.0% 2.3% 76.7% 2.3% 0.0% 0.0% 0.0% 18.6% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-em 0.0% 2.3% 76.7% 2.3% 0.0% 0.0% 0.0% 18.6% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 4.7% 74.4% 2.3% 0.0% 0.0% 0.0% 16.3% 2.3% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 0.0% 70.2% 2.1% 0.0% 0.0% 0.0% 25.5% 0.0% 2.1% 0.0% 0.0% 0.0%
pea-es 0.0% 14.6% 64.6% 2.1% 0.0% 0.0% 0.0% 16.7% 2.1% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 4.4% 73.3% 0.0% 0.0% 0.0% 0.0% 17.8% 4.4% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 16.7% 56.7% 0.0% 0.0% 0.0% 0.0% 25.0% 1.7% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 4.1% 75.5% 0.0% 0.0% 0.0% 0.0% 18.4% 2.0% 0.0% 0.0% 0.0% 0.0%
np 0.0% 2.3% 76.7% 2.3% 0.0% 0.0% 0.0% 18.6% 0.0% 0.0% 0.0% 0.0% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 11: Simulative performance results of the Uniform synthetic workload without perturbations
(denoted np) and with perturbations (all other scenarios), using SimAS and thirteen other loop
scheduling techniques on 128 cores of miniHPC. The percent performance improvement is normalized
to STATIC in the np scenario (the baseline case, without perturbations and with the baseline load
balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%), and improved
(< 100%) performance, respectively. The table shows how often each DLS technique is dynamically
selected by SimAS during execution.
[Heatmap: percent improvement (%) for each execution scenario (np, pea-cm, pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es) and each technique (STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS); color scale from 40 to 180.]
(a) Uniform workload simulative performance on 416 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 7.7% 23.1% 30.8% 0.0% 0.0% 0.0% 30.8% 7.7% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 11.1% 22.2% 22.2% 0.0% 0.0% 0.0% 22.2% 22.2% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 0.0% 35.7% 28.6% 0.0% 0.0% 0.0% 28.6% 7.1% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 10.0% 40.0% 10.0% 0.0% 0.0% 0.0% 30.0% 10.0% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 0.0% 22.2% 11.1% 0.0% 0.0% 0.0% 44.4% 22.2% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 0.0% 22.2% 22.2% 0.0% 0.0% 0.0% 22.2% 11.1% 22.2% 0.0% 0.0% 0.0%
lat-cs 0.0% 0.0% 16.7% 33.3% 0.0% 0.0% 0.0% 41.7% 8.3% 0.0% 0.0% 0.0% 0.0%
lat-cm 0.0% 11.1% 22.2% 44.4% 0.0% 0.0% 0.0% 11.1% 11.1% 0.0% 0.0% 0.0% 0.0%
bw-es 0.0% 11.1% 33.3% 22.2% 0.0% 0.0% 0.0% 11.1% 22.2% 0.0% 0.0% 0.0% 0.0%
bw-em 0.0% 11.1% 33.3% 22.2% 0.0% 0.0% 0.0% 11.1% 22.2% 0.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 0.0% 33.3% 33.3% 0.0% 0.0% 0.0% 22.2% 11.1% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 0.0% 22.2% 22.2% 0.0% 0.0% 0.0% 22.2% 22.2% 11.1% 0.0% 0.0% 0.0%
pea-es 0.0% 9.1% 63.6% 0.0% 0.0% 0.0% 0.0% 18.2% 9.1% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 22.2% 33.3% 11.1% 0.0% 0.0% 0.0% 22.2% 11.1% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 0.0% 70.0% 0.0% 0.0% 0.0% 0.0% 10.0% 20.0% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 22.2% 33.3% 22.2% 0.0% 0.0% 0.0% 0.0% 11.1% 11.1% 0.0% 0.0% 0.0%
np 0.0% 11.1% 33.3% 22.2% 0.0% 0.0% 0.0% 11.1% 22.2% 0.0% 0.0% 0.0% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 12: Simulative performance results of the Uniform synthetic workload without perturbations
(denoted np) and with perturbations (all other scenarios), using SimAS and thirteen other loop
scheduling techniques on 416 cores of miniHPC. The percent performance improvement is normalized
to STATIC in the np scenario (the baseline case, without perturbations and with the baseline load
balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%), and improved
(< 100%) performance, respectively. The table shows how often each DLS technique is dynamically
selected by SimAS during execution.
[Heatmap: percent improvement (%) for each execution scenario (np, pea-cm, pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es) and each technique (STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS); color scale from 40 to 180.]
(a) Normal workload simulative performance on 128 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 92.6% 5.6% 0.0% 0.0% 0.0% 0.0% 1.9% 0.0% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 75.6% 3.1% 0.0% 0.0% 0.0% 0.0% 20.5% 0.8% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 93.1% 1.7% 0.0% 0.0% 0.0% 0.0% 4.6% 0.6% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 91.3% 1.7% 0.0% 0.0% 0.0% 0.0% 6.1% 0.9% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 75.6% 3.1% 0.0% 0.0% 0.0% 0.0% 20.5% 0.8% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 75.4% 3.2% 0.0% 0.0% 0.0% 0.0% 20.6% 0.8% 0.0% 0.0% 0.0% 0.0%
lat-cs 0.0% 92.0% 2.9% 0.0% 0.0% 0.0% 0.0% 4.3% 0.7% 0.0% 0.0% 0.0% 0.0%
lat-cm 0.0% 87.2% 5.5% 0.0% 0.0% 0.0% 0.0% 6.4% 0.0% 0.0% 0.0% 0.9% 0.0%
bw-es 0.0% 75.4% 3.2% 0.0% 0.0% 0.0% 0.0% 20.6% 0.8% 0.0% 0.0% 0.0% 0.0%
bw-em 0.0% 75.4% 3.2% 0.0% 0.0% 0.0% 0.0% 20.6% 0.8% 0.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 89.6% 5.7% 0.9% 0.0% 0.0% 0.0% 3.8% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 75.4% 3.2% 0.0% 0.0% 0.0% 0.0% 20.6% 0.8% 0.0% 0.0% 0.0% 0.0%
pea-es 0.0% 89.1% 4.5% 0.0% 0.0% 0.0% 0.0% 5.5% 0.9% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 75.6% 3.1% 0.0% 0.0% 0.0% 0.0% 20.5% 0.8% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 91.8% 1.5% 0.0% 0.0% 0.0% 0.0% 6.0% 0.7% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 73.7% 3.6% 0.0% 0.0% 0.0% 0.0% 21.9% 0.7% 0.0% 0.0% 0.0% 0.0%
np 0.0% 75.4% 3.2% 0.0% 0.0% 0.0% 0.0% 20.6% 0.8% 0.0% 0.0% 0.0% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 13: Simulative performance results of the Normal synthetic workload without perturbations
(denoted np) and with perturbations (all other scenarios), using SimAS and thirteen other loop
scheduling techniques on 128 cores of miniHPC. The percent performance improvement is normalized
to STATIC in the np scenario (the baseline case, without perturbations and with the baseline load
balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%), and improved
(< 100%) performance, respectively. The table shows how often each DLS technique is dynamically
selected by SimAS during execution.
[Heatmap: percent improvement (%) for each execution scenario (np, pea-cm, pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es) and each technique (STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS); color scale from 40 to 180.]
(a) Normal workload simulative performance on 416 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 19.2% 65.4% 0.0% 0.0% 0.0% 0.0% 11.5% 3.8% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 12.5% 62.5% 0.0% 0.0% 0.0% 0.0% 8.3% 16.7% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 25.7% 51.4% 0.0% 0.0% 0.0% 0.0% 20.0% 2.9% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 14.8% 51.9% 11.1% 0.0% 0.0% 0.0% 18.5% 3.7% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 9.1% 50.0% 9.1% 0.0% 0.0% 0.0% 4.5% 18.2% 9.1% 0.0% 0.0% 0.0%
lat-em 0.0% 9.1% 68.2% 9.1% 0.0% 0.0% 0.0% 13.6% 0.0% 0.0% 0.0% 0.0% 0.0%
lat-cs 0.0% 8.7% 56.5% 8.7% 0.0% 0.0% 0.0% 8.7% 17.4% 0.0% 0.0% 0.0% 0.0%
lat-cm 0.0% 9.1% 68.2% 13.6% 0.0% 0.0% 0.0% 9.1% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-es 0.0% 9.1% 54.5% 4.5% 0.0% 0.0% 0.0% 13.6% 18.2% 0.0% 0.0% 0.0% 0.0%
bw-em 0.0% 9.1% 54.5% 4.5% 0.0% 0.0% 0.0% 13.6% 18.2% 0.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 4.0% 64.0% 4.0% 0.0% 0.0% 0.0% 20.0% 8.0% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 8.7% 69.6% 0.0% 0.0% 0.0% 0.0% 13.0% 8.7% 0.0% 0.0% 0.0% 0.0%
pea-es 0.0% 23.1% 61.5% 3.8% 0.0% 0.0% 0.0% 11.5% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 0.0% 79.2% 0.0% 0.0% 0.0% 0.0% 12.5% 8.3% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 21.9% 53.1% 0.0% 0.0% 0.0% 0.0% 21.9% 3.1% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 4.3% 82.6% 4.3% 0.0% 0.0% 0.0% 4.3% 4.3% 0.0% 0.0% 0.0% 0.0%
np 0.0% 9.1% 54.5% 4.5% 0.0% 0.0% 0.0% 13.6% 18.2% 0.0% 0.0% 0.0% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 14: Simulative performance results of the Normal synthetic workload without perturbations
(denoted np) and with perturbations (all other scenarios), using SimAS and thirteen other loop
scheduling techniques on 416 cores of miniHPC. The percent performance improvement is normalized
to STATIC in the np scenario (the baseline case, without perturbations and with the baseline load
balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%), and improved
(< 100%) performance, respectively. The table shows how often each DLS technique is dynamically
selected by SimAS during execution.
[Heatmap: percent improvement (%) for each execution scenario (np, pea-cm, pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es) and each technique (STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS); color scale from 40 to 180.]
(a) Exponential workload simulative performance on 128 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 4.7% 72.1% 0.0% 0.0% 0.0% 0.0% 18.6% 4.7% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 0.0% 73.2% 2.4% 0.0% 0.0% 0.0% 22.0% 2.4% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 9.3% 66.7% 1.9% 0.0% 0.0% 0.0% 22.2% 0.0% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 4.7% 72.1% 0.0% 0.0% 0.0% 0.0% 20.9% 2.3% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 5.0% 70.0% 2.5% 0.0% 0.0% 0.0% 20.0% 2.5% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 2.5% 72.5% 0.0% 0.0% 0.0% 0.0% 20.0% 2.5% 0.0% 0.0% 2.5% 0.0%
lat-cs 0.0% 0.0% 73.8% 0.0% 0.0% 0.0% 0.0% 21.4% 4.8% 0.0% 0.0% 0.0% 0.0%
lat-cm 0.0% 0.0% 71.8% 2.6% 0.0% 0.0% 0.0% 23.1% 2.6% 0.0% 0.0% 0.0% 0.0%
bw-es 0.0% 7.3% 68.3% 0.0% 0.0% 0.0% 0.0% 19.5% 2.4% 0.0% 0.0% 2.4% 0.0%
bw-em 0.0% 7.3% 68.3% 0.0% 0.0% 0.0% 0.0% 19.5% 2.4% 0.0% 0.0% 2.4% 0.0%
bw-cs 0.0% 5.0% 72.5% 2.5% 0.0% 0.0% 0.0% 20.0% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 0.0% 72.5% 0.0% 0.0% 0.0% 0.0% 22.5% 2.5% 0.0% 0.0% 0.0% 2.5%
pea-es 0.0% 9.3% 72.1% 2.3% 0.0% 0.0% 0.0% 16.3% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 9.5% 69.0% 0.0% 0.0% 0.0% 0.0% 19.0% 2.4% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 5.8% 67.3% 0.0% 0.0% 0.0% 0.0% 25.0% 1.9% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 7.0% 72.1% 2.3% 0.0% 0.0% 0.0% 18.6% 0.0% 0.0% 0.0% 0.0% 0.0%
np 0.0% 7.3% 68.3% 0.0% 0.0% 0.0% 0.0% 19.5% 2.4% 0.0% 0.0% 2.4% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 15: Simulative performance results of the Exponential synthetic workload without
perturbations (denoted np) and with perturbations (all other scenarios), using SimAS and thirteen
other loop scheduling techniques on 128 cores of miniHPC. The percent performance improvement is
normalized to STATIC in the np scenario (the baseline case, without perturbations and with the
baseline load balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%),
and improved (< 100%) performance, respectively. The table shows how often each DLS technique is
dynamically selected by SimAS during execution.
[Heatmap: percent improvement (%) for each execution scenario (np, pea-cm, pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es) and each technique (STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS); color scale from 40 to 180.]
(a) Exponential workload simulative performance on 416 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 0.0% 3.8% 11.5% 0.0% 0.0% 0.0% 73.1% 11.5% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 10.0% 30.0% 0.0% 0.0% 0.0% 0.0% 40.0% 20.0% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 18.2% 54.5% 9.1% 0.0% 0.0% 0.0% 9.1% 9.1% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 0.0% 8.3% 25.0% 0.0% 0.0% 0.0% 50.0% 16.7% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 0.0% 25.0% 37.5% 0.0% 0.0% 0.0% 25.0% 12.5% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 9.1% 9.1% 9.1% 0.0% 0.0% 0.0% 54.5% 18.2% 0.0% 0.0% 0.0% 0.0%
lat-cs 0.0% 0.0% 8.3% 8.3% 0.0% 0.0% 0.0% 66.7% 16.7% 0.0% 0.0% 0.0% 0.0%
lat-cm 0.0% 12.5% 37.5% 25.0% 0.0% 0.0% 0.0% 25.0% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-es 0.0% 10.0% 10.0% 10.0% 0.0% 0.0% 0.0% 50.0% 20.0% 0.0% 0.0% 0.0% 0.0%
bw-em 0.0% 10.0% 10.0% 10.0% 0.0% 0.0% 0.0% 50.0% 20.0% 0.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 25.0% 25.0% 37.5% 0.0% 0.0% 0.0% 0.0% 12.5% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 10.0% 10.0% 20.0% 0.0% 0.0% 0.0% 40.0% 20.0% 0.0% 0.0% 0.0% 0.0%
pea-es 0.0% 11.1% 44.4% 22.2% 0.0% 0.0% 0.0% 22.2% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 12.5% 50.0% 12.5% 0.0% 0.0% 0.0% 25.0% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 15.4% 30.8% 7.7% 0.0% 0.0% 0.0% 38.5% 7.7% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 12.5% 62.5% 0.0% 0.0% 0.0% 0.0% 12.5% 12.5% 0.0% 0.0% 0.0% 0.0%
np 0.0% 0.0% 10.0% 10.0% 0.0% 0.0% 0.0% 60.0% 20.0% 0.0% 0.0% 0.0% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 16: Simulative performance results of the Exponential synthetic workload without
perturbations (denoted np) and with perturbations (all other scenarios), using SimAS and thirteen
other loop scheduling techniques on 416 cores of miniHPC. The percent performance improvement is
normalized to STATIC in the np scenario (the baseline case, without perturbations and with the
baseline load balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%),
and improved (< 100%) performance, respectively. The table shows how often each DLS technique is
dynamically selected by SimAS during execution.
[Heatmap: percent improvement (%) for each execution scenario (np, pea-cm, pea-cs, pea-em, pea-es, bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es) and each technique (STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF, AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS); color scale from 40 to 180.]
(a) Gamma workload simulative performance on 128 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 21.3% 57.4% 0.0% 0.0% 0.0% 0.0% 19.1% 2.1% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 4.5% 75.0% 4.5% 0.0% 0.0% 0.0% 15.9% 0.0% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 18.3% 55.0% 1.7% 0.0% 0.0% 0.0% 23.3% 1.7% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 6.2% 72.9% 0.0% 0.0% 0.0% 0.0% 18.8% 2.1% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 0.0% 71.1% 2.2% 0.0% 0.0% 0.0% 24.4% 0.0% 2.2% 0.0% 0.0% 0.0%
lat-em 0.0% 4.4% 71.1% 2.2% 0.0% 0.0% 0.0% 20.0% 0.0% 0.0% 2.2% 0.0% 0.0%
lat-cs 0.0% 0.0% 75.0% 2.3% 0.0% 0.0% 0.0% 18.2% 2.3% 0.0% 2.3% 0.0% 0.0%
lat-cm 0.0% 0.0% 75.0% 0.0% 0.0% 0.0% 0.0% 20.5% 2.3% 0.0% 0.0% 2.3% 0.0%
bw-es 0.0% 6.8% 72.7% 0.0% 0.0% 0.0% 0.0% 18.2% 2.3% 0.0% 0.0% 0.0% 0.0%
bw-em 0.0% 6.8% 72.7% 0.0% 0.0% 0.0% 0.0% 18.2% 2.3% 0.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 2.4% 78.6% 2.4% 0.0% 0.0% 0.0% 16.7% 0.0% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 4.5% 72.7% 2.3% 0.0% 0.0% 0.0% 20.5% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-es 0.0% 12.8% 61.7% 0.0% 0.0% 0.0% 0.0% 19.1% 4.3% 2.1% 0.0% 0.0% 0.0%
pea-em 0.0% 2.3% 74.4% 2.3% 0.0% 0.0% 0.0% 18.6% 2.3% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 10.3% 62.1% 0.0% 0.0% 0.0% 0.0% 25.9% 1.7% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 4.3% 72.3% 4.3% 0.0% 0.0% 0.0% 19.1% 0.0% 0.0% 0.0% 0.0% 0.0%
np 0.0% 6.8% 72.7% 0.0% 0.0% 0.0% 0.0% 18.2% 2.3% 0.0% 0.0% 0.0% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 17: Simulative performance results of the Gamma synthetic workload without perturbations
(denoted np) and with perturbations (all other scenarios), using SimAS and thirteen other loop
scheduling techniques on 128 cores of miniHPC. Performance is reported in percent, normalized to
STATIC in the np scenario (the baseline case without any perturbations and with the baseline load
balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%), and improved
(< 100%) performance, respectively. The table shows the DLS techniques dynamically selected by
SimAS during execution.
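The normalization and the color legend described in the caption can be illustrated with a small
sketch. It assumes that a plotted value is the execution time of a technique in a given scenario
divided by the execution time of STATIC in the np scenario, multiplied by 100, which is consistent
with values above 100% denoting degraded performance. The function names and the execution times
below are hypothetical and are not taken from the measurements.

```python
def normalized_performance(exec_time, baseline_time):
    """Return an execution time as a percentage of the baseline (STATIC in np)."""
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return 100.0 * exec_time / baseline_time


def classify(percent, tol=1e-9):
    """Map a normalized value to the color legend used in the captions."""
    if abs(percent - 100.0) <= tol:
        return "baseline"  # white
    return "degraded" if percent > 100.0 else "improved"  # red / blue


# Hypothetical execution times in seconds (assumed values, not from the paper).
t_static_np = 200.0
times = {"STATIC/np": 200.0, "STATIC/pea-cs": 320.0, "WF/pea-cs": 90.0}
for label, t in times.items():
    p = normalized_performance(t, t_static_np)
    print(f"{label}: {p:.1f}% ({classify(p)})")
```

With these assumed times, STATIC in np maps to the baseline, STATIC in pea-cs to a degraded value,
and WF in pea-cs to an improved value.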
[Figure 18(a): heat map. Loop scheduling techniques: STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF,
AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS. Execution scenarios: np, pea-cm, pea-cs, pea-em, pea-es,
bw-cm, bw-cs, bw-em, bw-es, lat-cm, lat-cs, lat-em, lat-es, all-cm, all-cs, all-em, all-es.
Color bar: percent improvement (%), ticks from 40 to 180.]
(a) Gamma workload simulative performance on 416 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF
all-es 0.0% 10.0% 30.0% 30.0% 0.0% 0.0% 0.0% 20.0% 10.0% 0.0% 0.0% 0.0% 0.0%
all-em 0.0% 22.2% 33.3% 22.2% 0.0% 0.0% 0.0% 11.1% 11.1% 0.0% 0.0% 0.0% 0.0%
all-cs 0.0% 0.0% 30.8% 15.4% 0.0% 0.0% 0.0% 38.5% 15.4% 0.0% 0.0% 0.0% 0.0%
all-cm 0.0% 11.1% 44.4% 11.1% 0.0% 0.0% 0.0% 11.1% 22.2% 0.0% 0.0% 0.0% 0.0%
lat-es 0.0% 0.0% 22.2% 22.2% 0.0% 0.0% 0.0% 44.4% 11.1% 0.0% 0.0% 0.0% 0.0%
lat-em 0.0% 0.0% 25.0% 25.0% 0.0% 0.0% 0.0% 37.5% 12.5% 0.0% 0.0% 0.0% 0.0%
lat-cs 0.0% 0.0% 22.2% 11.1% 0.0% 0.0% 0.0% 44.4% 11.1% 0.0% 0.0% 0.0% 11.1%
lat-cm 0.0% 0.0% 22.2% 22.2% 0.0% 0.0% 0.0% 33.3% 11.1% 11.1% 0.0% 0.0% 0.0%
bw-es 0.0% 0.0% 33.3% 11.1% 0.0% 0.0% 0.0% 22.2% 33.3% 0.0% 0.0% 0.0% 0.0%
bw-em 0.0% 0.0% 33.3% 11.1% 0.0% 0.0% 0.0% 22.2% 33.3% 0.0% 0.0% 0.0% 0.0%
bw-cs 0.0% 22.2% 22.2% 22.2% 0.0% 0.0% 0.0% 22.2% 11.1% 0.0% 0.0% 0.0% 0.0%
bw-cm 0.0% 11.1% 44.4% 0.0% 0.0% 0.0% 0.0% 33.3% 11.1% 0.0% 0.0% 0.0% 0.0%
pea-es 0.0% 20.0% 50.0% 20.0% 0.0% 0.0% 0.0% 10.0% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-em 0.0% 11.1% 44.4% 22.2% 0.0% 0.0% 0.0% 22.2% 0.0% 0.0% 0.0% 0.0% 0.0%
pea-cs 0.0% 9.1% 63.6% 0.0% 0.0% 0.0% 0.0% 9.1% 18.2% 0.0% 0.0% 0.0% 0.0%
pea-cm 0.0% 12.5% 37.5% 0.0% 0.0% 0.0% 0.0% 12.5% 37.5% 0.0% 0.0% 0.0% 0.0%
np 0.0% 0.0% 33.3% 11.1% 0.0% 0.0% 0.0% 22.2% 33.3% 0.0% 0.0% 0.0% 0.0%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 18: Simulative performance results of the Gamma synthetic workload without perturbations
(denoted np) and with perturbations (all other scenarios), using SimAS and thirteen other loop
scheduling techniques on 416 cores of miniHPC. Performance is reported in percent, normalized to
STATIC in the np scenario (the baseline case without any perturbations and with the baseline load
balancing method). White, red, and blue denote baseline (= 100%), degraded (> 100%), and improved
(< 100%) performance, respectively. The table shows the DLS techniques dynamically selected by
SimAS during execution.
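The selection percentages in such a table can be derived from a log of SimAS decisions, as the
following sketch illustrates. The raw number of SimAS calls is not reported, so the log below is a
hypothetical example of ten decisions, chosen such that it reproduces the percentages of the all-es
row for 416 cores; the function name is likewise an assumption.

```python
from collections import Counter

TECHNIQUES = ["STATIC", "SS", "FSC", "mFSC", "GSS", "TSS", "FAC",
              "WF", "AWF-B", "AWF-C", "AWF-D", "AWF-E", "AF"]


def selection_percentages(selections):
    """Return the percentage of SimAS calls that selected each technique."""
    total = len(selections)
    if total == 0:
        raise ValueError("no selections recorded")
    counts = Counter(selections)
    return {t: round(100.0 * counts[t] / total, 1) for t in TECHNIQUES}


# Hypothetical log of ten SimAS decisions (assumed, not reported in the paper).
log = ["FSC"] * 3 + ["mFSC"] * 3 + ["WF"] * 2 + ["SS", "AWF-B"]
row = selection_percentages(log)
print(" ".join(f"{row[t]:.1f}%" for t in TECHNIQUES))
```

Techniques that never appear in the log receive 0.0%, so every row of the table has one entry per
technique regardless of how many different techniques were actually selected.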
[Figure 19(a): heat map. Loop scheduling techniques: STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF,
AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS. Execution scenarios: np, pea-cm, pea-cs, lat-cm, lat-cs,
pea+lat-cm, pea+lat-cs. Color bar: percent improvement (%), ticks from 40 to 180.]
(a) PSIA native performance on 128 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF SimAS overhead
pea+lat-cs 0.0 0.0 55.0 12.6 0.0 0.0 0.0 32.5 0.0 0.0 0.0 0.0 0.0 0.4
pea+lat-cm 0.0 0.0 32.3 16.7 0.0 0.0 0.0 31.2 14.6 2.1 3.1 0.0 0.0 0.7
lat-cs 0.0 0.0 5.1 11.8 0.0 0.0 0.0 19.9 42.6 1.5 18.4 0.7 0.0 0.4
lat-cm 0.0 0.0 1.7 7.6 0.0 0.0 0.0 23.5 35.3 12.6 13.4 3.4 2.5 0.5
pea-cs 0.0 0.0 4.1 18.2 0.0 0.0 0.0 13.2 33.9 12.4 8.3 5.0 5.0 0.6
pea-cm 0.0 0.0 0.8 16.7 0.0 0.0 0.0 24.2 27.5 18.3 5.0 4.2 3.3 0.6
np 0.0 0.0 4.5 17.0 0.0 0.0 0.0 18.8 24.1 13.4 7.1 11.6 3.6 0.5
(b) Percentage of times each DLS technique is selected by SimAS (all values in percent)
Figure 19: Native performance results of PSIA without perturbations (denoted np) and with
perturbations (all other scenarios), using SimAS and thirteen other loop scheduling techniques on
miniHPC. Performance is reported in percent, normalized to STATIC in the np scenario (the baseline
case without any perturbations and with the baseline load balancing method). White, red, and blue
denote baseline (= 100%), degraded (> 100%), and improved (< 100%) performance, respectively. The
table shows the DLS techniques dynamically selected by SimAS and the percent of execution time
spent in SimAS calls.
[Figure 20(a): heat map. Loop scheduling techniques: STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF,
AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS. Execution scenarios: np, pea-cm, pea-cs, lat-cm, lat-cs,
pea+lat-cm, pea+lat-cs. Color bar: percent improvement (%), ticks from 40 to 180.]
(a) PSIA native performance on 416 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF SimAS overhead
pea+lat-cs 0.0 0.0 54.2 0.0 0.0 0.0 0.0 45.8 0.0 0.0 0.0 0.0 0.0 0.1
pea+lat-cm 0.0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.1
lat-cs 0.0 0.0 38.5 0.0 0.0 0.0 0.0 61.5 0.0 0.0 0.0 0.0 0.0 0.1
lat-cm 0.0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.1
pea-cs 0.0 0.0 0.0 0.0 0.0 0.0 0.0 50.0 0.0 0.0 0.0 0.0 50.0 0.1
pea-cm 0.0 0.0 0.0 0.0 0.0 0.0 0.0 50.0 0.0 0.0 0.0 0.0 50.0 0.1
np 0.0 0.0 0.0 0.0 0.0 0.0 0.0 50.0 0.0 0.0 0.0 0.0 50.0 0.1
(b) Percentage of times each DLS technique is selected by SimAS (all values in percent)
Figure 20: Native performance results of PSIA without perturbations (denoted np) and with
perturbations (all other scenarios), using SimAS and thirteen other loop scheduling techniques on
416 cores of miniHPC. Performance is reported in percent, normalized to STATIC in the np scenario
(the baseline case without any perturbations and with the baseline load balancing method). White,
red, and blue denote baseline (= 100%), degraded (> 100%), and improved (< 100%) performance,
respectively. The table shows the DLS techniques dynamically selected by SimAS and the percent of
execution time spent in SimAS calls.
[Figure 21(a): heat map. Loop scheduling techniques: STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF,
AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS. Execution scenarios: np, pea-cm, pea-cs, lat-cm, lat-cs,
pea+lat-cm, pea+lat-cs. Color bar: percent improvement (%), ticks from 40 to 180.]
(a) Mandelbrot native performance on 128 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF SimAS overhead
pea+lat-cs 0.0 0.0 11.8 36.5 0.0 0.0 0.0 38.8 7.1 0.0 5.9 0.0 0.0 0.2
pea+lat-cm 0.0 0.0 22.5 21.3 0.0 0.0 0.0 23.6 11.2 0.0 7.9 0.0 13.5 0.3
lat-cs 0.0 0.0 1.1 12.5 0.0 0.0 0.0 40.9 30.7 0.0 14.8 0.0 0.0 0.3
lat-cm 0.0 0.0 34.4 19.8 0.0 0.0 0.0 20.8 3.1 1.0 7.3 0.0 13.5 0.3
pea-cs 0.0 0.0 33.3 13.3 0.0 0.0 0.0 23.3 12.2 3.3 3.3 2.2 8.9 0.3
pea-cm 0.0 0.0 32.5 16.9 0.0 0.0 0.0 28.9 9.6 3.6 0.0 1.2 7.2 0.3
np 0.0 0.0 35.6 19.5 0.0 0.0 0.0 23.0 4.6 0.0 8.0 0.0 9.2 0.4
(b) Percentage of times each DLS technique is selected by SimAS (all values in percent)
Figure 21: Native performance results of Mandelbrot without perturbations (denoted np) and with
perturbations (all other scenarios), using SimAS and thirteen other loop scheduling techniques on
128 cores of miniHPC. Performance is reported in percent, normalized to STATIC in the np scenario
(the baseline case without any perturbations and with the baseline load balancing method). White,
red, and blue denote baseline (= 100%), degraded (> 100%), and improved (< 100%) performance,
respectively. The table shows the DLS techniques dynamically selected by SimAS and the percent of
execution time spent in SimAS calls.
[Figure 22(a): heat map. Loop scheduling techniques: STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF,
AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS. Execution scenarios: np, pea-cm, pea-cs, lat-cm, lat-cs,
pea+lat-cm, pea+lat-cs. Color bar: percent improvement (%), ticks from 40 to 180.]
(a) Mandelbrot native performance on 416 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF SimAS overhead
pea+lat-cs 0.0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.2
pea+lat-cm 0.0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.1
lat-cs 0.0 0.0 30.4 7.1 0.0 0.0 0.0 50.0 8.9 0.0 3.6 0.0 0.0 0.3
lat-cm 0.0 0.0 0.0 30.0 0.0 0.0 0.0 70.0 0.0 0.0 0.0 0.0 0.0 0.3
pea-cs 0.0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.3
pea-cm 0.0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.2
np 0.0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.3
(b) Percentage of times each DLS technique is selected by SimAS (all values in percent)
Figure 22: Native performance results of Mandelbrot without perturbations (denoted np) and with
perturbations (all other scenarios), using SimAS and thirteen other loop scheduling techniques on
416 cores of miniHPC. Performance is reported in percent, normalized to STATIC in the np scenario
(the baseline case without any perturbations and with the baseline load balancing method). White,
red, and blue denote baseline (= 100%), degraded (> 100%), and improved (< 100%) performance,
respectively. The table shows the DLS techniques dynamically selected by SimAS and the percent of
execution time spent in SimAS calls.
A targeted selection of native experiments has been conducted for PSIA and Mandelbrot. The
constant distribution of perturbation values was selected, as it significantly impacts application
performance. Perturbations in the network bandwidth were excluded from the native experiments due
to their minimal impact on performance (as shown above). The performance results of PSIA and
Mandelbrot with the thirteen DLS techniques under perturbations are shown in Figures 19-22. Similar
to the simulation-based predictions, the nonadaptive DLS techniques perform poorly on the perturbed
heterogeneous system. In particular, STATIC, GSS, TSS, and FAC are highly affected by all
considered perturbations. Unlike in the simulation-based predictions, STATIC is also slightly
affected by latency perturbations. This is because STATIC is implemented in DLS4LB in a
self-scheduling manner, i.e., workers obtain chunks of loop iterations during execution when they
become free. The chunk size of STATIC is equal to the total number of loop iterations divided by
the number of worker processes. Therefore, each worker obtains exactly one chunk. The adaptive
techniques resulted in comparable performance. However, in certain cases, AWF-E performed poorly
in the latency perturbation scenarios. Similar to the simulation-based predictions, AWF-B
outperforms all other techniques in most of the execution scenarios. SimAS results in the shortest
execution time in most of the cases, especially for PSIA. The application performance with SimAS
degraded in certain cases due to the non-preemptive scheduling implementation. Even though the
technique with the best predicted performance is selected upon a new call to SimAS, the execution
of already scheduled loop iterations cannot be preempted and resumed with the newly selected DLS
technique.
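The self-scheduled form of STATIC described above can be sketched as follows. This is a minimal
illustration and not the DLS4LB implementation: the function names, the rounding up of the chunk
size when the iterations do not divide evenly, and the order in which the work requests arrive are
all assumptions.

```python
from collections import deque
from math import ceil


def static_chunks(n_iters, n_workers):
    """Split [0, n_iters) into at most n_workers chunks of ceil(n_iters / n_workers)."""
    size = ceil(n_iters / n_workers)
    return deque((start, min(start + size, n_iters))
                 for start in range(0, n_iters, size))


def self_schedule(n_iters, n_workers, request_order):
    """Hand the next chunk to whichever worker asks the master first.

    request_order is a hypothetical sequence of worker ids, in the order in
    which their work requests reach the master.
    """
    queue = static_chunks(n_iters, n_workers)
    assignment = {}
    for worker in request_order:
        if not queue:
            break  # no work left; later requests receive nothing
        assignment.setdefault(worker, []).append(queue.popleft())
    return assignment


# Four workers that each send one request: every worker gets exactly one chunk.
print(self_schedule(1000, 4, request_order=[2, 0, 3, 1]))
```

Even though every worker receives a single chunk of 250 iterations in this example, the chunk is
only obtained after a request reaches the master, so a delayed request delays the start of the
work on that worker.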
To show the applicability of the SimAS approach to scientific applications, time-stepping versions
of PSIA and Mandelbrot are also executed under perturbations with and without SimAS. In the
time-stepping applications, i.e., PSIA TS and Mandelbrot TS, SimAS starts a new simulation at the
beginning of each time step. Until the simulations are finished, either WF is used as the default
DLS technique in these experiments or the DLS technique from the previous time step is kept. SimAS
then selects the best-performing DLS technique for the current time step based on the predictions
from the simulations. This represents another use case of SimAS, as time-stepping is frequently
encountered in scientific applications. The results of the time-stepping applications are shown in
Figure 23 and Figure 24. Similar to the non-time-stepping versions, SimAS improved the performance
of the applications in most of the cases. We note that no single DLS technique always achieves the
best performance. Therefore, a dynamic selection of the DLS technique according to the current
perturbations in the system is needed. The SimAS overhead is, in general, below 0.5% of the
execution time, except for PSIA TS, for which it reaches at most 2.7%. This is due to the short
execution time of the time-stepping version of PSIA compared to the non-time-stepping version.
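The per-time-step selection described above can be sketched as follows. The sketch assumes that
the simulations return one predicted execution time per DLS technique and that an empty result
means the simulations have not finished yet; the function names and the predicted times are
hypothetical.

```python
def select_dls(predicted_times, fallback):
    """Return the technique with the smallest predicted time.

    If no prediction is available yet, keep the fallback technique.
    """
    if not predicted_times:
        return fallback
    return min(predicted_times, key=predicted_times.get)


def run_time_steps(predictions_per_step, default="WF"):
    """Choose a DLS technique for every time step.

    predictions_per_step[i] holds the simulation results available when
    step i starts ({} if the simulations are still running).
    """
    current = default
    chosen = []
    for predictions in predictions_per_step:
        current = select_dls(predictions, fallback=current)
        chosen.append(current)
    return chosen


# Hypothetical predicted execution times in seconds (assumed values).
steps = [
    {},                                        # simulations still running: use WF
    {"WF": 12.0, "AWF-B": 11.4, "AF": 13.1},
    {},                                        # keep the previous choice
    {"WF": 10.2, "AWF-B": 10.9, "AF": 11.0},
]
print(run_time_steps(steps))
```

The first step falls back to the default WF, and the third step keeps the technique chosen in the
second step, which mirrors the two fallback cases described in the text.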
5.3 Discussion
Even though the applications considered are computationally intensive and only communicate loop
indices with the master, perturbations in network latency had a significant impact on performance.
The implementation choice of a scheduling technique, e.g., STATIC being implemented in a
self-scheduling fashion, degraded its performance in scenarios with network perturbations. In most
experiments, all the adaptive DLS techniques perform comparably. However, in certain instances,
e.g., AWF-C and AF in the lat-cm and lat-cs scenarios in Figure 21, their performance was
significantly poorer than that of the other adaptive DLS techniques. This poor performance is due
to the short execution time of the Mandelbrot application and the high variability of the loop
iteration execution times, in addition to the added perturbations, which do not allow the core
weights learned by these techniques to converge to the correct values.
Selecting the best-performing DLS technique before execution might not deliver the best
performance, as perturbations in the HPC system are unknown a priori. For instance, the best DLS
technique for Mandelbrot that could be identified before execution, i.e., in the np execution
scenario, is SS, which is outperformed by SimAS in lat-cs and pea+lat-cs in Figure 21. A similar
change in the best DLS technique is observed in the results of Mandelbrot TS in Figure 24. Since
there is no high load imbalance in PSIA or PSIA TS, there is no high variation in the performance
of the different DLS techniques. Since the best DLS technique cannot be known before execution,
SimAS improved the performance by dynamically selecting the DLS technique with the best
performance according to the simulation predictions.
In general, DLS techniques are designed to be efficient. However, efficiency comes at the cost of
robustness, as efficient techniques have a low tolerance to uncertain events. Uncertainty is
ineradicable, and it manifests in HPC systems as perturbations. This highlights the importance of
the careful choice of DLS techniques for each application, system size, and execution scenario.
Dynamic selection of DLS techniques ensures that each DLS technique is employed where it is the
most efficient.
[Figure 23(a): heat map. Loop scheduling techniques: STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF,
AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS. Execution scenarios: np, pea-cm, pea-cs, lat-cm, lat-cs,
pea+lat-cm, pea+lat-cs. Color bar: percent improvement (%), ticks from 40 to 180.]
(a) PSIA TS native performance on 128 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF SimAS overhead
pea+lat-cs 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 1.7%
pea+lat-cm 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 86.2% 13.8% 0.0% 0.0% 0.0% 0.0% 2.0%
lat-cs 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 1.5%
lat-cm 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 1.7%
pea-cs 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 94.3% 5.7% 0.0% 0.0% 0.0% 0.0% 2.7%
pea-cm 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 93.5% 6.5% 0.0% 0.0% 0.0% 0.0% 2.7%
np 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 96.2% 3.8% 0.0% 0.0% 0.0% 0.0% 2.7%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 23: Native performance results of PSIA TS without perturbations (denoted np) and with
perturbations (all other scenarios), using SimAS and thirteen other loop scheduling techniques on
miniHPC. Performance is reported in percent, normalized to STATIC in the np scenario (the baseline
case without any perturbations and with the baseline load balancing method). White, red, and blue
denote baseline (= 100%), degraded (> 100%), and improved (< 100%) performance, respectively. The
table shows the DLS techniques dynamically selected by SimAS and the percent of execution time
spent in SimAS calls.
[Figure 24(a): heat map. Loop scheduling techniques: STATIC, SS, FSC, mFSC, GSS, TSS, FAC, WF,
AWF-B, AWF-C, AWF-D, AWF-E, AF, SimAS. Execution scenarios: np, pea-cm, pea-cs, lat-cm, lat-cs,
pea+lat-cm, pea+lat-cs. Color bar: percent improvement (%), ticks from 40 to 180.]
(a) Mandelbrot TS native performance on 128 cores
Scenario STATIC SS FSC mFSC GSS TSS FAC WF AWF-B AWF-C AWF-D AWF-E AF SimAS overhead
pea+lat-cs 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.3%
pea+lat-cm 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.3%
lat-cs 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.2%
lat-cm 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.2%
pea-cs 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.5%
pea-cm 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.5%
np 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.0% 0.0% 0.0% 0.0% 50.0% 0.4%
(b) Percentage of times each DLS technique is selected by SimAS
Figure 24: Native performance results of Mandelbrot TS without perturbations (denoted np) and
with perturbations (all other scenarios), using SimAS and thirteen other loop scheduling
techniques on miniHPC. Performance is reported in percent, normalized to STATIC in the np scenario
(the baseline case without any perturbations and with the baseline load balancing method). White,
red, and blue denote baseline (= 100%), degraded (> 100%), and improved (< 100%) performance,
respectively. The table shows the DLS techniques dynamically selected by SimAS and the percent of
execution time spent in SimAS calls.
The SimAS approach can proactively select the best-suited DLS technique before any perturbations
manifest in the system, whenever perturbations can be predicted in advance. SimAS leverages
already developed simulators instead of requiring the development of novel prediction techniques.
The DLS selection decisions taken by SimAS can then be used to create a rule-based DLS selection
mechanism for a combination of application, system, and execution scenario, to improve application
performance dynamically without the need for online simulation.
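One possible shape of such a rule-based mechanism is sketched below. It is an illustration under
stated assumptions and not part of SimAS: the rule (pick the technique that SimAS selected most
often for a given combination), the function names, the decision log, and the use of WF as the
default for unseen combinations are all hypothetical.

```python
from collections import Counter


def build_rules(decisions):
    """Map (application, cores, scenario) to the most frequently selected DLS."""
    grouped = {}
    for app, cores, scenario, technique in decisions:
        grouped.setdefault((app, cores, scenario), Counter())[technique] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in grouped.items()}


def lookup(rules, app, cores, scenario, default="WF"):
    """Return the recorded choice, or a default for combinations never seen."""
    return rules.get((app, cores, scenario), default)


# Hypothetical decision log: (application, cores, scenario, selected DLS).
log = [
    ("Mandelbrot", 416, "np", "WF"),
    ("Mandelbrot", 416, "np", "WF"),
    ("PSIA", 128, "lat-cs", "AWF-B"),
    ("PSIA", 128, "lat-cs", "WF"),
    ("PSIA", 128, "lat-cs", "AWF-B"),
]
rules = build_rules(log)
print(lookup(rules, "PSIA", 128, "lat-cs"))
print(lookup(rules, "PSIA", 416, "bw-cs"))
```

A lookup of this kind costs a dictionary access at run time, which is the motivation for
replacing online simulation once enough decisions have been recorded.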
Running SimAS simulations and dynamically selecting DLS techniques incurs overhead. However,
this overhead has a limited effect on application performance. For example, the total time spent
in the SimAS setup and SimAS update functions is 3.49 seconds out of 1147.55 seconds of total
application execution time for PSIA on 128 cores in the lat-cs execution scenario. Nevertheless,
due to the non-preemptive property of the DLS techniques, the execution of already scheduled
chunks of loop iterations is not preempted and resumed with the newly selected DLS technique. As
shown in Figure 19b, even though SimAS selected DLS techniques with shorter execution times in the
lat-cs scenario for PSIA on 128 cores, the execution time with SimAS was even longer than that of
SS, which was not selected by SimAS.
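The relative overhead implied by the two times quoted above can be checked with a short
calculation. The function name is an assumption; the two input values are the ones given in the
text, both taken to be in seconds.

```python
def overhead_percent(simas_time_s, total_time_s):
    """Share of the total execution time spent in SimAS calls, in percent."""
    if total_time_s <= 0:
        raise ValueError("total_time_s must be positive")
    return 100.0 * simas_time_s / total_time_s


# Times quoted in the text for PSIA on 128 cores in the lat-cs scenario.
print(f"{overhead_percent(3.49, 1147.55):.2f}%")
```

The result is roughly 0.3% of the total execution time.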
In time-stepping applications, the effect of frequent DLS technique switching and the
non-preemption overhead is much smaller than in the single-sweep applications. Therefore, the
performance of time-stepping applications with SimAS under perturbations is better than that of
the single-sweep versions of the same applications, as can be seen in Figure 23 and Figure 24.
Preempting scheduled (yet not executed) loop iterations may improve the performance when
switching DLS techniques.
6 Conclusion and Future Work
A new control-theoretic inspired approach, namely the simulator-assisted scheduling (SimAS)
approach, was introduced to dynamically select the DLS technique that is predicted to deliver the
best performance under unpredictable perturbations. The performance of two real applications and
five synthetic workloads was studied under perturbations, and insights into the resilience of the
DLS techniques to perturbations were provided. The performance results confirm the hypothesis
that no single DLS technique can achieve the best performance in all the considered execution
scenarios. Furthermore, native DLS experiments under system-induced perturbations showed that even
computationally intensive applications could be significantly affected by perturbations in the
network characteristics. The implementation choice of scheduling techniques, such as STATIC
implemented in a self-scheduling manner, led to the degradation of its performance under network
perturbations. Using the SimAS approach improved the performance of applications in most
experiments. SimAS leverages state-of-the-art simulators to select the DLS technique predicted to
result in the best performance of an application under perturbations. However, because
applications are scheduled non-preemptively, changing the DLS technique during execution may not
always result in the best performance. Future work will experiment with preempting scheduled yet
not executed loop iterations upon a change in the DLS technique selected by SimAS. Furthermore,
experiments to investigate and enhance the performance of SimAS, in terms of improving the DLS
selection strategy and the period between SimAS calls, are also planned as future work.
Acknowledgments
This work has been supported by the Swiss Platform for Advanced Scientific Computing (PASC)
project SPH-EXA: Optimizing Smooth Particle Hydrodynamics for Exascale Computing and by the Swiss
National Science Foundation in the context of the Multi-level Scheduling in Large Scale High
Performance Computers (MLS) grant number 169123. The authors gratefully acknowledge Ahmed Eleliemy
for sharing an initial implementation of the PSIA application.
References
[1] C. P. Kruskal and A. Weiss, “Allocating Independent Subtasks on Parallel Processors,”
IEEE Transactions on Software Engineering, vol. SE-11, no. 10, pp. 1001–1016, 1985.
[2] I. Banicescu, F. M. Ciorba, and S. Srivastava, Scalable Computing: Theory and Practice,
ch. Performance Optimization of Scientific Applications using an Autonomic Computing
Approach, pp. 437–466. No. Chapter 22, John Wiley & Sons, Inc, 2013.
[3] C. D. Polychronopoulos and D. J. Kuck, “Guided Self-Scheduling: A Practical Scheduling
Scheme for Parallel Supercomputers,” IEEE Transactions on Computers, vol. 100, no. 12,
pp. 1425–1439, 1987.
[4] T. H. Tzen and L. M. Ni, “Trapezoid self-scheduling: A practical scheduling scheme for
parallel compilers,” IEEE Transactions on parallel and distributed systems, vol. 4, no. 1,
pp. 87–98, 1993.
[5] S. Flynn Hummel, E. Schonberg, and L. E. Flynn, “Factoring: A method for scheduling
parallel loops,” Communications of the ACM, vol. 35, no. 8, pp. 90–101, 1992.
[6] S. Flynn Hummel, J. Schmidt, R. Uma, and J. Wein, “Load-sharing in Heterogeneous
Systems via Weighted Factoring,” in Proceedings of the Annual ACM Symposium on
Parallel Algorithms and Architectures, pp. 318–328, 1996.
[7] I. Banicescu, V. Velusamy, and J. Devaprasad, “On the Scalability of Dynamic Scheduling
Scientific Applications With Adaptive Weighted Factoring,” Cluster Computing, vol. 6,
no. 3, pp. 215–226, 2003.
[8] R. L. Carin˜o and I. Banicescu, “Dynamic Load Balancing With Adaptive Factoring Meth-
ods in Scientific Applications,” Journal of Supercomputing, vol. 44, no. 1, pp. 41–63, 2008.
[9] I. Banicescu and Z. Liu, “Adaptive Factoring: A Dynamic Scheduling Method Tuned
to the Rate of Weight Changes,” in Proceedings of the High Performance Computing
Symposium, pp. 122–129, 2000.
[10] N. Sukhija, B. Malone, S. Srivastava, I. Banicescu, and F. M. Ciorba, “Portfolio-based
Selection of Robust Dynamic Loop Scheduling Algorithms Using Machine Learning,” in
Proceedings of the 28th IEEE International Parallel and Distributed Processing Sympo-
sium Workshops, pp. 1638–1647, 2014.
[11] H. Casanova, A. Giersch, A. Legrand, M. Quinson, and F. Suter, “Versatile, Scalable,
and Accurate Simulation of Distributed Applications and Platforms,” Journal of Parallel
and Distributed Computing, vol. 74, no. 10, pp. 2899–2917, 2014.
[12] A. Boulmier, I. Banicescu, F. M. Ciorba, and N. Abdennadher, “An Autonomic Approach
for the Selection of Robust Dynamic Loop Scheduling Techniques,” in Proceedings of 16th
International Symposium on Parallel and Distributed Computing, pp. 9–17, 2017.
[13] A. Mohammed and F. M. Ciorba, “A Study of the Performance of Scientific Applica-
tions with Dynamic Loop Scheduling under Perturbations.” Poster at 2018 Platform for
Advanced Scientific Computing Conference (PASC18), July 2018.
[14] A. Eleliemy, A. Mohammed, and F. M. Ciorba, “Efficient Generation of Parallel Spin-
images Using Dynamic Loop Scheduling,” in Proceedings of the 19th IEEE International
42
Conference for High Performance Computing and Communications Workshops, pp. 34–
41, 2017.
[15] B. B. Mandelbrot, “Fractal aspects of the iteration of z λz (1-z) for complex λ and z,”
Annals of the New York Academy of Sciences, vol. 357, no. 1, pp. 249–259, 1980.
[16] R. L. Carino and I. Banicescu, “A tool for a two-level dynamic load balancing strategy in
scientific applications,” Scalable Computing: Practice and Experience, vol. 8, no. 3, 2007.
[17] T. Peiyi and Y. Pen-Chung, “Processor Self-Scheduling for Multiple-Nested Parallel
Loops,” in Proceedings of the International Conference on Parallel Processing, pp. 528–
535, 1986.
[18] P. Velho and A. Legrand, “Accuracy Study and Improvement of Network Simulation in the
SimGrid Framework,” in Proceedings of the 2nd International Conference on Simulation
Tools and Techniques, p. 10, 2009.
[19] A. Mohammed, A. Eleliemy, and F. M. Ciorba, “Performance Reproduction and Pre-
diction of Selected Dynamic Loop Scheduling Experiments,” in Proceedings of the 2018
International Conference on High Performance Computing and Simulation, p. 8, 2018.
[20] A. Mohammed, A. Eleliemy, F. M. Ciorba, F. Kasielke, and I. Banicescu, “Experimental
Verification and Analysis of Dynamic Loop Scheduling in Scientific Applications,” in
Proceedings of the 17th International Symposium on Parallel and Distributed Computing,
p. 8, 2018.
[21] S. Ali, A. A. Maciejewski, H. J. Siegel, and J.-K. Kim, “Measuring the Robustness of a
Resource Allocation,” IEEE Transactions on Parallel and Distributed Systems, vol. 15,
no. 7, pp. 630–641, 2004.
[22] L.-C. Canon and E. Jeannot, “Evaluation and Optimization of the Robustness of DAG
Schedules in Heterogeneous Environments,” IEEE Transactions on Parallel and Dis-
tributed Systems, vol. 21, no. 4, pp. 532–546, 2010.
[23] Y. Yang and H. Casanova, “Rumr: Robust Scheduling for Divisible Workloads,” in Pro-
ceedings of the 12th IEEE International Symposium on High Performance Distributed
Computing, pp. 114–123, 2003.
[24] N. Sukhija, I. Banicescu, S. Srivastava, and F. M. Ciorba, “Evaluating the Flexibility of
Dynamic Loop Scheduling on Heterogeneous Systems in the Presence of Fluctuating Load
Using SimGrid,” in Proceedings of the 27th IEEE International Parallel and Distributed
Processing Symposium Workshops, pp. 1429–1438, 2013.
[25] Y. Zhang, M. Voss, and E. Rogers, “Runtime Empirical Selection of Loop Schedulers on
Hyperthreaded SMPs,” in Proceedings of the 19th International Parallel and Distributed
Processing Symposium, p. 10, 2005.
[26] H. Menon, K. Chandrasekar, and L. V. Kale, “poster: Automated load balancer selec-
tion based on application characteristics,” in Proceedings of the 22Nd ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, (New York, NY, USA),
pp. 447–448, 2017.
[27] J. B. Rawlings, “Tutorial: Overview of Model Predictive Control,” IEEE Control Systems,
vol. 20, no. 3, pp. 38–52, 2000.
[28] A. Mohammed, A. Eleliemy, F. M. Ciorba, F. Kasielke, and I. Banicescu, “An Approach
for Realistically Simulating the Performance of Scientific Applications on High Perfor-
mance Computing Systems,” Future Generation Computer Systems (FGCS), 2019.
43
[29] F. M. Ciorba, “The importance and need for system monitoring and analysis in HPC
operations and research,” in Proceedings of the 3rd bwHPC-Symposium: Heidelberg 2016,
(Heidelberg), p. 10 pp., Oct 2017.
[30] R. Mehrotra, I. Banicescu, S. Srivastava, and S. Abdelwahed, “A Power-aware Autonomic
Approach for Performance Management of Scientific Applications in a Data Center En-
vironment,” in Handbook on Data Centers, pp. 163–189, 2015.
[31] SiL: An Approach for Adjusting Applications to Heterogeneous Systems Under Perturba-
tions, (Turin), August 2018.
[32] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, “A Portable Programming
Interface for Performance Evaluation on Modern Processors,” International Journal of
High Performance Computing Applications, vol. 14, no. 3, pp. 189–204, 2000.
