Performance Reproduction and Prediction of Selected Dynamic Loop
  Scheduling Experiments by Mohammed, Ali et al.
Ali Mohammed, Ahmed Eleliemy, and Florina M. Ciorba
Department of Mathematics and Computer Science
University of Basel, Switzerland
arXiv:1805.07998v2 [cs.DC] 7 Jun 2018
Contents
1 Introduction
2 Background and Related Work
3 Reproduction and Prediction Methodology
4 Reproduction of Selected Experiments via Simulation
5 Reproduction of Selected Experiments via Native Execution
6 Prediction of DLS Performance via Simulation
7 Conclusion and Future Work
Abstract
Scientific applications are complex, large, and often exhibit irregular
and stochastic behavior. The use of efficient loop scheduling techniques,
from static to fully dynamic, in computationally-intensive applications
characterized by large data-parallel loops is crucial for improving their
performance on high-performance computing (HPC) platforms, where it is
often degraded by load imbalance. A number of dynamic loop scheduling
(DLS) techniques were proposed between the late 1980s and the early 2000s
and have been used efficiently in scientific applications. In most cases,
the computing systems on which they were tested and validated are no
longer available. The use of DLS to improve the performance of
computationally-intensive scientific applications executing on modern HPC
platforms is of increased significance today, as system-induced load
imbalance is exacerbated by the diversity, complexity, increased size,
increased heterogeneity, and massively parallel nature of these systems.
This work is concerned with minimizing the sources of uncertainty in the
implementation of DLS techniques to avoid unnecessary influences on the
performance of scientific applications. It is therefore important to
ensure that the DLS techniques employed in scientific applications today
adhere to their original design goals and specifications. The goal of this
work is to attain and increase the trust in the implementation of DLS
techniques in today's studies. To achieve this goal, the performance of a
selection of scheduling experiments from the 1992 original work that
introduced factoring, an efficient DLS technique proposed for
shared-memory systems, is reproduced and predicted via both simulative and
native experimentation. The scientific challenge is the reproduction of
the performance of the past experiments with incomplete information, such
as the computing system characteristics and the implementation details.
The selected scheduling experiments involve two computational kernels and
four loop scheduling techniques. The experiments show that the simulation
reproduces the performance achieved on the past computing platform and
accurately predicts the performance achieved on the present computing
platform. The performance reproduction and prediction confirm that the
present implementation of the DLS techniques considered, both in
simulation and natively, adheres to their original description. Moreover,
the simulative and native experiments follow the expected performance
behavior for the considered scheduling scenarios. The results confirm the
hypothesis that reproducing experiments of identical scheduling scenarios
on past and modern hardware leads to entirely different behavior than
expected. This work paves the way towards additional simulative and native
experimentation using further DLS techniques in the future.
Keywords. Dynamic loop scheduling; performance reproduction; perfor-
mance prediction; simulation; native experimentation; shared-memory; many-
core architecture.
1 Introduction
Dynamic loop scheduling (DLS) is an effective scheduling approach employed
in computationally-intensive scientific applications for the purpose of opti-
mizing their performance in the presence of load imbalance caused by prob-
lem, algorithmic, and systemic characteristics. The DLS techniques dynam-
ically schedule the work contained in the parallel loop iterations among the
parallel processing units whenever they become available and request work.
Over the years, DLS techniques have successfully been used in scientific
applications, such as N-body simulations, computational fluid dynamics,
radar signal processing [1], and computer vision applications [2].
One of the well-known and efficient DLS techniques is factoring, intro-
duced by Hummel et al. [3] in 1992. Therein, the performance of the IBM
Research Parallel Processor Prototype (hereafter, the RP3) system [4] was
compared for the execution of three computational kernels: matrix mul-
tiplication, adjoint convolution, and Gauss-Jordan elimination, using four
scheduling techniques: straightforward parallelization (or static chunking,
STATIC), self-scheduling (SS) [5], guided self-scheduling (GSS) [6], and
factoring (FAC) [3]. In the present work, the scheduling behavior of the
first two computational kernels using the above four scheduling techniques
is reproduced to confirm that the implementations of the DLS techniques,
both in simulation and in native codes, adhere to their original goals and
specifications [3].
Confirming the adherence of the DLS implementations to their original
design goals minimizes the sources of uncertainty in their implementation
and helps avoid unnecessary influences on the performance of scientific
applications. For instance, a DLS technique that has been implemented to
intensively use shared-memory locks will cause unnecessary scheduling
overhead, thereby adversely influencing performance. Another example is
that SS in present implementations performs differently than in the past
for the particular implementation of matrix multiplication (MM) considered
in the past and in this work, leading to uncertainty about whether the
present SS implementation adheres to the original one. It was found that
SS is properly implemented in the present and that the discrepancy arises
because MM was compute-bound in the past, while it is memory-bound in the
present. The achieved trust in the implementation of DLS techniques for
shared-memory systems has already been transferred to their implementation
for distributed-memory systems [7]. Confirming the adherence of the
STATIC, SS, GSS, and FAC implementations to their original design goals
lays the foundation for confirming the implementation of further DLS
techniques that aim to better balance the increase in load balancing with
the increase in scheduling overhead, such as weighted factoring [8],
adaptive weighted factoring [9], and adaptive factoring [10].
Reproducibility is a key aspect of the scientific method [11]. The
reproduction of scientific experiments contributes to validating those
experiments and to establishing that the conclusions drawn from them
are of scientific relevance [12]. In the present work, reproduction [11]
is defined as revisiting a certain scientific problem, namely, the performance
of DLS techniques [3], without the original artifacts or the possibility to ex-
ecute the artifacts on the original computing system [4]. Reproduction is
employed in this work as a means to attain and increase the trust in native
and simulative implementations of DLS.
The scheduling experiments selected from the work of Flynn Hummel et
al. [3] were reproduced earlier [13], using simulative as well as native
execution with DLS techniques implemented employing a centralized process
coordination approach. This work extends that reproduction by
investigating the implementation of the DLS techniques using decentralized
process coordination. To the best of the authors' understanding, the DLS
techniques were originally implemented using decentralized process
coordination. For completeness, both implementations, with centralized and
with decentralized process coordination, are compared in this work. The
simulative experiments were conducted with a simulator developed based on
the SimGrid-SimDag (hereafter, SG-SD) interface that employs individual
representations of the two HPC platforms considered: IBM RP3 (past) and
Intel Knights Landing (present). The reproduction of the selected
scheduling experiments is a means for the experimental verification of the
implementation of STATIC, SS, GSS, and FAC using SG-SD. Moreover, the
selected scheduling experiments were performed natively on an Intel
Knights Landing (hereafter, the KNL) architecture, whose characteristics
are captured in a platform file representation required by the simulator.
SG-SD is then used to predict the performance of the execution on the KNL.
The results of the native and simulative executions were compared and
found to be in close agreement, which increases the confidence in the
simulation-based prediction of the performance of DLS experiments.
The present work makes the following contributions: (1) It employs
reproduction as a means to experimentally verify the SG-SD implementation
of STATIC, SS, GSS, and FAC by comparing the present simulation results
with the corresponding results on the RP3 from the work of Flynn Hummel
et al. [3]. (2) It repeats the selected scheduling experiments [3] on the
KNL 7210 processor to explore whether the conclusions of the past
experiments hold on a modern computing system. (3) It introduces an
SG-SD-based simulator to simulate and predict the behavior of two
computational kernels using four loop scheduling techniques [3] that
employ the decentralized process coordination approach. Experimentally
verified implementations of DLS techniques can be useful for studying the
relation between their use at different levels of scheduling [14].
Moreover, the present work enables future studies on the scheduling
behavior under various scheduling scenarios and in the presence of
variable application and system properties.
The remainder of this work is structured as follows: The background on
DLS techniques, the simulation toolkit, as well as an overview of relevant re-
producibility studies are reviewed in Section 2. The proposed methodology
for performance reproduction and prediction of DLS is described in Section 3.
The reproduction of the selected experiments on the RP3 is presented in Sec-
tion 4. The reproduction of the selected scheduling experiments on the KNL
architecture is detailed in Section 5. The performance of the KNL-based ex-
periments predicted with SG-SD is compared against the performance of the
native experiments on the KNL and discussed in Section 6. The conclusion
and insights into future work are outlined in Section 7.
2 Background and Related Work
This section reviews the dynamic loop scheduling techniques and the SimGrid
simulation toolkit. A number of relevant reproducibility studies are also
discussed.
Dynamic loop scheduling. The loop scheduling techniques considered in
this work can be classified into static and dynamic. Using straightforward
parallelization (denoted STATIC), the parallel loop iterations are divided
into equally-sized chunks. Each processor is assigned exactly one chunk,
whose size equals the overall number of loop iterations (N) divided by the
number of available processing elements (P). STATIC has a very low
scheduling overhead (h), bounded above by P. Application performance may
be degraded due to load imbalance if the execution of the loop iterations
is characterized by high variability. Self-scheduling (SS) [5] is a
dynamic loop scheduling technique at the other scheduling extreme, whereby
a processing element obtains a chunk consisting of exactly one loop
iteration whenever it becomes available and requests work. When all loop
iterations have been self-scheduled, the processors finish their execution
at virtually the same time due to the fine-grain self-balancing of the
workload. Scheduling a single loop iteration at a time leads to a higher
scheduling overhead than with STATIC and potentially to an overall
completion time greater than the optimal time.
A number of other DLS techniques provide a trade-off between mini-
mizing scheduling overhead and maximizing the load balancing. Two such
techniques are guided self-scheduling (GSS) [6] and factoring (FAC) [3]. GSS
assigns a chunk of loop iterations to an available and requesting processor
that is equal to the number of the remaining unscheduled loop iterations (R)
divided by the number of the processors P . Therefore, chunks are of de-
creasing sizes, and workload can be balanced among the processors also in
the case of uneven processor start times. Even though GSS offers a good com-
promise between load balancing and scheduling overhead, it assigns a very
large chunk to the first available worker. The execution of this chunk can
dominate the application performance, leading to load imbalance. FAC [3]
is designed to balance the execution of loop iterations with variable
execution times. It assigns chunks of loop iterations to available and
requesting workers in batches, rather than as single loop iterations at a
time, thereby reducing the scheduling overhead h. The number of loop
iterations in a chunk depends on the remaining number of loop iterations R
and on the coefficient of variation (c.o.v.) of the execution times of the
loop iterations.
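The chunk size rules described above can be illustrated with a minimal C sketch. It is not the implementation used in this work. The names dls_method, dls_state, next_chunk, and count_steps are hypothetical, and the FAC rule shown is an assumption: it uses the commonly cited practical variant in which each batch of P equal chunks takes half of the remaining iterations, instead of deriving the batch size from the c.o.v. of the iteration execution times.

```c
#include <assert.h>

/* The four loop scheduling techniques discussed above. */
typedef enum { DLS_STATIC, DLS_SS, DLS_GSS, DLS_FAC } dls_method;

/* Scheduling state shared by all chunk calculations (hypothetical layout). */
typedef struct {
    long n;          /* total number of loop iterations (N) */
    long p;          /* number of processing elements (P) */
    long scheduled;  /* iterations handed out so far */
    long step;       /* scheduling steps performed so far */
    long fac_chunk;  /* chunk size of the current FAC batch */
} dls_state;

/* Integer ceiling of a / b for b > 0. */
static long ceil_div(long a, long b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Return the size of the next chunk and advance the state; 0 means done. */
long next_chunk(dls_state *s, dls_method m) {
    long r = s->n - s->scheduled;  /* remaining iterations (R) */
    long chunk = 1;                /* SS: one iteration per request */
    if (r <= 0) return 0;
    if (m == DLS_STATIC) {
        chunk = ceil_div(s->n, s->p);          /* N / P */
    } else if (m == DLS_GSS) {
        chunk = ceil_div(r, s->p);             /* R / P */
    } else if (m == DLS_FAC) {
        /* Assumed practical rule: a new batch of P chunks takes half of R. */
        if (s->step % s->p == 0) s->fac_chunk = ceil_div(r, 2 * s->p);
        chunk = s->fac_chunk;
    }
    if (chunk > r) chunk = r;      /* never hand out more than remains */
    s->scheduled += chunk;
    s->step++;
    return chunk;
}

/* Number of scheduling steps needed to hand out all n iterations. */
long count_steps(long n, long p, dls_method m) {
    dls_state s = { n, p, 0, 0, 0 };
    while (next_chunk(&s, m) > 0) { }
    return s.step;
}
```

Counting the scheduling steps with count_steps for a fixed N and P makes the trade-off visible: STATIC needs the fewest steps, SS the most, and GSS and FAC lie in between.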
Loop scheduling in simulation. SimGrid [15] is a scientific simulation
framework for the study of the behavior of large-scale distributed
computing systems, such as the Grid, the Cloud, and peer-to-peer (P2P)
systems. It provides ready-to-use models and application programming
interfaces (APIs) to simulate various distributed computing systems.
SimGrid (hereafter, SG) provides four different APIs for different
simulation purposes. The MetaSimGrid (MSG) and SimDag (SD) interfaces
provide APIs for the simulation of computational problems expressed as
parallel independent tasks or as parallel task graphs, respectively. The
SMPI interface provides the functionality for the simulation of programs
written using the message passing interface (MPI) and targets developers
interested in the simulation and debugging of their parallel MPI codes.
The newly introduced S4U interface currently supports most of the
functionality of the MSG interface, with the purpose of also incorporating
the functionality of the SD interface over time. This work considers the
SG-SD interface.
Related work. DLS techniques have previously been implemented in
the SG-MSG interface with the purpose of studying their scalability [16] and
robustness against load imbalance [17]. Moreover, a number of DLS tech-
niques were also implemented in SG-MSG to study their resilience in a het-
erogeneous computing system [18]. Another closely related study performed
simulation-based reproduction (using SG-MSG) to confirm and validate the
implementation of several DLS techniques in simulation [19].
Two approaches can be employed to implement process coordination in
the DLS techniques natively or in simulation: (1) Centralized process co-
ordination, using a master-worker execution model; and (2) Decentralized
process coordination, wherein each “worker thread” calculates and executes
a chunk of work whenever it becomes available. A first effort to reproduce a
selection of experiments from [3] using simulation considered the DLS tech-
niques implemented using the master-worker execution model [13].
The present work extends and complements previous work [13] by inves-
tigating the reproduction of a selection of experiments [3] with the DLS tech-
niques employing a decentralized process coordination approach using only
“worker threads” without a “master thread”. In this approach, threads (or
processes) are responsible for obtaining work on their own from the central
work queue shared via memory, eliminating the master-side contention that
characterizes the centralized process coordination model. The goal of this
work is the use of performance reproduction and performance prediction as
a means of experimental verification of the adherence of the DLS techniques
implementation in SG-SD to the original design goals and specifications.
3 Reproduction and Prediction Methodology
In this work, four scheduling techniques, STATIC, SS, GSS, and FAC, are
implemented in SG-SD. To confirm the implementation of these scheduling
techniques, selected scheduling experiments from the original
publication [3] are reproduced using simulation. As mentioned in
Section 2, the DLS techniques under study can be implemented using one of
two approaches: (1) centralized process coordination; or (2) decentralized
process coordination. In this work, the DLS techniques implemented using
the decentralized process coordination approach are investigated. In a
centralized
process coordination approach, the master calculates and assigns chunks of
iterations to available and requesting workers. Also, the master can be ded-
icated or act as a worker as well when there are no requests to serve from
workers. The decentralized approach studied in the present work is close
to the implementation described in the original work [3], where each thread
calculates and obtains a chunk of work when it becomes available. Atomic
operations are used instead of locks to optimize the implementation as pro-
posed in the original publication. The results of simulating and executing
the scheduling experiments using centralized (from previous work [13]) and
decentralized (from the present work) process coordination approaches are
compared in Sections 4 and 5.
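The lock-free, decentralized way of obtaining work described above can be sketched with C11 atomics. This is an illustration under stated assumptions, not the code of this work: the names work_queue, obtain_chunk, and drain are hypothetical, and the GSS rule is used only as an example chunk calculation. Each worker thread would call drain concurrently on the same queue.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared loop state (hypothetical layout). */
typedef struct {
    atomic_long next;  /* first iteration not yet handed out */
    long total;        /* total number of loop iterations (N) */
    long workers;      /* number of worker threads (P) */
} work_queue;

/*
 * Claim the chunk [*start, *start + size) without taking a lock.
 * Returns the chunk size, or 0 when no iterations remain.
 */
long obtain_chunk(work_queue *q, long *start) {
    assert(q->workers > 0);
    long begin = atomic_load(&q->next);
    for (;;) {
        long remaining = q->total - begin;
        if (remaining <= 0) return 0;
        /* GSS rule as an example: remaining iterations divided by P. */
        long size = (remaining + q->workers - 1) / q->workers;
        /* On failure, begin is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&q->next, &begin, begin + size)) {
            *start = begin;
            return size;
        }
    }
}

/* Loop a worker thread would run; returns the iterations it executed. */
long drain(work_queue *q) {
    long start = 0, size, done = 0;
    while ((size = obtain_chunk(q, &start)) > 0) {
        /* Iterations start .. start + size - 1 would be executed here. */
        done += size;
    }
    return done;
}
```

The compare-and-exchange loop is needed because the chunk size depends on the current value of the shared index. For SS, where the chunk size is always one, a single atomic fetch-and-add would suffice.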
The reproduction and prediction approach consists of three steps, as
illustrated in Figure 1. Step 1 of the reproduction process is described
in Section 4. The results of the simulated experiments are compared with
the results from the original publication [3] to confirm the present
implementation of the DLS techniques in the SG-SD simulator. The results
of the original paper [3] were extracted from the figures using
WebPlotDigitizer (see footnote 1). Agreement between the reproduced and
the original results indicates that the DLS techniques deliver the same
performance as in the original publication and hence verifies their
implementation. A poor implementation of the DLS techniques may lead to
load imbalance that the DLS techniques are meant to avoid.
The selected DLS experiments are reproduced on a state-of-the-art manycore
processor architecture, the KNL, as shown in Step 2 in Figure 1. The KNL
is representative of modern manycore architectures that exhibit a high
degree of parallelism, which makes it an interesting architecture for the
study of DLS. The details of the implementation and the reproduction of
the DLS experiments on the KNL are presented in Section 5. This
reproduction allows examining whether the conclusions drawn from the DLS
experiments described in the original work [3] are influenced by the
underlying system. One can also examine whether the advancements in
computer systems over three decades alter the conclusions of publications
of the past.
1 https://apps.automeris.io/wpd/
The SG-SD simulator is configured (cf. Section 6) to predict the
performance of the selected experiments on the KNL architecture instead of
the RP3 system, as denoted by Step 3 in Figure 1. The native execution results
from Section 5 are compared with the results of the simulated execution
from Section 6 to attain trustworthiness in the SG-SD-based prediction of
the performance of the selected DLS experiments. The proposed reproduc-
tion methodology in Figure 1 can be used in other scheduling studies to
confirm the implementation of scheduling techniques (Step 1) and confirm
the correctness of the simulation (Step 2).
Figure 1: Proposed reproduction and prediction methodology. (Diagram not
reproduced. Its labels are: Step 1, Step 2, and Step 3; Representation of
the RP3; Representation of the KNL; DLS implementation in SG-SD;
Computational kernels representation in SG-SD; Simulation; Native
execution; Execution on RP3 and Execution on KNL, each with DLS
implementation and Computational kernels. Legend: Artifacts in the present
work; Artifacts in [3].)
Selection of the DLS experiments. The original paper [3] compared the
performance of executing three different computational kernels, matrix
multiplication, adjoint convolution, and Gauss-Jordan elimination, on the
RP3 system using four different scheduling techniques: STATIC, SS, GSS,
and FAC. Two matrix sizes were used as input for each of the three
kernels. Two variations of the adjoint convolution kernel were considered:
with increasing task sizes and with decreasing task sizes. All scheduling
experiments in [3] were performed on the RP3 system [4] (see footnote 2).
Algorithm 1: Parallel matrix multiplication (MM)
Input: Matrices A and B, each of size n × n
Output: Matrix C of size n × n
Data: A, B, n
Result: C ← A × B
for k = 1 : n × n do in parallel
    i ← ⌊(k − 1)/n⌋ + 1
    j ← k − n × ⌊(k − 1)/n⌋
    C[i, j] ← 0
    for l = 1 : n do
        C[i, j] ← C[i, j] + A[i, l] × B[l, j]
    end
end
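The flattened loop of Algorithm 1 can be sketched in C as follows. This is a minimal illustration, not the code used for the experiments. It assumes row-major storage and 0-based indexing, so the task index k maps to row k / n and column k % n. The function names mm_element and mm_chunk are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the element of C = A x B addressed by task index k in [0, n*n). */
double mm_element(const double *a, const double *b, size_t n, size_t k) {
    assert(n > 0 && k < n * n);
    size_t i = k / n;  /* row of C */
    size_t j = k % n;  /* column of C */
    double sum = 0.0;
    for (size_t l = 0; l < n; l++) {
        sum += a[i * n + l] * b[l * n + j];
    }
    return sum;
}

/* Execute one chunk of consecutive tasks: start .. start + size - 1. */
void mm_chunk(const double *a, const double *b, double *c, size_t n,
              size_t start, size_t size) {
    for (size_t k = start; k < start + size && k < n * n; k++) {
        c[k] = mm_element(a, b, n, k);
    }
}
```

Every task performs the same number of inner-loop iterations (n), which is why MM represents the case of equal task sizes.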
The matrix multiplication (MM) and adjoint convolution with decreasing
task sizes (AC-d) kernels are selected for reproduction and prediction in
this work, with matrix sizes of 300 × 300 and 75 × 75, respectively. The
computational kernels are described in Algorithms 1 and 2. Larger matrices
are used in the scheduling experiments on the KNL to arrive at an execution
cost on the KNL close to that of the experiments on the RP3 system. Using
larger matrices results in a longer program execution time, so that the
overhead of the execution time measurements and their errors are small
compared to the measured program execution time. This overhead is
negligible: for example, the time measurement function calls require 16.15
microseconds for a program execution time of 329 seconds. Matrix sizes of
5500 × 5500 and 600 × 600 are used for the MM and the AC-d kernels on the
KNL, respectively. The details of the selected experiments are summarized
in Table 1.
2 Each processor had its own local memory, configured in a shared address
space. Every processor-memory element was connected to other elements
using the Omega network [4].
Algorithm 2: Parallel adjoint convolution (AC-d)
Input: Two matrices A and B, each of size n × n
Output: Matrix C of size n × n, where C is the adjoint convolution of
A and B
Data: A, B, C, n, const
for k = 1 : n × n do in parallel
    C[k] ← 0
    for l = k : n × n do
        C[k] ← C[k] + const × A[l] × B[l − k]
    end
end
Table 1: Selected scheduling experiments
Computational kernel                                    Matrix size   Scheduling method      Number of processors
Matrix multiplication (MM)                              300 × 300     STATIC, SS, GSS, FAC   4, 8, 16, 24, 32, 40, 48, 56
Adjoint convolution with decreasing task sizes (AC-d)   75 × 75       STATIC, SS, GSS, FAC   4, 8, 16, 24, 32, 40, 48, 56
The two kernels represent two different task granularities: equal task sizes
and decreasing task sizes. Each iteration of a kernel’s for loop was considered
a task to be scheduled.
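The difference between the two task granularities can be made concrete with a short C sketch. It is illustrative only and assumes 0-based indexing over the flattened matrices of length n × n. The function names are hypothetical. The number of inner-loop iterations is constant for MM and decreases with the task index for AC-d.

```c
#include <assert.h>
#include <stddef.h>

/* Inner-loop iterations of MM task k: the same for every task. */
size_t mm_task_iterations(size_t n, size_t k) {
    (void)k;  /* the task index does not influence the amount of work */
    return n;
}

/* Inner-loop iterations of AC-d task k: decreases as k grows. */
size_t acd_task_iterations(size_t n, size_t k) {
    assert(k < n * n);
    return n * n - k;
}

/* One AC-d task over flattened arrays of length len = n * n. */
double acd_element(const double *a, const double *b, size_t len, size_t k,
                   double constant) {
    double sum = 0.0;
    for (size_t l = k; l < len; l++) {
        sum += constant * a[l] * b[l - k];
    }
    return sum;
}
```

For the 75 × 75 AC-d experiment, the first task executes 5625 inner iterations and the last task executes one, so chunks of equal length carry very different amounts of work.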
The open-source simulator and the raw results obtained from simulated
and native executions for this work are available online [20]. An
EasyBuild (see footnote 3) configuration file is also provided to ensure
the creation of an experimental environment that is similar to the one
used for this work.
3 http://easybuild.readthedocs.io
4 Reproduction of Selected Experiments via Simulation
To confirm the implementation of the four scheduling techniques in SG-SD,
the selected scheduling experiments on the RP3 system [3] are reproduced
using SG-SD [15], and the simulation results are compared with the
original results. Every iteration of a kernel's outer loop is modeled as
an SG-SD sequential computation task. The amount of work contained in each
computational task is specified in the simulator as a number of floating
point operations (FLOP). For both MM and AC-d, the FLOP count of each
iteration is inferred from the pseudocode of the kernel. This number is
used in the simulator as the amount of work in each sequential computation
task. The DLS techniques are implemented using decentralized process
coordination. The pseudocode of the SG-SD simulator of the parallel
execution of the two kernels with DLS techniques is listed in Algorithm 3.
Algorithm 3: SG-SD pseudocode for decentralized process coordination
Input: platformFile, numThreads, kType, pSize, method
Output: simulatedTime
Data: schedulingStep, scheduledTasks, chunkSize, hosts, tasks,
numTasks, dummyTask, dummyComm, changedTasks
numTasks ← pSize × pSize
tasks ← CreateTasks(kType, pSize)
foreach i ∈ numTasks do
    SD_task_watch(tasks[i], SD_DONE)
end
/* Create a computational task to represent the thread creation overhead */
dummyTask ← SD_task_create_comp_seq("createThreads", thread creation
    overhead according to numThreads in FLOP)
SD_task_schedule(dummyTask, hosts[0])
/* Run the simulation until a task is completed */
while !(is_empty(changedTasks = SD_simulate(-1))) do
    for i = 0 : numThreads do
        if (scheduledTasks < numTasks) and is_free(hosts[i]) then
            chunkSize ← calculate_chunk_size(numTasks, numThreads,
                schedulingStep, method)
            /* Create a scheduling overhead task according to the
               scheduling method */
            dummyTask ← SD_task_create_comp_seq("scheduling overhead",
                scheduling overhead in FLOP corresponding to method)
            dummyComm ← SD_task_create_end_end_comm("assigning chunk",
                chunkSize × pSize × 8)
            /* Add dependencies between the chunk calculation overhead,
               the assignment overhead, and the start of the execution
               of the chunk of tasks */
            SD_task_schedule(dummyTask, hosts[i])
            Schedule_comm_A_to_B(dummyComm, hosts[0], hosts[i])
            /* Schedule the chunk of tasks */
            for j = 0 : chunkSize do
                SD_task_schedule(tasks[scheduledTasks], hosts[i])
                Increment scheduledTasks
            end
            Increment schedulingStep
        end
    end
end
Print the simulated time
Terminate the program
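The scheduling loop of Algorithm 3 can be approximated without SimGrid by a small, self-contained C sketch. It is a simplified stand-in and not the SG-SD simulator: it assumes identical task sizes (as in MM), ignores the communication task and the thread creation overhead, and always serves the worker that becomes free first. The names chunk_rule, ss_rule, gss_rule, and simulate_makespan are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Chunk rule: next chunk size given the remaining iterations and P. */
typedef long (*chunk_rule)(long remaining, long p);

long ss_rule(long remaining, long p)  { (void)remaining; (void)p; return 1; }
long gss_rule(long remaining, long p) { return (remaining + p - 1) / p; }

/*
 * Simulated completion time in seconds of n_tasks identical tasks on p
 * workers. Each scheduling step costs overhead_flop on the requesting
 * worker, and each task costs task_flop. Returns -1.0 on allocation failure.
 */
double simulate_makespan(long n_tasks, long p, double task_flop,
                         double overhead_flop, double flops_per_s,
                         chunk_rule rule) {
    assert(p > 0 && flops_per_s > 0.0);
    double *ready = calloc((size_t)p, sizeof *ready);  /* time each worker is free */
    if (ready == NULL) return -1.0;
    long scheduled = 0;
    while (scheduled < n_tasks) {
        long w = 0;  /* worker that becomes available first */
        for (long i = 1; i < p; i++) {
            if (ready[i] < ready[w]) w = i;
        }
        long remaining = n_tasks - scheduled;
        long chunk = rule(remaining, p);
        if (chunk > remaining) chunk = remaining;
        /* Scheduling overhead plus the execution of the whole chunk. */
        ready[w] += (overhead_flop + (double)chunk * task_flop) / flops_per_s;
        scheduled += chunk;
    }
    double makespan = 0.0;
    for (long i = 0; i < p; i++) {
        if (ready[i] > makespan) makespan = ready[i];
    }
    free(ready);
    return makespan;
}
```

Multiplying the returned time by the number of workers gives the parallel cost metric used in this work.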
An SG-SD sequential computation task is created to represent the
scheduling overhead of each DLS technique. This task is scheduled on the
available thread in each simulated scheduling round. The amount of work
performed by each of these scheduling tasks varies and depends on the
selected scheduling technique. The values for the amount of work performed
by each of these scheduling overhead tasks are obtained empirically, to
match the simulation results to the results in the original
publication [3]. Specifically, they are found to be 75, 400, 750, and 750
FLOP for the STATIC, SS, GSS, and FAC techniques, respectively. An SG-SD
end-to-end communication task is also created in each scheduling round to
simulate the time taken to send the assigned chunk of tasks from process 0
to the available process that needs work. It is assumed that, initially,
process 0 stores all the data, and other processes transfer one column of
the matrix from process 0 for every task they obtain. This data strategy
is referred to as pool of tasks and data.
The amounts of computation (FLOP) and communication (Byte) in each loop
iteration for the two selected computational kernels are presented in
Table 2. Two factors, g1 and g2, are used to capture the unknown effects
in the execution of the computational kernels on the RP3 system. These
factors cover all software- and hardware-related aspects that may
influence program execution on the RP3, e.g., memory system and operating
system interference. The factors appear in the task sizes in Table 2, are
unitless, and are experimentally determined to be 35 and 60, respectively.
The SimGrid simulation engine requires the specifications of the simulated
system in the form of an XML platform file. Each processor in the RP3
system is represented as a host in the SimGrid platform file used in the
reproduction experiments. All hosts (processors) are interconnected by
creating a communication link between every host and all others.
Additional details about the RP3 system, such as processor speed (1.562
MFLOP/s), network bandwidth (50 Mbit/s), and latency (2 µs), are extracted
from the work that introduced the RP3 system [4].
All simulations are performed using SG-SD 3.16 on a manycore compute node
with an Intel KNL processor (7210) running at 1.3 GHz, using the CentOS
operating system, version 7.2.1511. The GNU C compiler, version 6.3.0, is
used for the compilation of the simulator with -g -Wall as compilation
flags.
Table 2: Parameters of the computational kernels for their simulation on
the RP3 system.
Computational kernel   Task size (FLOP)                        Communication size (Byte)
MM                     g1 × (5 + 2 × rowLength)                chunkSize × rowLength
AC-d                   g2 × 3 × (matrixSize − iterationID)     chunkSize × rowLength
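The task sizes in Table 2 can be evaluated with a short C sketch. It is a worked example under stated assumptions and not part of the simulator. The function names are hypothetical. For AC-d, matrixSize is assumed to denote the total number of outer-loop iterations (n × n), with iterationID counted from 0. The processor speed is the RP3 value quoted in the text.

```c
#include <assert.h>

/* Table 2, MM: g1 x (5 + 2 x rowLength) FLOP per task. */
double mm_task_flop(double g1, long row_length) {
    assert(row_length > 0);
    return g1 * (double)(5 + 2 * row_length);
}

/* Table 2, AC-d: g2 x 3 x (matrixSize - iterationID) FLOP per task. */
double acd_task_flop(double g2, long matrix_size, long iteration_id) {
    assert(iteration_id >= 0 && iteration_id < matrix_size);
    return g2 * 3.0 * (double)(matrix_size - iteration_id);
}

/* Simulated duration of a task on a host of the given speed (FLOP/s). */
double task_seconds(double flop, double flops_per_s) {
    assert(flops_per_s > 0.0);
    return flop / flops_per_s;
}
```

With g1 = 35 and a row length of 300, an MM task amounts to 21175 FLOP, which corresponds to roughly 13.6 ms on a 1.562 MFLOP/s host. With g2 = 60 and the assumed matrixSize of 5625, the first AC-d task amounts to 1012500 FLOP and the last to 180 FLOP.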
Figure 2: Simulation results for the selected DLS experiments on the RP3
system using decentralized process coordination (SG-SD-D), obtained with
SG-SD (red bars), compared with the simulation results for the selected
DLS experiments on the RP3 system using centralized process coordination
(SG-SD-C) [13] (blue bars) and with the results of the original
publication [3] (black bars). Parallel cost = parallel program execution
time × number of threads. Panels: (a) STATIC, MM; (b) SS, MM; (c) GSS, MM;
(d) FAC, MM; (e) STATIC, AC-d; (f) SS, AC-d; (g) GSS, AC-d; (h) FAC, AC-d.
Each panel plots the parallel cost (s) against the number of processors
(4, 8, 16, 24, 32, 40, 48, 56). Legend: SG-SD-D simulation of RP3;
original native execution on RP3; SG-SD-C simulation of RP3.
Results of reproduction. Parallel cost is selected as the performance
metric (over the parallel execution time) because it was used in the
original publication [3] that this work compares against. The parallel
cost reflects the sum of the time that each processing element spends
solving the problem [21]. The simulative performance for executing MM and
AC-d using a decentralized coordination approach with SG-SD (SG-SD-D)
compared against the original native performance results [3] is
illustrated in Figure 2. These results show that the simulation
performance is close to the native performance in the original
publication. The simulative performance of the same experiments using
centralized process coordination [13] (SG-SD-C) is also compared against
the original native performance results [3] in Figure 2. The percent error
(%E) between the simulative execution time in this work ($T_{sim}$) and
the original native execution time ($T^{o}_{nat}$) [3] is calculated as:
$$\%E = \left(1 - \frac{T_{sim}}{T^{o}_{nat}}\right) \times 100.$$
A positive percent error %E indicates that the simulator underestimates
the original execution time, while a negative %E signifies overestimation.
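The two metrics used in this comparison can be written as a minimal C sketch. The function names are hypothetical, and the sketch only restates the definitions given in the text: the parallel cost as the parallel execution time multiplied by the number of threads, and the percent error with the original native time as the reference.

```c
#include <assert.h>

/* Parallel cost = parallel program execution time x number of threads. */
double parallel_cost(double exec_time_s, int threads) {
    assert(threads > 0);
    return exec_time_s * (double)threads;
}

/*
 * Percent error between the simulated time t_sim and the original native
 * time t_nat. Positive: the simulator underestimates; negative: it
 * overestimates.
 */
double percent_error(double t_sim, double t_nat) {
    assert(t_nat > 0.0);
    return (1.0 - t_sim / t_nat) * 100.0;
}
```

For example, with hypothetical times, a simulated time of 50 s against a native time of 100 s yields a percent error of 50 (underestimation), whereas a simulated time of 150 s yields a percent error of -50 (overestimation).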
The minimum absolute %E between SG-SD-D and the native execution is
0.073%, for GSS with AC-d and 56 threads, as can be observed from
Figure 2(g). The maximum absolute %E is 45.89%, for SS with MM and 4
threads, as can be observed from Figure 2(b). The average of the absolute
%E between the SG-SD-D simulation results of the present work and the
native results on the RP3 system is 10.89% over all scheduling
experiments. For the results with centralized process coordination [13],
the minimum and the maximum absolute %E are 0.49% and 30.94%, for GSS with
AC-d and 24 threads (see Figure 2(g)) and for SS with AC-d and 56 threads
(see Figure 2(f)), respectively. The average of the absolute %E between
the simulative results [13] (SG-SD-C) and the native execution results [3]
is 7.44%. The simulative results follow a similar trend to the original
native experiments, which is of high relevance for the comparison of
different scheduling techniques. These results confirm that the
implementation of the considered DLS techniques in SG-SD adheres to their
implementation used in the original publication [3].
5 Reproduction of Selected Experiments via Native Execution
For the purpose of reproducing the selected experiments [3] via native exe-
cution in the present work, the two computational kernels were implemented
in C. Their parallelization considers the scheduling techniques STATIC, SS,
GSS, and FAC using Pthreads [22]. The Pthreads threading library is chosen
due to its lightweight threading and its efficiency in communication and data
exchange on shared memory computing systems.
A decentralized process coordination is used in parallelizing the com-
putational kernels with the DLS techniques. The main thread, thread 0,
creates a number of threads equal to the number of cores in the current
experiment minus one (the main thread). All threads, including the main
thread, execute the code described in Algorithm 4. Each thread obtains
work using the obtain_work function described in Algorithm 5. The program
holds two global variables that represent the current state of the program:
schedulingStep and currentIndex. The currentIndex represents the index of
the parallelized outer loop of the computational kernels and indicates the
program progress. The obtain_work function updates these two variables
after each work assignment to advance the program state. Updates to these
global variables (Lines 1 and 3 in Algorithm 5) are performed using atomic
operations to avoid data races between parallel threads. The size of the
allocated chunk of work is calculated by the selected loop scheduling
technique. All threads are pinned to the cores of the experimental platform
using the scatter strategy, to ensure better and more stable performance
across execution runs.
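The chunk size calculation itself is not shown in this section. The following C sketch illustrates how a chunk size could be derived from the scheduling step alone, assuming the commonly cited rules: STATIC assigns ceil(N/P) iterations, SS assigns one iteration, GSS assigns ceil(R/P) of the remaining R iterations, and FAC assigns batches of P equal chunks of size ceil(R/(2P)) (the practical factoring variant with a factor of 2). The enum, the extra method parameter, and the helper ceil_div are hypothetical and may differ from the original code.

```c
#include <assert.h>

typedef enum { DLS_STATIC, DLS_SS, DLS_GSS, DLS_FAC } dls_method;

/* Integer ceiling division for non-negative a and positive b. */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Chunk size for the given scheduling step (0-based). The remaining
 * work is replayed from step 0, so the result depends only on the
 * step and no extra shared state is needed. */
long calculate_chunk(dls_method method, long num_threads,
                     long num_tasks, long step)
{
    assert(num_threads > 0 && num_tasks >= 0 && step >= 0);
    long remaining = num_tasks;
    long chunk = 0;

    switch (method) {
    case DLS_STATIC: /* one equal share per thread */
        return ceil_div(num_tasks, num_threads);
    case DLS_SS:     /* self-scheduling: one iteration at a time */
        return 1;
    case DLS_GSS:    /* guided: 1/P of what is left at each step */
        for (long i = 0; i <= step; ++i) {
            chunk = ceil_div(remaining, num_threads);
            remaining -= chunk;
        }
        break;
    case DLS_FAC:    /* factoring: P chunks per batch, half of what is left */
        for (long b = 0; b <= step / num_threads; ++b) {
            chunk = ceil_div(remaining, 2 * num_threads);
            remaining -= chunk * num_threads;
            if (remaining < 0)
                remaining = 0;
        }
        break;
    }
    return chunk > 0 ? chunk : 1; /* never hand out an empty chunk */
}
```

With 100 iterations and 4 threads, this sketch yields GSS chunks of 25, 19, 14, and so on, and FAC batches of four chunks of size 13, then 6, then 3. Replaying the remaining work costs time proportional to the step, which is acceptable for a sketch; a production implementation would more likely use a closed form.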
The parallel cost is reported for each experiment. This cost is calculated
as the product of the program’s parallel execution time and the number of
threads. A script that runs the experiments and calculates the confidence
interval executes each experiment at least 20 and at most 100 times. For all
the experiments, the script stopped after 20 runs, which yielded a confidence
interval of less than 5% at a confidence level of 95%.
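The measurement procedure above can be sketched in C as follows. This is an illustration under stated assumptions, not the script used in this work: the 5% bound is interpreted as the half-width of the confidence interval relative to the sample mean, and the 95% interval uses the normal approximation (z = 1.96) rather than a Student t quantile. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Parallel cost = parallel execution time x number of threads. */
double parallel_cost(double exec_time_s, int num_threads)
{
    assert(exec_time_s >= 0.0 && num_threads > 0);
    return exec_time_s * num_threads;
}

/* True if the 95% confidence interval half-width is at most
 * rel_tol times the sample mean. Squared quantities are compared
 * so that no square root is required. */
bool ci_within_tolerance(const double *times, int n, double rel_tol)
{
    assert(n >= 2 && rel_tol > 0.0);
    double mean = 0.0;
    for (int i = 0; i < n; ++i)
        mean += times[i];
    mean /= n;

    double var = 0.0; /* unbiased sample variance */
    for (int i = 0; i < n; ++i)
        var += (times[i] - mean) * (times[i] - mean);
    var /= (n - 1);

    const double z = 1.96; /* assumption: normal approximation, 95% level */
    double half_width_sq = z * z * var / n;
    double bound_sq = (rel_tol * mean) * (rel_tol * mean);
    return half_width_sq <= bound_sq;
}

/* Stopping rule: at least 20 runs, at most 100 runs, and stop early
 * once the interval is narrower than 5% of the mean. */
bool should_stop(const double *times, int n)
{
    if (n < 20)
        return false;
    if (n >= 100)
        return true;
    return ci_within_tolerance(times, n, 0.05);
}
```

A driver would append each measured execution time to the array and call should_stop after every run.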
Reproduction Results The performance results in terms of parallel cost
of the native execution of the DLS techniques implemented using decentral-
ized process coordination are compared to the performance results of the
native execution using the centralized approach [13] in Figure 3. As can be
observed from Figure 3, the parallel cost of the centralized process coordina-
tion increases as the number of threads increases. This can be attributed to
the effect of multiple threads competing to lock and unlock the work queue
while the master serves work requests and updates the same data structure
concurrently. In the present decentralized implementation, threads obtain
work on their own from a pool of tasks (accessed via a shared loop index)
whenever they become available. The updates of the shared variables between
threads are performed using atomic operations. Therefore, the decentralized
DLS implementation does not exhibit the same parallel cost increase as the
centralized implementation [13] as the number of threads increases.

Algorithm 4: Decentralized DLS — thread execution
Input: threadID
Output: void
Global data: method, schedulingStep, currentIndex
Local data: start, chunkSize
1 while True do
2     if obtain_work(start, chunkSize) then
3         execute_kernel(start, chunkSize)
4     else
5         break
6     end
7 end
/* Exit thread */
8 return NULL

Algorithm 5: Decentralized DLS — obtain work
Input: method, schedulingStep, currentIndex
Output: start, chunkSize
Global data: method, schedulingStep, currentIndex, numTasks
Local data: myStep
1 myStep ← fetch_and_add(schedulingStep, 1)
2 chunkSize ← calculate_chunk(numThreads, numTasks, myStep)
3 start ← fetch_and_add(currentIndex, chunkSize)
4 if start < numTasks then
5     if start + chunkSize >= numTasks then
6         chunkSize ← numTasks − start
7     return True
8 else
9     return False
10 end
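Algorithms 4 and 5 can be sketched in C with C11 atomics as shown below. This is a minimal illustration, not the code used in this work: the fixed-size chunk rule in calculate_chunk is a placeholder, and execute_kernel, visits, dls_init, and dls_thread are hypothetical names. The kernel here only counts how often each iteration is executed, so that the work distribution can be checked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_long schedulingStep; /* number of chunks handed out so far */
static atomic_long currentIndex;   /* next unassigned loop iteration */
static long numTasks, numThreads;
static int *visits;                /* hypothetical kernel output */

/* Placeholder chunk rule (assumption): fixed-size chunks, step unused. */
long calculate_chunk(long threads, long tasks, long step)
{
    (void)step;
    long chunk = tasks / (4 * threads);
    return chunk > 0 ? chunk : 1;
}

/* Algorithm 5: claim the next chunk using two atomic fetch-and-add. */
bool obtain_work(long *start, long *chunkSize)
{
    long myStep = atomic_fetch_add(&schedulingStep, 1);         /* Line 1 */
    *chunkSize = calculate_chunk(numThreads, numTasks, myStep); /* Line 2 */
    *start = atomic_fetch_add(&currentIndex, *chunkSize);       /* Line 3 */
    if (*start >= numTasks)
        return false;                    /* no iterations left */
    if (*start + *chunkSize >= numTasks)
        *chunkSize = numTasks - *start;  /* trim the last chunk */
    return true;
}

/* Stand-in kernel: chunks are disjoint, so plain writes do not race. */
void execute_kernel(long start, long chunkSize)
{
    for (long i = start; i < start + chunkSize; ++i)
        visits[i] += 1;
}

/* Algorithm 4: body executed by every thread, including the main one. */
void *dls_thread(void *arg)
{
    (void)arg;
    long start, chunkSize;
    while (obtain_work(&start, &chunkSize))
        execute_kernel(start, chunkSize);
    return NULL;
}

/* Reset the shared state before the threads are started. */
void dls_init(long threads, long tasks, int *visit_counts)
{
    assert(threads > 0 && tasks >= 0);
    numThreads = threads;
    numTasks = tasks;
    visits = visit_counts;
    atomic_store(&schedulingStep, 0);
    atomic_store(&currentIndex, 0);
}
```

To reproduce the structure described above, the main thread would call dls_init, pass dls_thread to pthread_create for each of the numThreads - 1 additional threads, call dls_thread(NULL) itself, and finally join the created threads with pthread_join. Because each fetch-and-add on currentIndex returns a distinct starting index, every iteration is claimed by exactly one thread.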
The performance behavior of the four loop scheduling techniques is com-
pared in the execution of the selected experiments on the KNL processor,
illustrated in Figure 3, and the execution on the RP3 system, illustrated in
Figure 2. The performance trend of the scheduling techniques in Figure 3 is
comparable to that in Figure 2 for the AC-d kernel, as can be observed by
comparing sub-figures (e,f,g,h) in both Figure 2 and Figure 3, yet the
absolute performance is different. However, for the MM kernel, the
performance trend and absolute values of the scheduling techniques in the
past and in the present differ significantly. The original results, in
Figure 2(b) and (d), suggest that SS and FAC yield similar performance.
Examining the results in Figure 3(b) and (d), one can notice a significantly
different performance from the one in Figure 2(b) for the same computational kernel. In
the present results, STATIC, GSS, and FAC outperform SS, which exhibits
the poorest performance for the MM kernel. The poor performance of SS
is due to its large scheduling overhead and the fine granularity of the MM
loop iteration for the matrix size under study. The achieved trust in the
implementation of DLS techniques for shared-memory systems has already
been transferred to their implementation for distributed-memory systems [7].
These experiments are performed on a single standalone 64-core Intel Xeon
Phi (KNL) processor, version 7210, running the CentOS operating system with
Linux kernel version 3.10.0-327.el7.x86_64. The computational kernel codes
are compiled using the GNU C compiler version 6.3.0, with the -O3 -mavx512f
-mavx512cd -mavx512er -mavx512pf optimization flags. The KNL processor is
booted in the memory mode [23]. As the computational kernels under study are
not memory intensive and the main focus of this work is on studying the
effect of scheduling on the execution time, the MCDRAM is not used, and all
memory allocations are performed on the regular DDR4 memory. To make the
results in this work reproducible, the GNU C compiler is used instead of the
Intel C compiler. The compilation flags listed above are used to optimize
the generated code for the KNL processor and obtain the best possible
performance using the GNU C compiler.
Figure 3: Parallel cost of the native execution of selected DLS experiments on the KNL processor with DLS implemented using a decentralized process coordination (dark blue bars) compared with the native execution results with DLS implemented using a centralized (master-worker) process coordination (green bars). Parallel cost = parallel program execution time × number of threads.
Panels: (a) STATIC — MM, (b) SS — MM, (c) GSS — MM, (d) FAC — MM, (e) STATIC — AC-d, (f) SS — AC-d, (g) GSS — AC-d, (h) FAC — AC-d.
Axes: number of cores (4, 8, 16, 24, 32, 40, 48, 56) against parallel cost (s). Legend: native centralized execution on KNL; native decentralized execution on KNL.
6 Prediction of DLS Performance via Simulation
The SG-SD simulator from Section 4 is used to predict the performance of
the DLS experiments of interest on the KNL processor using simulation. A
close agreement between the performance prediction using simulation and
the native execution represents an experimental validation of the simulation.
The experimental validation of the simulation is essential to attain trust-
worthiness in the prediction results of the simulation in future experiments.
Accurate parameters that describe the execution on the KNL processor are
required as input to the simulator. In this work, three such parameters have
been identified: (1) Pthreads library: Creation of threads; (2) Scheduling
overhead; and (3) Task execution time.
Timers are inserted in the source code of the computational kernels around
the functions that represent these parameters. For instance, timers are in-
serted before and after Line 2 in Algorithm 4 to measure the scheduling
overhead and before and after Line 3 in Algorithm 4 to measure the task
execution time. The measurement procedure described in Section 5 is used
to ensure the accuracy of the time measurements. These time overheads (in
the microsecond range) are multiplied by the nominal computing speed of a
KNL core of 41,600 MFLOP/s to obtain the computational effort in FLOP.
The scheduling overheads and task execution times are read from a file
during simulation. This way, the simulator can account for these overheads
and make more accurate predictions of the execution time. As all threads on
KNL share the available memory, there is no inter-thread communication over
the network. Therefore, the communication size is set to 0 bytes in the
SG-SD simulator for both computational kernels.
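The conversion from a measured time to a computational effort can be illustrated with the short C sketch below. The function name time_us_to_flop and the macro name are hypothetical; only the nominal speed of 41,600 MFLOP/s is taken from the text.

```c
#include <assert.h>

/* Nominal speed of one KNL core as stated in the text, in MFLOP/s. */
#define KNL_CORE_SPEED_MFLOPS 41600.0

/* Convert a measured time in microseconds into the equivalent
 * amount of work in FLOP at the nominal core speed. */
double time_us_to_flop(double time_us)
{
    assert(time_us >= 0.0);
    double seconds = time_us * 1e-6;
    double flop_per_second = KNL_CORE_SPEED_MFLOPS * 1e6;
    return seconds * flop_per_second;
}
```

Since one microsecond at 41,600 MFLOP/s corresponds to 41,600 FLOP, a measured scheduling overhead of 2.5 microseconds translates to roughly 104,000 FLOP of simulated work.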
Figure 4: Parallel cost of the execution of selected DLS experiments obtained with SG-SD (red bars) compared with the cost of the native execution results on the KNL processor (dark blue bars). Parallel cost = parallel program execution time × number of threads.
Panels: (a) STATIC — MM (contains the maximum %E), (b) SS — MM, (c) GSS — MM, (d) FAC — MM (contains the minimum %E), (e) STATIC — AC-d, (f) SS — AC-d, (g) GSS — AC-d, (h) FAC — AC-d.
Axes: number of cores (4, 8, 16, 24, 32, 40, 48, 56) against parallel cost (s). Legend: native decentralized execution on KNL; SG-SD-D simulation of KNL.
The theoretical speed of a single core of the KNL is calculated using infor-
mation from the publication that introduced it [23]. The single core speed of
the KNL is found to be 41.6 GFLOP/s. Even though the SG-SD simulator,
in this experiment, simulates a shared memory system and the network is
unused, the parameters for describing the network are still required in the
platform file. The values that describe the processor speed, network
bandwidth, and network latency of the KNL system used in these experiments
are 41,600 MFLOP/s, 100 Gbit/s, and 100 ns, respectively. SimGrid is a
multithreaded simulator. To reduce the simulation time of SimGrid-based
experiments, the simulator is executed in parallel on the hyperthreaded KNL
processor, with 256 hardware threads.
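The derivation of the single-core speed is not spelled out in the text. One calculation that is consistent with the stated value of 41.6 GFLOP/s is given below, assuming the 1.3 GHz base clock of the Xeon Phi 7210, two 512-bit vector units per core, eight double-precision lanes per unit, and a fused multiply-add counted as two floating-point operations. These factors are assumptions made here for illustration.

```latex
\[
  \underbrace{1.3\ \text{GHz}}_{\text{clock}}
  \times \underbrace{2}_{\text{vector units}}
  \times \underbrace{8}_{\text{double-precision lanes}}
  \times \underbrace{2}_{\text{FLOP per FMA}}
  = 41.6\ \text{GFLOP/s} = 41{,}600\ \text{MFLOP/s}
\]
```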
Performance Prediction Results The results of the execution of the
selected DLS experiments on the KNL are compared to the simulation-based
prediction results obtained with SG-SD-D in Figure 4. The results show that
the simulated execution behavior is in agreement with the native execution
for the four different DLS techniques in executing the two computational
kernels under test.
The percent error %E between the results of the simulation of KNL execu-
tion and the results of KNL execution is calculated as described in Section 4.
The minimum absolute %E is 0.00948%, for FAC — MM with 4 threads as
can be observed from Figure 4(d). The maximum absolute %E is 21.42%, for
STATIC — MM with 40 threads as can be observed from Figure 4(a). The
average of the absolute %E is 1.94% between all the scheduling experiments
on the KNL and their corresponding SG-SD simulation presented in this
work. The average percent error values correspond to acceptable differences.
More importantly, the performance trends of the studied DLS techniques
are similar between the native and simulative executions. Therefore, it
can be stated that the simulator predicts the performance of the
two computational kernels with the four implemented scheduling
techniques on the present computing system.
7 Conclusion and Future Work
In this work, the reproduction of the behavior of two computational ker-
nels [3] has been used to confirm the adherence of the present implemen-
tation of four scheduling techniques to the original specification [3]. The
achieved trust in the implementation of DLS techniques for shared-memory
systems has also been transferred to their implementation for distributed-
memory systems [7]. Moreover, the reproduction, in the present, of previous
scheduling experiments on modern hardware, is used to evaluate the hypoth-
esis that the results and conclusions from past experiments are influenced by
the modern software stack and hardware systems used in the present work.
In contrast to the earlier results of Flynn Hummel et al. [3], which indicate
that both FAC and SS perform comparably, this work shows that using the SS
technique is significantly inefficient for the particular form of the MM
kernel considered herein. This behavior can be attributed to the massive
increase in the hardware computing speed since 1992 [3], and to the fact
that the MM loop iterations that were considered of large granularity in
earlier work are now shown to be of small granularity. Consequently,
the overhead of allocating work in SS is larger than the time to execute an
MM loop iteration. Hence, the performance of the SS technique is presently
dominated by the scheduling overhead. The main contribution of this work
is a confirmation of the hypothesis that reproducing experiments of identical
scheduling scenarios on past and modern hardware may lead to an entirely
different behavior from what is expected.
A comprehensive study of the performance behavior of dynamic loop
scheduling techniques for various applications and on several architectures is
envisioned as part of future work. This work lays the foundation and moti-
vates the reproduction and experimental verification of other DLS techniques
and their implementations in other simulators. The study of the performance
of scientific applications with various DLS techniques under perturbations
and failures in the computing system is envisioned as future work.
Acknowledgment
This work is partly funded by the Swiss National Science Foundation in
the context of the “Multi-level Scheduling in Large Scale High Performance
Computers” (MLS) grant, number 169123.
References
[1] R. L. Cariño and I. Banicescu, “Dynamic load balancing with adaptive
factoring methods in scientific applications,” The Journal of Supercom-
puting, vol. 44, no. 1, pp. 41–63, 2008.
[2] A. Eleliemy, A. Mohammed, and F. M. Ciorba, “Efficient generation of
parallel spin-images using dynamic loop scheduling,” in Proceedings of
the 19th IEEE International Conference for High Performance Comput-
ing and Communications Workshops, pp. 34–41, 2017.
[3] S. Flynn Hummel, E. Schonberg, and L. E. Flynn, “Factoring: A method
for scheduling parallel loops,” Communications of the ACM, vol. 35,
no. 8, pp. 90–101, 1992.
[4] G. F. Pfister, W. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder,
K. P. McAuliffe, E. Melton, A. Norton, and J. Weiss, “The IBM research
parallel processor prototype (RP3): Introduction and architecture,” in
International Conference on Parallel Processing, pp. 764–772, 1985.
[5] P. Tang and P.-C. Yew, “Processor self-scheduling for multiple-nested
parallel loops,” in International Conference on Parallel Processing,
vol. 86, pp. 528–535, 1986.
[6] C. D. Polychronopoulos and D. J. Kuck, “Guided self-scheduling: A
practical scheduling scheme for parallel supercomputers,” IEEE
Transactions on Computers, vol. C-36, no. 12, pp. 1425–1439, 1987.
[7] A. Mohammed, A. Eleliemy, F. M. Ciorba, F. Kasielke, and I. Banicescu,
“Experimental verification and analysis of dynamic loop scheduling in
scientific applications,” in Proceedings of the 17th International Sympo-
sium on Parallel and Distributed Computing, p. 8, 2018.
[8] S. Flynn Hummel, J. Schmidt, R. Uma, and J. Wein, “Load-sharing
in heterogeneous systems via weighted factoring,” in Proceedings of the
eighth Annual ACM Symposium on Parallel Algorithms and Architec-
tures, pp. 318–328, ACM, 1996.
[9] I. Banicescu, V. Velusamy, and J. Devaprasad, “On the scalability of
dynamic scheduling scientific applications with adaptive weighted fac-
toring,” Cluster Computing, vol. 6, no. 3, pp. 215–226, 2003.
[10] I. Banicescu and Z. Liu, “Adaptive Factoring: A dynamic scheduling
method tuned to the rate of weight changes,” in Proceedings of the High
Performance Computing Symposium, pp. 122–129, 2000.
[11] ACM, “Artifact review and badging.”
https://www.acm.org/publications/policies/artifact-review-badging,
2016. [Online; accessed 24 October 2017].
[12] S. Hunold and J. L. Träff, “On the state and importance of
reproducible experimental research in parallel computing,” Computing
Research Repository, vol. abs/1308.3648, 2013.
[13] A. Mohammed, A. Eleliemy, and F. M. Ciorba, “Towards the reproduc-
tion of selected dynamic loop scheduling experiments using SimGrid-
SimDag.” Poster at IEEE International Conference on High Performance
Computing and Communications, 2017.
[14] A. Eleliemy, A. Mohammed, and F. M. Ciorba, “Exploring the relation
between two levels of scheduling using a novel simulation approach,” in
Proceedings of 16th International Symposium on Parallel and Distributed
Computing, pp. 26–33, 2017.
[15] H. Casanova, A. Giersch, A. Legrand, M. Quinson, and F. Suter,
“Versatile, scalable, and accurate simulation of distributed applications
and platforms,” Journal of Parallel and Distributed Computing, vol. 74, no. 10,
pp. 2899–2917, 2014.
[16] M. Balasubramanian, N. Sukhija, F. M. Ciorba, I. Banicescu, and S. Sri-
vastava, “Towards the scalability of dynamic loop scheduling techniques
via discrete event simulation,” in Proceedings of the 26th IEEE In-
ternational Parallel and Distributed Processing Symposium Workshops,
pp. 1343–1351, 2012.
[17] N. Sukhija, I. Banicescu, S. Srivastava, and F. M. Ciorba, “Evaluating
the flexibility of dynamic loop scheduling on heterogeneous systems in
the presence of fluctuating load using SimGrid,” in Proceedings of the
27th IEEE International Parallel and Distributed Processing Symposium
Workshops and PhD Forum, pp. 1429–1438, 2013.
[18] N. Sukhija, I. Banicescu, and F. M. Ciorba, “Investigating the resilience
of dynamic loop scheduling in heterogeneous computing systems,” in
Proceedings of the 14th International Symposium on Parallel and Dis-
tributed Computing, pp. 194–203, 2015.
[19] F. Hoffeins, F. M. Ciorba, and I. Banicescu, “Towards the reproducibil-
ity of using dynamic loop scheduling techniques in scientific applica-
tions,” in Proceedings of 16th International Symposium on Parallel and
Distributed Computing, 2017.
[20] A. Mohammed, A. Eleliemy, and F. M. Ciorba, “Performance reproduc-
tion and prediction of selected dynamic loop scheduling experiments.”
https://drive.switch.ch/index.php/s/5Ah1dpQb5SUU9ce, 2018. [On-
line; accessed 18 February 2018].
[21] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Par-
allel Computing: Design and Analysis of Algorithms, vol. 400. Ben-
jamin/Cummings Redwood City, 1994.
[22] B. Nichols, D. Buttlar, and J. Farrell, Pthreads programming: A POSIX
standard for better multiprocessing. O’Reilly Media, Inc., 1996.
[23] A. Sodani, “Knights Landing (KNL): 2nd generation Intel Xeon Phi
processor,” in Hot Chips 27 Symposium, pp. 1–24, 2015.
