Mississippi State University

Scholars Junction
Theses and Dissertations

8-14-2015

Performance Analysis and Evaluation of Divisible Load Theory
and Dynamic Loop Scheduling Algorithms in Parallel and
Distributed Environments
Mahadevan Balasubramaniam

Recommended Citation
Balasubramaniam, Mahadevan, "Performance Analysis and Evaluation of Divisible Load Theory and
Dynamic Loop Scheduling Algorithms in Parallel and Distributed Environments" (2015). Theses and
Dissertations. 3494.
https://scholarsjunction.msstate.edu/td/3494

This Dissertation - Open Access is brought to you for free and open access by the Theses and Dissertations at
Scholars Junction. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of
Scholars Junction. For more information, please contact scholcomm@msstate.libanswers.com.

Performance analysis and evaluation of divisible load theory
and dynamic loop scheduling algorithms in
parallel and distributed environments

By
Mahadevan Balasubramaniam

A Dissertation
Submitted to the Faculty of
Mississippi State University
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
in Computer Science
in the Department of Computer Science and Engineering

Mississippi State, Mississippi
August 2015

Copyright by
Mahadevan Balasubramaniam
2015

Performance analysis and evaluation of divisible load theory
and dynamic loop scheduling algorithms in
parallel and distributed environments

By
Mahadevan Balasubramaniam
Approved:

Ioana Banicescu
(Major Professor)

Edward A. Luke
(Committee Member)

Edward B. Allen
(Committee Member)

Song Zhang
(Committee Member)

T. J. Jankun-Kelly
(Graduate Coordinator)

Jason M. Keith
Interim Dean
Bagley College of Engineering

Name: Mahadevan Balasubramaniam
Date of Degree: August 14, 2015
Institution: Mississippi State University
Major Field: Computer Science
Major Professor: Dr. Ioana Banicescu
Title of Study: Performance analysis and evaluation of divisible load theory and dynamic loop scheduling algorithms in parallel and distributed environments
Pages of Study: 185
Candidate for Degree of Doctor of Philosophy

High performance parallel and distributed computing systems are used to solve large,
complex, data parallel scientific applications that require enormous computational
power. Data parallel workloads, which require performing similar operations on different
data objects, are present in a large number of scientific applications, such as N-body
simulations and Monte Carlo simulations, and are expressed in the form of loops. Data
parallel workloads that lack precedence constraints are called arbitrarily divisible
workloads, and are amenable to easy parallelization. Load imbalance that arises during the
execution of scientific applications from various sources, such as application, algorithmic,
and systemic characteristics, degrades performance. Scheduling of arbitrarily divisible
workloads to address load imbalance, in order to obtain better utilization of computing
resources, is a major area of research.

Divisible load theory (DLT) and dynamic loop scheduling (DLS) algorithms are two
algorithmic approaches employed in the scheduling of arbitrarily divisible workloads.
Despite sharing the same goal of achieving load balancing, the two approaches are
fundamentally different. Divisible load theory algorithms are linear, deterministic, and
platform dependent, whereas dynamic loop scheduling algorithms are probabilistic and
platform agnostic. Divisible load theory algorithms have traditionally been used for
performance prediction in environments characterized by known or expected variation in the
system characteristics at runtime. Dynamic loop scheduling algorithms are designed to
simultaneously address all the sources of load imbalance that stochastically arise at
runtime from application, algorithmic, and systemic characteristics.

In this dissertation, an analysis and performance evaluation of DLT and DLS algorithms
is presented in the form of a scalability study and a robustness investigation. The effect
of network topology on their performance is studied. A hybrid scheduling approach that
integrates DLT and DLS algorithms is also proposed. The hybrid approach combines
the strengths of DLT and DLS algorithms, improves the performance of scientific
applications running in large scale parallel and distributed computing environments, and
delivers performance superior to that which can be obtained by applying DLT algorithms in
isolation. The range of conditions for which the hybrid approach is useful is also
identified and discussed.

Key words: data parallel workloads, arbitrarily divisible workloads, divisible load theory,
dynamic loop scheduling, scalability, robustness, parallel and distributed computing, 3D
torus, cluster

DEDICATION

To my wife, Padmabala Venugopal, and kids, Keshav Hari Mahadevan and Aditya Rohan Mahadevan.


ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my major professor, Dr. Ioana Banicescu,
for giving me an opportunity to work under her guidance, and for the constant academic
and moral support throughout the course of the PhD program. I would like to thank my
committee, Dr. Edward B. Allen, Dr. Edward A. Luke, and Dr. Song Zhang for their
valuable contributions. I would also like to thank Dr. Florina M. Ciorba for the technical
advice, and Drs. Sivakumar Kulasekaran and Srishti Srivastava, for their help.


TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
NOMENCLATURE

CHAPTER

1. INTRODUCTION
   1.1 Scientific applications and their performance
   1.2 Motivation
   1.3 Problem statement
   1.4 Objectives
   1.5 Method of evaluation
   1.6 Organization

2. BACKGROUND AND RELATED WORK
   2.1 Data parallel workloads
   2.2 Scheduling using divisible load theory (DLT)
       2.2.1 An illustration
       2.2.2 Network topologies
       2.2.3 Multi-source scheduling
       2.2.4 Multi-round scheduling
       2.2.5 Result collection
       2.2.6 Applications
       2.2.7 Linear programming
       2.2.8 Others
   2.3 Scheduling using dynamic loop scheduling (DLS) algorithms
       2.3.1 Fixed size chunking
       2.3.2 Guided self-scheduling
       2.3.3 Factoring
       2.3.4 Factoring variants
       2.3.5 Adaptive factoring
   2.4 Robustness of scheduling algorithms
       2.4.1 Static robust schedule
       2.4.2 Robustness metric
   2.5 Conclusions

3. A FRAMEWORK FOR PERFORMANCE EVALUATION
   3.1 Performance evaluation
   3.2 Performance evaluation environments
       3.2.1 Bricks
       3.2.2 MicroGrid
       3.2.3 GridSim
       3.2.4 SimGrid
   3.3 Overview of SimGrid
       3.3.1 Computation model
       3.3.2 Communication model
   3.4 Simulator design
       3.4.1 DLT simulation sequence flow
       3.4.2 System of linear equations
       3.4.3 DLS simulation sequence flow
   3.5 Conclusions

4. APPLICATION AND PLATFORM MODELING
   4.1 Applications
       4.1.1 The embarrassingly parallel (EP) NAS benchmark
             4.1.1.1 Algorithm
             4.1.1.2 Implementation
       4.1.2 The integer sort (IS) NAS benchmark
             4.1.2.1 The sorting problem
             4.1.2.2 Key generation algorithm
             4.1.2.3 Implementation
       4.1.3 Applications with different CCRs
   4.2 Platforms
       4.2.1 Star
       4.2.2 Cluster
       4.2.3 3D torus
       4.2.4 Fat-tree
   4.3 Conclusions

5. A SCALABILITY STUDY OF DLT AND DLS ALGORITHMS
   5.1 Why scalability?
   5.2 Strong scaling versus weak scaling
   5.3 DLT algorithms
       5.3.1 EP benchmark
             5.3.1.1 Analytical modeling
             5.3.1.2 Efficiency analysis
             5.3.1.3 Fastest parallel execution time
             5.3.1.4 Isoefficiency analysis
       5.3.2 IS benchmark
             5.3.2.1 Analytical modeling
             5.3.2.2 Efficiency analysis
             5.3.2.3 Fastest parallel execution time
             5.3.2.4 Isoefficiency analysis
   5.4 DLS algorithms
   5.5 Conclusions

6. ROBUSTNESS ANALYSIS OF DLT ALGORITHMS
   6.1 Why Robustness?
   6.2 FePIA procedure
       6.2.1 Performance feature and perturbation parameters
   6.3 Modeling perturbations
   6.4 Robustness prediction
       6.4.1 Theorem 1
       6.4.2 Theorem 2
   6.5 Simulation results
       6.5.1 Robustness analysis
   6.6 Conclusions

7. EFFECT OF TOPOLOGY ON SCHEDULING
   7.1 Routing and congestion
   7.2 Star network
   7.3 3D torus network
       7.3.1 Dimension order routing
       7.3.2 Impact of congestion
   7.4 Fat-tree network
       7.4.1 Destination-mod-k routing
       7.4.2 Impact of congestion
   7.5 How to alleviate congestion?
       7.5.1 Multi-round divisible load scheduling
       7.5.2 Round robin (RR) dimension order routing
   7.6 Conclusions

8. A HYBRID APPROACH TO DIVISIBLE LOAD SCHEDULING
   8.1 Processor equivalence principle
       8.1.1 Star network
       8.1.2 3D torus network
   8.2 Proof by induction for star networks
   8.3 Hybrid scheduling
   8.4 Simulation results and analysis
       8.4.1 Computation-bound application
       8.4.2 Intermediate application
       8.4.3 Communication-bound application
   8.5 Conclusions

9. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS
   9.1 Accomplishments
   9.2 Summary and lessons learned
   9.3 Future research directions

REFERENCES

LIST OF TABLES

2.1 Glossary of DLT notation
2.2 Glossary of DLS notation
4.1 Task characteristics
5.1 Design of experiments to study the scalability of DLT algorithms
5.2 Processor availability in numbers
6.1 Glossary of robustness notation
6.2 Variations: bounds, average values, and extrema
6.3 Glossary of DLT notation for 3D torus topology
6.4 Platform characteristics
6.5 Predicted values of load fractions and parallel execution time on a 3D torus topology
6.6 Design of experiments to evaluate the robustness of DLT algorithms
6.7 Robustness analysis
7.1 Design of experiments to study the impact of congestion
8.1 Design of experiments to evaluate hybrid scheduling
8.2 Time (in sec) to process a task on an equivalent processor - Tcp^eq

LIST OF FIGURES

3.1 Modular architecture of SimGrid
3.2 Actual latency as a function of expected latency
3.3 Actual bandwidth as a function of expected latency and expected bandwidth
3.4 Execution flow when the scheduler employs DLT algorithm
3.5 Execution flow when the scheduler employs DLS techniques
4.1 Sequential runtime of the real vs. the simulated EP NAS benchmark for different problem sizes
4.2 Sequential runtime of the real vs. the simulated IS NAS benchmark for different problem sizes
4.3 Illustration of a star topology with 4 compute nodes and 3 links
4.4 Illustration of a cluster with 4 compute nodes and 4 links
4.5 Illustration of a 4 × 4 × 4 3D torus with 64 compute nodes and 192 links
4.6 Illustration of a 2-level fat-tree network with 4 compute nodes
5.1 Predicted vs. simulated efficiency of parallel solution of two EP problem classes, D and E, on [256-8192] processors
5.2 Predicted efficiency plot for all EP problem classes on [256-8192] processors
5.3 Isoefficiency contours for all EP problem classes on [256-16384] processors
5.4 Predicted vs. simulated efficiency of parallel solution of IS problem class D, on [64-1024] processors
5.5 Predicted efficiency plot for all IS problem classes on [64-1024] processors
5.6 Isoefficiency contours for all IS problem classes on [64-1024] processors
5.7 Constant processor availability - uniform distribution
5.8 Constant processor availability - exponential distribution
5.9 Variable processor availability - uniform distribution
5.10 Variable processor availability - exponential distribution
5.11 Performance of the DLS algorithms with constant iterations execution times and constant processor availability (uniform distribution)
5.12 Performance of the DLS algorithms with constant iterations execution times and variable processor availability (uniform distribution)
5.13 Performance of the DLS algorithms with Gaussian iterations execution times and constant processor availability (uniform distribution)
5.14 Performance of the DLS algorithms with Gaussian iterations execution times and variable processor availability (uniform distribution)
5.15 Performance of the factoring based DLS algorithms with exponential iterations execution times and constant processor availability (exponential distribution)
5.16 Performance of the factoring based DLS algorithms with exponential iterations execution times and variable processor availability (exponential distribution)
5.17 Performance of the factoring based DLS algorithms with exponential iterations execution times and variable processor availability (exponential distribution)
6.1 Left skewed variation generated with sin(x^2) · cos(y^2)
6.2 Right skewed variation generated with cos(x^2) · cos(y^2)
6.3 Diagonal skewed variation generated with sin(x^2) · sin(y^2)
6.4 Non skewed variation generated with sin(x^2) · cos(y^2) and cos(x^2) · cos(y^2)
6.5 Predicted values of γRB for the pure computation application
6.6 Predicted values of γRB for the computation-bound application
6.7 Predicted values of γRB for the intermediate application
6.8 Predicted values of γRB for the communication-bound application
6.9 Predicted and simulation values of γR, γB, γRB for the pure computation application
6.10 Predicted and simulation values of γR, γB, γRB for the computation-bound application
6.11 Predicted and simulation values of γR, γB, γRB for the intermediate application
6.12 Predicted and simulation values of γR, γB, γRB for the communication-bound application
7.1 Performance degradation of DLT and AF algorithms due to congestion, on 3D torus networks for the computation-bound application
7.2 Performance degradation of DLT and AF algorithms due to congestion, on 3D torus networks for the intermediate application
7.3 Performance degradation of DLT and AF algorithms due to congestion, on 3D torus networks for the communication-bound application
7.4 Performance degradation of DLT and AF algorithms due to congestion, on fat-tree networks for the computation-bound application
7.5 Performance degradation of DLT and AF algorithms due to congestion, on fat-tree networks for the intermediate application
7.6 Performance degradation of DLT and AF algorithms due to congestion, on fat-tree networks for the communication-bound application
7.7 Bottleneck link in a fat-tree network
7.8 Comparative performance of single-round and multi-round DLT algorithms on 3D torus and fat-tree networks for the computation-bound application
7.9 Comparative performance of single-round and multi-round DLT algorithms on 3D torus and fat-tree networks for the intermediate application
7.10 Comparative performance of single-round and multi-round DLT algorithms on 3D torus and fat-tree networks for the communication-bound application
7.11 Comparative performance of RR dimension order and XYZ routing schemes on 3D torus networks for the computation-bound application
7.12 Comparative performance of RR dimension order and XYZ routing schemes on 3D torus networks for the intermediate application
7.13 Comparative performance of RR dimension order and XYZ routing schemes on 3D torus networks for the communication-bound application
8.1 A star network and the corresponding equivalent network
8.2 A schematic diagram illustrating hybrid scheduling
8.3 Performance of hybrid scheduling on a hybrid network comprised of star topology for the computation-bound application
8.4 Performance of hybrid scheduling on a hybrid network comprised of 3D torus topology for the computation-bound application
8.5 Performance of hybrid scheduling on a hybrid network comprised of star and 3D torus topologies for the computation-bound application
8.6 Performance of hybrid scheduling on a hybrid network comprised of star topology for the intermediate application
8.7 Performance of hybrid scheduling on a hybrid network comprised of 3D torus topology for the intermediate application
8.8 Performance of hybrid scheduling on a hybrid network comprised of star and 3D torus topologies for the intermediate application
8.9 Performance of hybrid scheduling on a hybrid network comprised of star topology for the communication-bound application
8.10 Performance of hybrid scheduling on a hybrid network comprised of 3D torus topology for the communication-bound application
8.11 Performance of hybrid scheduling on a hybrid network comprised of star and 3D torus topologies for the communication-bound application

NOMENCLATURE
α      Load fraction, an np-tuple
AF     Adaptive factoring
BoT    Bag of tasks
CCR    Communication-to-computation ratio (in bytes/FLOP)
DLT    Divisible load theory
DLS    Dynamic loop scheduling
FLOPs  Floating point operations
FLOPS  Floating point operations per second
np     Number of processors
nt     Number of divisible tasks
Tcm    Time required to transfer a task over a network link (in sec)
Tcp    Time required to compute a task on a processor (in sec)
Wt     Computational effort required to process a task (in FLOPs)
Zt     Communication effort associated with transferring a task (in bytes)

CHAPTER 1
INTRODUCTION

In this chapter, a short background of the research areas relevant to this study is
presented, along with the factors that motivate this research. The problem statement and
the objectives are then defined, followed by the organization of this document.

1.1 Scientific applications and their performance

High performance computing is a powerful platform that enables the solution of complex
scientific problems in areas such as climate modeling, particle physics, and biology,
which require enormous computational power. It facilitates discoveries and promotes
progress that would otherwise not be possible. Supercomputers and clusters are
examples of traditional high performance computing environments. Recently, grid
computing has emerged as a mainstream form of high performance computing. A grid platform
is a loosely coupled system and is highly distributed in nature. Although many commercial
applications make use of high performance computing systems, traditional scientific
applications remain among the most important users.
Scientific applications that are expected to exploit the capabilities of high performance
computing systems must contain a sufficient amount of parallelism. In many scientific
applications, such as N-body simulations, Monte Carlo simulations, and computational fluid
dynamics applications, loops without dependencies offer a rich source of parallelism. Such
loops are called data parallel loops, and their workloads often require performing similar
operations on different data objects simultaneously. Data parallel workloads that lack
precedence constraints are amenable to easy parallelization and are called arbitrarily
divisible workloads. Load imbalances that arise during the execution of arbitrarily
divisible workloads from a variety of sources, such as application, algorithmic, and
systemic characteristics, are a major cause of performance degradation. Scheduling of
arbitrarily divisible workloads to address load imbalances, in order to obtain better
utilization of computing resources, is a major area of research and is the focus of this
research work.
The scheduling problem in general can be defined as a 3-tuple: the system model, the
nature of the workload, and the objective function(s). In this 3-tuple, (1) the system
model defines various system related attributes, such as the number of processors and the
interconnection topology, (2) the nature of the workload relates to attributes such as
arbitrary divisibility, and (3) the objective function is commonly the minimization of the
overall run time of a parallel application, known as the makespan minimization problem.
The solution to the scheduling problem is a set of schedules, possibly empty or infinite,
that satisfy the objective function(s). In general, finding an optimal schedule is an
NP-complete problem [56] and solutions exist only under restrictive assumptions. The
general taxonomy of scheduling is vast, and an excellent classification of scheduling
algorithms is given in [31].
The divisible load theory (DLT) [21] offers a paradigm for scheduling arbitrarily
divisible workloads in parallel and distributed environments. In addition to the
computation and communication efforts associated with workloads, it also accounts for
processor and link speeds, and network topology. The scheduling model offered by DLT is
linear and deterministic in nature, and provides a tractable linear equation (or
recursive) solution to the makespan problem. A large number of theoretical results are
available for different networks, such as star networks, linear networks, bus networks,
and hypercube networks. The DLT is analogous to queuing theory, and provides features such
as equivalent elements and infinite size networks. One challenge of the DLT is the
incorporation of a rapid change in network or processing state into the system model.
Scheduling algorithms that employ DLT are referred to as DLT algorithms. DLT algorithms
have been employed in various applications, such as scientific computing, image
processing, and database processing.
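The linear, recursive character of a DLT solution can be illustrated with a minimal sketch. It assumes a single-round schedule on a star network in which the master sends load fractions to the workers one after another, performs no computation itself, and collects no results. The names `t_cp` and `t_cm` (per-unit-load compute and transfer times for each worker) and the sample values are hypothetical, chosen only for illustration; they are not the models analyzed in later chapters.

```python
def dlt_star_fractions(t_cp, t_cm):
    """Load fractions for a single-round star schedule (illustrative assumptions).

    Worker i finishes at sum(alpha[j] * t_cm[j] for j <= i) + alpha[i] * t_cp[i].
    Requiring consecutive workers to finish together gives the recursion
        alpha[i + 1] = alpha[i] * t_cp[i] / (t_cm[i + 1] + t_cp[i + 1]).
    """
    ratios = [1.0]
    for i in range(1, len(t_cp)):
        ratios.append(ratios[-1] * t_cp[i - 1] / (t_cm[i] + t_cp[i]))
    total = sum(ratios)
    # Normalize so that the fractions sum to one (the whole load is assigned).
    return [r / total for r in ratios]


def finish_times(alpha, t_cp, t_cm):
    """Finish time of each worker under sequential distribution by the master."""
    times, sent = [], 0.0
    for a, cp, cm in zip(alpha, t_cp, t_cm):
        sent += a * cm              # the worker waits for all earlier transfers
        times.append(sent + a * cp)
    return times


# Hypothetical platform: three workers with different speeds and links.
t_cp = [2.0, 3.0, 4.0]
t_cm = [0.5, 0.5, 1.0]
alpha = dlt_star_fractions(t_cp, t_cm)
times = finish_times(alpha, t_cp, t_cm)
print(alpha, times)
```

Under these assumptions all workers finish at the same instant, which is the condition the recursion was derived from.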
The dynamic loop scheduling (DLS) [12] algorithms are another commonly employed
methodology for scheduling the arbitrarily divisible workloads found in many scientific
applications. DLS algorithms are probabilistic in nature and are platform agnostic. They
do not explicitly model the network topology or the processor and link speeds; instead,
they use probabilistic analyses to derive the sizes of the chunks (collections of
independent tasks) to be executed by processors, such that the chunks complete execution
within the optimal time with high probability. The power of DLS algorithms lies in the
fact that they simultaneously address all sources of load imbalance that arise from
application, algorithmic, and systemic characteristics. DLS algorithms have traditionally
been applied to achieve load balancing of scientific applications, such as N-body
simulations and computational fluid dynamics applications.
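How chunk sizes shrink as a loop drains can be sketched with two classic DLS rules, guided self-scheduling and factoring. The sketch below uses their commonly cited simple forms: guided self-scheduling hands out ceil(R / P) of the R remaining iterations to the next idle processor, and factoring (with factor 2) splits half of the remaining iterations into P equal chunks per batch. The function names are illustrative, and the adaptive variants, which estimate iteration-time statistics at runtime, are not shown.

```python
import math


def gss_chunks(n_iterations, n_processors):
    """Guided self-scheduling: each request gets ceil(remaining / P) iterations."""
    chunks, remaining = [], n_iterations
    while remaining > 0:
        size = math.ceil(remaining / n_processors)
        chunks.append(size)
        remaining -= size
    return chunks


def factoring_chunks(n_iterations, n_processors):
    """Factoring with factor 2: each batch has P chunks of ceil(remaining / (2P))."""
    chunks, remaining = [], n_iterations
    while remaining > 0:
        size = math.ceil(remaining / (2 * n_processors))
        for _ in range(n_processors):
            if remaining == 0:
                break
            take = min(size, remaining)   # the last chunk may be smaller
            chunks.append(take)
            remaining -= take
    return chunks


gss = gss_chunks(100, 4)
fac = factoring_chunks(100, 4)
print(gss)
print(fac)
```

Both rules start with large chunks to keep scheduling overhead low and end with small chunks so that late-finishing processors can be balanced.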

A study of the scalability of these algorithms is an important aspect of performance
analysis. The scalability of an algorithm is defined as its ability to provide performance
proportional to resource usage. A scalability study can help answer questions such as how
an algorithm performs with increasing problem and system sizes, and how it compares with
other algorithms available for solving the same problem. A scalability study can also
provide more insight into the factors that impact the performance of an algorithm, thereby
helping to improve its quality. Scalable algorithms are needed because the sheer increase
in problem sizes can quickly overwhelm even the most powerful computing system. Although
the effectiveness of DLT and DLS algorithms has been demonstrated and reported in the
literature, their scalability with respect to larger problem and system sizes has only
been studied in a limited scope.
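Performance proportional to resource usage is commonly quantified through speedup and parallel efficiency. The sketch below assumes the standard definitions (speedup is serial time over parallel time, efficiency is speedup per processor); the timings are made-up values for illustration, not measurements from this study.

```python
def speedup(t_serial, t_parallel):
    """Ratio of serial run time to parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_processors):
    """Speedup per processor; 1.0 means performance scales with resources."""
    return speedup(t_serial, t_parallel) / n_processors


# Hypothetical timings (in seconds) keyed by processor count.
t_serial = 100.0
timings = {2: 52.0, 4: 30.0, 8: 20.0}
for p, t_p in sorted(timings.items()):
    print(p, round(speedup(t_serial, t_p), 2), round(efficiency(t_serial, t_p, p), 2))
```

In this made-up example efficiency falls as processors are added, which is the kind of trend a scalability study sets out to measure and explain.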
A study of the robustness of scheduling algorithms is another important aspect of
performance analysis. According to [47], a robust scheduling system (algorithm) guarantees
a certain level of performance despite fluctuations in the operating environment.
Traditional performance metrics in high performance computing, such as parallel runtime,
cost, and efficiency, provide valuable insights into the workings of a scheduling
algorithm, but are not sufficient to identify the degree of its robustness. For example,
they cannot answer questions such as how much the network bandwidth can decrease before
the performance of a scheduling algorithm drops below an acceptable level. Questions of
this sort can be answered by measuring the robustness of a scheduling algorithm. Although
the study of the robustness of scheduling algorithms is an active area of research, the
robustness of DLT algorithms has not yet been investigated. More recently, a robustness
analysis of DLS algorithms has been reported in [13], [71], and [72].

1.2 Motivation

Despite the large volume of work that applies DLT and DLS algorithms to the scheduling of
parallel applications, only a limited amount of work addresses the following topics, which
motivate this research:
1. Scalability analysis - scalability is an important attribute of an algorithm, especially
in the context of ever increasing problem and system sizes. Scalable algorithms
are needed because the sheer increase in problem sizes can quickly overwhelm even the
most powerful computing system. The scalability of DLT and DLS algorithms
for larger problem and system sizes has only been pursued in a limited scope.
2. Robustness investigation - a robustness study of an algorithm helps to understand
the effect of various perturbation parameters from the computing environment on the
performance of the algorithm. Such a study is useful for understanding the behavior of an
algorithm and can be applied to answer questions such as which
environmental conditions may cause an algorithm to fail to deliver a certain
level of quality of service. Investigations on the robustness of DLT algorithms have
not been pursued before.
3. Effect of system topology - DLT and DLS algorithms have traditionally been evaluated in a master-workers paradigm on platforms such as star, linear array, hypercube, cluster of workstations, and others. Modern system topologies, such as 3D
torus and fat-tree, may have an impact on the performance of DLT and DLS algorithms. A study is required to evaluate the suitability of using these algorithms on
modern platforms.
4. Integrated scheduling approach - variations in the system characteristics are not
accounted for in the scheduling model of DLT, and hence DLT algorithms do not perform well under
such conditions. An implication of this deficiency of DLT is that performance prediction becomes inaccurate, since the optimality principle of DLT - in an optimal load
distribution, all computing resources must finish their computation at the same time
- is violated. It is possible to model time-varying system characteristics in DLT, but
doing so may disrupt its linear and deterministic nature. Preserving this nature while
accounting for unpredictable changes in the system characteristics is desirable, but
is not possible using DLT alone. Hence, there is a need for an integrated scheduling
approach that combines DLT and DLS in a manner that addresses
this deficiency of DLT.

1.3 Problem statement

The performance of scientific applications running in parallel and distributed computing
environments is improved by an integrated scheduling approach that employs both
divisible load theory and dynamic loop scheduling algorithms, and it is superior to the
performance when using scheduling with divisible load theory in isolation.
The applicability of DLT algorithms in environments characterized by unpredictable
variations in the system characteristics can be improved by combining them with DLS
algorithms. The performance of the integrated approach is evaluated on different network
topologies using applications with different communication-to-computation ratios (CCR).
The range of environmental conditions for which the integrated approach is
useful is also determined.

1.4 Objectives

The objectives of this research are to:

1. study the related areas of research, namely, the divisible load theory algorithms, the
dynamic loop scheduling algorithms, and the robustness of scheduling algorithms
to frame an in-depth understanding of the various essential issues present in these
related areas of research,
2. analyze the scalability of DLT and DLS algorithms using larger problem and system
sizes - addresses motivation 1,
3. investigate the robustness of DLT algorithms with respect to variations in various
performance impacting factors - addresses motivation 2,
4. study the effect of topology on the performance of DLT and DLS algorithms - addresses motivation 3,
5. develop an integrated scheduling approach that utilizes both the DLT and DLS algorithms - addresses motivation 4,
6. analyze and evaluate the performance of the integrated scheduling approach - addresses motivation 4, and
7. draw some conclusions from the lessons learned and propose further research avenues from the experience gained.

1.5 Method of evaluation

Discrete event simulation, which is applied in various scientific domains, is also used
in this work for the following reasons. It 1) facilitates easy access to and consideration of computing platforms of different topologies, 2) offers flexibility in simulating different
problem characteristics, 3) allows repeatability and control over system characteristics at
runtime, which improves the confidence in the accuracy of the results obtained, 4) enables
the investigation of large problem and system sizes, and 5) reduces the time to perform
an experiment, which in general is a labor intensive process on a large scale computing
system.

1.6 Organization

The rest of this dissertation is organized as follows. In chapter 2, the related areas of
research, namely, the divisible load theory (DLT) algorithms, the dynamic loop scheduling
(DLS) algorithms, and the robustness of scheduling algorithms are surveyed in depth. In
chapter 3, various performance evaluation techniques and computing environments are
discussed, followed by the design of a simulation framework that is developed as a part
of this dissertation. In chapter 4, the modeling of the applications and the platforms used
in this study is described. In chapter 5, the scalability of DLT and DLS algorithms is
analyzed. In chapter 6, the robustness of DLT algorithms is investigated. In chapter 7, the
effect of network topology on the performance of DLT and DLS algorithms is evaluated.
In chapter 8, the design, analysis, and evaluation of an integrated scheduling approach are
presented. In chapter 9, the conclusions regarding the accomplishments along with the
perspectives on the possible future directions are outlined.

CHAPTER 2
BACKGROUND AND RELATED WORK

In this chapter, we present an in-depth survey of the related areas of research, namely
the scheduling of arbitrarily divisible workloads and the robustness of scheduling algorithms. We start with an important class of scientific workload, namely the data parallel
workloads that lack precedence constraints, commonly referred to as arbitrarily divisible
workloads. We then discuss the two commonly employed algorithmic approaches used in
the scheduling of arbitrarily divisible workloads, namely the divisible load theory (DLT)
[21] and the dynamic loop scheduling (DLS) [12] algorithms. We conclude with the robustness of scheduling algorithms.

2.1 Data parallel workloads

In high performance computing (HPC), data parallel workloads often demand identical
processing of their data elements and are present in a wide range of scientific applications,
such as N-body simulations, Monte Carlo simulations, CFD applications, image processing, pattern recognition, and others. Such workloads can be split into smaller fractions
and processed simultaneously, and thus are amenable to easy parallelization. However,
the degree of decomposition of a workload is determined by its divisibility nature. Arbitrarily divisible workloads lack precedence constraints, and hence can be partitioned into
arbitrarily small fractions that can be processed simultaneously. Scheduling of arbitrarily divisible workloads to achieve load balancing is an active area of research in high
performance parallel and distributed scientific computing. The objective of scheduling is
to effectively order and distribute the computations of the workload among the available
resources in order to achieve certain performance goal(s), such as minimizing execution
time, minimizing communication delays, maximizing resource utilization, and others [31].
A commonly studied scheduling problem is the makespan minimization problem, where
the objective is to schedule tasks such that their overall completion time is minimized. Divisible load theory (DLT) and dynamic loop scheduling (DLS) algorithms are two widely
employed algorithmic approaches in the study of scheduling of arbitrarily divisible workloads.

2.2 Scheduling using divisible load theory (DLT)

The divisible load theory was developed with the intent of providing the theoretical
tools needed for performance prediction of scheduling of workloads executing on various
system topologies. The system model consists of processing nodes (or processors) and
communication links, and is linear in nature. The optimality principle [21] forms the basis for divisible load scheduling. An optimal load distribution is obtained by solving a
set of linear (or recursive) equations, in which the unknowns represent the load fractions
assigned to each processor. However, it must be noted that the optimal schedule occurs
within the context of the system topology and hence the approach is not platform agnostic. The optimality principle guides the scheduling decisions in both the single and the
multi-installment strategies. In the single-installment strategy, the load is distributed in
one installment and in the multi-installment strategy, the load is distributed in multiple
installments. For the multi-installment strategy, the theory does not provide the means
for identifying the appropriate number of installments. In both the single and the multi-installment strategies, the scheduling decisions are purely based on the processing speed
of the processors and the network communication delays, while no runtime characteristics
are considered.
Divisible load theory provides a simple and elegant mathematical framework for studying and analyzing divisible workloads that lack precedence constraints. The theory was
primarily developed for designing and analyzing load distribution strategies for arbitrarily
divisible workloads, and studying the trade-off relationship between communication and
computation. Some of the salient features and results of the theory are as follows:

1. Presenting a linear model - the system model for processors and communication
links in a network uses deterministic quantities and is linear. Similar to other linear
models, such as electric circuit theory, queuing theory, and others, this yields a
flexible and tractable analytical tool.
2. Deriving performance bounds - it is possible to obtain ultimate performance bounds
on linear and single-level tree networks using the processor equivalence concept.
3. Applying the optimality principle - load distribution is based on this principle,
which states that all processors must finish their computation at the same time.

4. Determining an optimal sequence - on single-level tree networks, the performance
depends on the sequence of load distribution among the processors.
5. Employing load distribution strategies - both single and multi-installment schemes
are supported.

The tractable nature of the DLT model and its applicability to a wide range of interconnection topologies, such as bus networks, tree networks, hypercube networks, and 2-dimensional and 3-dimensional meshes of interconnected processors, are cited as some of the
advantages of using the model [58].

2.2.1 An illustration

Consider a simple, three-node star network with the root processor $p_0$ distributing
the workload to the leaf processors $p_1$ and $p_2$ simultaneously using a non-blocking mode of
communication. The links $l_1$ and $l_2$ connect the root processor $p_0$ with the child processors
$p_1$ and $p_2$, respectively. Let $\alpha_1$ and $\alpha_2$ be the load fractions (number of tasks) assigned to
processors $p_1$ and $p_2$, respectively. Using a linear cost model for computation and an affine
cost model for communication, the following linear equations can be formulated:
\[ T_1 = L_1 + \alpha_1 \, T^{cm}_1 + \alpha_1 \, T^{cp}_1 \tag{2.1} \]

\[ T_2 = L_2 + \alpha_2 \, T^{cm}_2 + \alpha_2 \, T^{cp}_2 \tag{2.2} \]

\[ \alpha_1 + \alpha_2 = n_t \tag{2.3} \]

Table 2.1
Glossary of DLT notation

Notation                  Explanation
$w_t$                     number of operations required to process a task (FLOPs)
$z_t$                     size of a task (bytes)
$n_t$, $n_p$, and $n_l$   number of arbitrarily divisible tasks, processors, and communication links, respectively
$i$, $j$                  indices, $0 \le i < n_p$ and $0 \le j < n_l$
$p_i$                     processor $i$
$R_i$                     rating of processor $p_i$ (FLOPS)
$T^{cp}_i$                time to compute a task on processor $p_i$ - ratio of $w_t$ to $R_i$ (seconds)
$\alpha_i$                load fraction of processor $p_i$
$T_i$                     execution time of processor $p_i$ (seconds)
$L_j$                     latency of link $l_j$ (seconds)
$B_j$                     bandwidth of link $l_j$ (bytes/second)
$L_{kl}$                  end-to-end latency between processors $p_k$ and $p_l$
$B_{kl}$                  end-to-end bandwidth between processors $p_k$ and $p_l$
$T^{cm}_j$                time to transfer a task using the link $l_j$ - ratio of $z_t$ to $B_j$ (seconds)
$T_{par}$                 parallel execution time (seconds)
$\alpha$                  load distribution: an $n_p$-ordered tuple $(\alpha_0, \alpha_1, \cdots, \alpha_{n_p - 1})$

Equations (2.1) and (2.2) represent the finish times of processors $p_1$ and $p_2$, respectively.
Equation (2.3) states that the sum of the tasks processed by processors $p_1$ and $p_2$ is equal
to the total number of tasks. The finish time includes both the computation and the communication costs associated with processing the workload. DLT partitions the workload
according to the optimality principle, which states that for optimal load allocation, processors $p_1$ and $p_2$ must finish their computations at the same instant, such that $T_1 = T_2 = T_{par}$.
The three linear equations can be solved for the three unknowns ($\alpha_1$, $\alpha_2$, and $T_{par}$). In general, there will be $(n_p + 1)$ linear equations, where $n_p$ represents
the number of processors in the system. In the next subsections, we present a survey of the
progress in the field of DLT based on the nature of the research.
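
The three-node illustration of Equations (2.1) to (2.3) can be worked through numerically. The following is a minimal sketch, not an implementation taken from the cited literature: it assumes that every worker is fed concurrently by the root (as in the non-blocking example), that every worker receives a non-negative load, and it uses hypothetical names (`solve_star`, `latency`, `t_cm`, `t_cp`) and hypothetical input values. Setting all finish times equal to the parallel time and substituting into the load-sum equation gives a closed form for the parallel time.

```python
def solve_star(latency, t_cm, t_cp, n_tasks):
    """Equal-finish-time load split for workers fed concurrently by the root.

    Worker i finishes at T_i = latency[i] + alpha_i * (t_cm[i] + t_cp[i]).
    Setting every T_i equal to T_par and requiring sum(alpha_i) == n_tasks
    gives a closed form for T_par, from which each alpha_i follows.
    """
    # Per-task cost (transfer plus compute) seen by each worker.
    cost = [cm + cp for cm, cp in zip(t_cm, t_cp)]
    # From alpha_i = (T_par - L_i) / cost_i and sum(alpha_i) = n_tasks.
    t_par = (n_tasks + sum(l / c for l, c in zip(latency, cost))) / sum(
        1.0 / c for c in cost
    )
    alpha = [(t_par - l) / c for l, c in zip(latency, cost)]
    if min(alpha) < 0:
        # The equal-finish-time assumption breaks if a link latency is too large.
        raise ValueError("a worker would receive a negative load")
    return alpha, t_par


# Hypothetical values: 6 tasks, zero latency, worker 2 computes three times slower.
alpha, t_par = solve_star(
    latency=[0.0, 0.0], t_cm=[1.0, 1.0], t_cp=[1.0, 3.0], n_tasks=6
)
print(alpha, t_par)
```

With these inputs the faster worker receives twice the load of the slower one, and both finish at the same instant, which is the behavior the optimality principle requires.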

2.2.2 Network topologies

A closed-form expression for the overall runtime of the problem on a bus network is
provided in [23]. This was the first work to consider processor release time, which is defined as the time between the instant when the workload arrives on the system and the instant when
a processor becomes available for processing the workload. It was assumed that processor
release times are identical, and that computation and communication can be overlapped. An
algorithm (called the scatter algorithm) for scheduling divisible loads on a 3-dimensional
mesh of processors in a message passing environment with circuit-switched routing is presented in [35]. The scatter algorithm works by activating $(p + 1)$ processors in each step,
where $p$ represents the number of communication ports. The maximum number of moves
in a system of $n$ processors is given by $\log_{p+1} n$. The scatter algorithm can handle up to 5
communication ports. A closed-form expression for the load fractions is also presented.
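
The logarithmic bound on the number of moves can be illustrated with a short calculation. This is only a sketch under an assumed reading of the description above, namely that the number of processors holding load grows by a factor of (p + 1) in each step, and that the count is rounded up when n is not an exact power of (p + 1). The function name `scatter_steps` is hypothetical and the routine does not model the mesh routing of [35].

```python
def scatter_steps(n_processors, ports):
    """Count steps until n_processors hold load, assuming the set of
    load-holding processors grows by a factor of (ports + 1) per step."""
    if n_processors < 1 or ports < 1:
        raise ValueError("need at least one processor and one port")
    active, steps = 1, 0
    while active < n_processors:
        active *= ports + 1  # growth factor assumed from the description
        steps += 1
    return steps


# Hypothetical system sizes.
print(scatter_steps(64, 3))   # 64 is an exact power of 4
print(scatter_steps(100, 3))  # not an exact power, so the count is rounded up
```

An integer loop is used instead of a floating-point logarithm so that exact powers of (p + 1) are not affected by rounding error.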

The problem of divisible load scheduling on single-level tree networks with buffer
constraints was studied in [52]. The authors showed that sequencing - the order in which
the workload is distributed - has an impact on the overall performance of the system. In
an earlier work [23], it was shown that sequencing does not have an impact on the overall
system performance in the case of infinite buffer capacity. A more realistic model of communication and computation for bus networks was proposed in [76], where other forms of
delays, such as protocol processing and de-packing of data, were included in the system
model. It was shown that, in the presence of these delays, sequencing has an impact on the
overall system performance.
A product form solution for scheduling of divisible loads on a multi-level tree network
is presented for the first time in [60]. The basic strategy is to traverse the multi-level tree in
a bottom-up fashion to estimate the load fractions at the root of each sub-tree, and use this
information to estimate the load fractions at the leaf level nodes and all the interior nodes.
It is assumed that the load originates at the root of the multi-level tree, and all the nodes in
the tree participate in the processing of the workload. The authors credit the linear nature
of the DLT model for the product form solution.
The problem of scheduling multiple divisible tasks on bus networks was reported in
[43]. All divisible tasks are assumed to be located in a central processor known as the
control processor. The control processor is responsible for scheduling the tasks and
does not take part in the processing of the tasks. Since the release time of a processor
is taken into account, the DLT linear equations involve a max function which increases
the complexity of solving the linear equations. The authors provide a novel algorithm
for solving the DLT equations. The algorithm works by assuming the release times of
processors as zero, and then iteratively modifying the initial solution to include the release
times of the processors such that the makespan is minimized.

2.2.3 Multi-source scheduling

The problem of divisible load scheduling with load originating from multiple sources
was first studied in [79]. The system model consists of N sources and M sinks, with a
direct link connecting each source with all the sinks. A centralized scheduler resides in one
of the sources and gathers information from the other sources regarding the size of the
loads. The scheduler then calculates the load fractions and notifies each source to send an
optimal amount of load to every sink. A variation of the scheduling algorithm to handle buffer
capacity constraints at the sinks is also presented. Multi-source scheduling of divisible
loads using the minimum cost and multi-commodity flow formulations is studied in [59].
In the minimum cost flow problem formulation, the cost of a link is proportional to the
amount of flow through that link. The load can originate at any node (called the source)
and can be processed at any node (called the sink) in the flow network. The objective is
to minimize the cost of the flow in the network. The multi-commodity flow formulation is
similar to the minimum cost flow formulation, but allows specific classes of workloads to
be processed at specific processors. Since DLT is a linear theory, the authors also discuss
the application of the superposition principle on a linear daisy chain with two load sources, one at
each end of the network.

The problem of scheduling tasks in a multi-source, multi-sink environment is also studied in [78]. A cluster is modeled as a set of source nodes, a set of sink nodes, and a coordinator node (CN) which is responsible for the scheduling process. The CN obtains the
workload information along with its deadline requirements from the source nodes, and the
buffer capacity and the processing speed from the sink nodes. Scheduling of workloads is
done in multiple iterations, and within each iteration, only the smallest fraction of the
workload that can be processed by any single sink node is given to all sink nodes. This ensures that in any given iteration, the processing of the workload finishes at the same time. It
is possible to include new sources into the system at any time. The CN also runs an admission control policy, whereby a new source is added to the system only if its workload can
be processed within its deadline requirements based on the buffer availability at the sink
nodes. This work uses a pull-based strategy and differs from [79], which uses a push-based
strategy.
Multi-source divisible load scheduling in a single-level tree network is studied in [53].
The root node of the network acts as a scheduler and does not participate in computation,
and the child nodes are the source of the workload. The load distribution strategy works
by partitioning the child nodes of the network into two sets, namely, the sender set and the
receiver set. The sender set consists of all nodes that have more than the ideal workload
quantity, and the receiver set consists of all nodes that have less than the ideal workload quantity,
such that the overall processing time is minimized. The ideal workload quantity required
for optimal processing time is obtained by dividing the overall workload by the overall
processing power. The root node gathers the excess workload from the sender set and
distributes it to the receiver set.
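
The partitioning step can be sketched as follows. This is an illustrative reading, not the algorithm of [53]: it assumes each child node has a known processing speed in load units per unit time, interprets the ideal quantity of a node as its speed multiplied by the ratio of overall workload to overall processing power, and ignores communication cost. The name `partition_nodes` and the input values are hypothetical.

```python
def partition_nodes(loads, speeds):
    """Split child nodes into senders and receivers around their ideal load."""
    # Common finish time if the total load were spread in proportion to speed.
    ideal_time = sum(loads) / sum(speeds)
    senders, receivers = {}, {}
    for node, (load, speed) in enumerate(zip(loads, speeds)):
        surplus = load - speed * ideal_time
        if surplus > 0:
            senders[node] = surplus      # amount the root gathers from this node
        elif surplus < 0:
            receivers[node] = -surplus   # amount the root distributes to this node
    return ideal_time, senders, receivers


# Hypothetical cluster: node 0 starts with most of the load, node 1 is fastest.
ideal_time, senders, receivers = partition_nodes(
    loads=[10.0, 2.0, 0.0], speeds=[1.0, 2.0, 1.0]
)
print(ideal_time, senders, receivers)
```

The total surplus gathered from the sender set equals the total deficit of the receiver set, so the root only relays load and never holds any at the end.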

2.2.4 Multi-round scheduling

Multi-round scheduling algorithms for divisible loads are presented in [80]. A master-slave algorithm called the Uniform Multi-Round (UMR) algorithm is
proposed for load distribution. It is assumed that the cost of computation and communication follows an affine
cost model, in which computation and communication each have a fixed start-up cost in addition to the cost of processing (or communicating) the tasks. The master has a single port
for communication, and hence the communication from the master to the slave processors
proceeds in a sequential fashion. For homogeneous platforms, the chunk size per round is
kept fixed, while for heterogeneous platforms, the time required to compute the chunks is
kept fixed. In both cases, there is a constrained optimization problem, which was solved
using the Lagrange multiplier method.
A statistical approach to the study of scheduling of divisible loads is presented in [39].
The approach, called the task farm approach, is a master-slave approach that distributes
tasks to the workers based on their fitness index F, which estimates the task size on a per-node basis. The fitness index F is periodically updated to include the latest performance of
a node. Based on the timings obtained using the probing technique, the authors provide a
formula to estimate the number of installments to be used, which remains fixed.
A method for scheduling divisible workloads on a heterogeneous two-level tree network using a multi-installment strategy is proposed in [64]. The proposed
approach is based on the optimality principle. It attempts to find the
workload fraction assigned to each branch of the tree such that all branches finish at the same
instant. The solution, which includes the number of installments and the load fraction of the
processors in each installment, is obtained by solving a quartic equation. It is assumed
that a blocking mode of communication is used.

2.2.5 Result collection

In traditional DLT, the problem of scheduling the return of results to the originating processor (result collection) is not addressed. The problem of divisible load scheduling with
result collection in a star interconnection network is investigated in [38]. The processors in
the network are arranged in the order of increasing communication time. The load distribution algorithm works by identifying the two processors with the fastest communication links
and, through the application of the processor equivalence principle, replacing them with an
equivalent processor (step-a). If the resulting schedule is optimal (all processors finish their
computations at the same time), then the algorithm exits. Otherwise, the algorithm repeats step-a until an optimal schedule is found or all the processors are replaced with
equivalent processors. The algorithm assumes that there is only one port for communication,
and that communication and computation cannot occur simultaneously.
The problem of returning results to the originating node in the context of homogeneous nodes connected via heterogeneous links was studied in [18]. Two protocols for
returning results, namely, the LIFO (Last In First Out) and the FIFO (First In First Out)
protocols, were analyzed. It was observed that neither protocol dominates the other, but the
FIFO protocol was found to be approximately optimal. A schedule generated by the FIFO
protocol was within predictable bounds of the optimal schedule. In the general case, the
problem of finding an optimal schedule is an open problem.

2.2.6 Applications

The use of DLT on real world problems such as pattern matching, file compression,
database join operations, and graph coloring on clusters of workstations was reported in [37]. In
some cases, a difference of up to ≈ 30% between the model and the experimental results
was observed. The large difference was attributed to other factors such as accessing disk
files, message sizes greater than the available memory size, and others. In other cases,
the difference was less than ≈ 10%. In general, when the computing platform was more
uniform and dedicated, the divisible task model was found to be more accurate.
A survey of research in the field of DLT is described in [22]. The load distribution
model of DLT was explained for networks such as linear array, tree, and bus networks.
Application of DLT to problems such as large matrix-vector products, efficient movie
retrieval for network-based multimedia systems, and others is also discussed. The use
of DLT in a grid computing environment is reported in [81]. A high energy and nuclear
physics problem, namely, the STAR detector (solenoidal tracker at relativistic heavy ion
collider) was considered. DLT was applied in the event reconstruction phase, where the
input data generated by real experiment or simulation, was converted into a list of particles
that possess certain properties such as momentum, electric charge, and others.

MapReduce [34] computations are studied as a divisible load model in [19]. MapReduce is a programming model for processing large data sets on a large number
of computers [34]. MapReduce computations consist of two phases, namely mapping and
reducing. In the mapping phase, a map function processes the input data set and converts
it into intermediate results in the form of (key,value) pairs. In the reducing phase, a
reduce function processes the intermediate results, merges the values that share the same key, and
produces results in the form of (key,value) pairs. Thus the reducing phase is dependent on the
mapping phase. An example of a problem that can be modeled as a MapReduce computation is counting the occurrences of words in a large data set [34]. In [19], a heuristic
for load partitioning for MapReduce computations (m map tasks and r reduce tasks) is
presented. The basic idea is to partition the load such that when processor $p_i$ finishes its
portion of the map computation and a reducer reads that result, the next processor $p_{i+1}$ finishes its portion of the map computation. It was observed that the reduce operation could
become a bottleneck in the performance of MapReduce computations.
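
The word-counting example mentioned above can be sketched in a few lines to show the dependence of the reducing phase on the mapping phase. This is a single-process illustration only: the function names `map_words` and `reduce_counts` and the input chunks are hypothetical, and no actual MapReduce framework or the load partitioning heuristic of [19] is modeled.

```python
from collections import defaultdict


def map_words(chunk):
    """Mapping phase: emit one (key, value) pair per word in a chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_counts(pairs):
    """Reducing phase: merge the values that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Each chunk stands for the portion of the input assigned to one map task.
chunks = ["to be or not", "to be"]
intermediate = [pair for chunk in chunks for pair in map_words(chunk)]
counts = reduce_counts(intermediate)
print(counts)
```

The reducer cannot start on a chunk before the corresponding map task has produced its pairs, which is why the partitioning in [19] staggers the map finish times.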

2.2.7 Linear programming

A different formulation of divisible load scheduling was proposed in [17], wherein
the scheduling problem was formulated as a linear programming problem, with the objective
of maximizing the throughput (number of tasks processed per unit time) of the steady state
computational phase. The computing and the communication resources were modeled as
a tree where the weight of each node represents the speed of the computing resource, and
the weight of a link represents the communication cost. A bandwidth-centric allocation
strategy - a workload distribution strategy that allocates workload to computing resources
in the order of increasing communication time - was proposed as an optimal solution. The
proposed allocation strategy however requires a global knowledge of the weights of all the
nodes and the links, which is not always practically feasible. The tree is static in nature
and cannot grow dynamically if new resources become available. These deficiencies were later
addressed in [50], where the authors propose an autonomous protocol wherein each node
makes scheduling decisions based on locally available information. The authors claim
that the autonomous nature of the protocol promotes scalability, and hence that it is possible to
dynamically grow the tree model. Two versions of the autonomous protocol were proposed,
namely, with interruptible and with non-interruptible communication. In the interruptible communication model, a communication between a parent and a child node can be interrupted
by a higher priority child node, whereas in the non-interruptible communication model,
such an interruption is not possible. The interruptible communication model was found to be
better because, in the non-interruptible version, the number of buffers needed at each node
can be large and can yield low performance in the case of trees with higher communication-to-computation ratios.
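
The bandwidth-centric idea of serving resources in order of increasing communication time can be sketched for the simplest case of one master with directly attached children. The model below is an assumption made for illustration and is not taken from [17]: the master sends to one child at a time, does not compute itself, and each child is limited by its own compute rate. The name `bandwidth_centric` and all input values are hypothetical.

```python
def bandwidth_centric(comm_time, comp_time):
    """Steady-state task rates for the children of a single master.

    comm_time[i]: time the master's port is busy sending one task to child i.
    comp_time[i]: time child i needs to compute one task.
    """
    rates = [0.0] * len(comm_time)
    port_left = 1.0  # fraction of each time unit the master's port is still free
    # Serve children in order of increasing communication time.
    for i in sorted(range(len(comm_time)), key=lambda k: comm_time[k]):
        if port_left <= 0:
            break
        # A child absorbs at most 1/comp_time[i] tasks per time unit, and the
        # master can send it at most port_left/comm_time[i] tasks per time unit.
        rate = min(1.0 / comp_time[i], port_left / comm_time[i])
        rates[i] = rate
        port_left -= rate * comm_time[i]
    return rates, sum(rates)


# Hypothetical platform with three children.
rates, throughput = bandwidth_centric(
    comm_time=[1.0, 2.0, 4.0], comp_time=[2.0, 8.0, 4.0]
)
print(rates, throughput)
```

In this example the two children with the cheapest links are kept fully busy and the third receives only what is left of the master's port, regardless of how fast it computes. The sketch also shows why global knowledge is needed: the sort requires the communication and computation weights of every child.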
The complexity of scheduling divisible loads on heterogeneous star-shaped platforms
with limited memory was studied in [16]. It was shown that, with the affine
cost model, both single and multi-round divisible load scheduling are NP-Complete
when memory is a constraint. Scheduling divisible workloads using the master-slave paradigm
on heterogeneous platforms with a bounded multi-port model was studied in [15]. In the
multi-port model, the master node can simultaneously communicate with the slave nodes
as long as the outbound bandwidth does not exceed the maximum limit. Through a numerical
example, it was shown that the optimal sequencing result is valid only for the
one-port model and does not carry over to the multi-port model. It was also shown that when all the processors take part in the computation
and in the absence of start-up costs for both computation and communication,
scheduling under the bounded multi-port model is an NP-Complete problem.

2.2.8 Others

The use of a strategyproof mechanism to augment DLT on a bus network using a linear
cost model for processors was first reported in [41]. The strategyproof mechanism is a
game theory concept, where the involved agents (such as the processors in the network)
have some privately known information, such as the computing power of the processor. The
agents are encouraged to report truthful information by providing them with incentives to
do so. Hence, the agents will achieve maximum profit only if they report truthfully. The
goal is to design a strategyproof mechanism that optimizes an objective function (such as
makespan minimization), which involves identifying an allocation algorithm and a payment
scheme.
While in [41], DLT was modeled using non-cooperative game theory, in [28], it was
modeled using cooperative game theory. The central idea is to maximize the value of the
coalition, which is defined as the payoff for processing a job minus the processing cost
of the job. The payoffs were shared among the participants using the Shapley value. The
workload allocation is done by the master processor, which does not participate in the computation.
A single port model was assumed, where the master processor can communicate
with only one other processor at any given time.
The use of isoefficiency maps as a visualization technique for the study of divisible
loads is presented in [36]. On a homogeneous single-level tree network, it can be shown
that efficiency is a function of 5 parameters, namely, the number of processors used (m),
the problem size (V), the speed of the processor (A), the speed of the interconnection network (C),
and the constant start-up time (S) in the affine cost model for both computation and communication. Two-dimensional projections of the isoefficiency surfaces for various values
of A, C, and S are given. One use of these maps is to separate feasible
combinations of parameters from the infeasible ones.
A method for scheduling divisible workloads in a non-dedicated heterogeneous single-level
tree network is proposed in [65]. The following quantities are assumed to be
known in advance and not to vary during the course of the computation: (1) the probability of
failure in the network links, (2) the time required to repair a network link failure, (3) the probability
of failure in the processing elements, and (4) the time required to repair a processing element
failure. Starting from the standard DLT equations based on the optimality principle, the delay that
can be introduced by a failure in a processor or a network link is accounted for.
A blocking mode of communication between the processing elements is assumed.
The problem of scheduling second-order workloads is studied in [74]. The processing
time of such workloads is a non-linear function (of order n) of the workload size. Hence, this
type of workload results in a set of non-linear equations. In general, solving the equations
for any n is a computationally expensive process. Hence, only second-order workloads
(n=2) were considered. By assuming that the ratio of communication to computation time is
very small, and considering only the first-order approximation of a Taylor series expansion,
a closed-form expression for the load fraction was derived. It was also shown that the
conditions for optimal sequencing and optimal arrangements for second-order workloads
are the same as those for first-order workloads. This type of workload is encountered in
aerospace applications and pattern recognition problems [74].
An iterative algorithm for scheduling divisible workloads in a grid environment is presented in [3]. Starting with an initial load distribution, the algorithm works by iteratively
fine-tuning the solution until the optimality criterion is met. In any iteration, if the completion time of any node in
the network exceeds the average completion time of all the nodes, then the
workload is redistributed.

2.3

Scheduling using dynamic loop scheduling (DLS) algorithms

Loops without data dependency offer a rich source of parallelism in many scientific
applications, such as N-body simulations, Monte Carlo simulations, and others. Loop
scheduling algorithms were developed to effectively schedule such loops with the goal of
reducing the load imbalance that can be caused by various application, algorithmic, and
systemic variations. Loop scheduling algorithms are based on probabilistic analyses, and
the chunk size (the number of loop iterations in a chunk) is estimated such that the chunk has a high probability
of finishing within the optimal time. The simplistic strategy (static chunking) of allocating
N/P tasks to each processor (N tasks divided evenly among P processors) is non-optimal when loop iterations have variable execution
times or if the processors are heterogeneous, since this may result in uneven processor
finishing times. Another simplistic strategy is self-scheduling, where iterations are assigned
to processors one by one. This strategy results in even processor finishing times, but the
good load balancing achieved comes at the cost of high scheduling overhead. A glossary of
DLS notation is provided in Table 2.2.
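
The trade-off between the two extremes can be illustrated with a small sketch. It is not taken from the cited works: it assumes a hypothetical central queue in which the next chunk always goes to the processor that becomes idle first, and a constant overhead per scheduling operation. The function name `makespan` and the iteration times are invented for illustration.

```python
import heapq

def makespan(iter_times, n_procs, chunk_sizes, overhead):
    """Finishing time of the last processor when chunks are handed out in order.

    Assumption (illustrative only): each chunk goes to the processor that
    becomes idle first, and every scheduling operation costs `overhead`.
    """
    idle_at = [0.0] * n_procs          # time at which each processor becomes idle
    heapq.heapify(idle_at)
    start = 0
    for size in chunk_sizes:
        t = heapq.heappop(idle_at)     # earliest idle processor
        work = sum(iter_times[start:start + size])
        start += size
        heapq.heappush(idle_at, t + overhead + work)
    return max(idle_at)

# Eight iterations whose cost grows towards the end of the loop, two processors.
times = [1, 1, 1, 1, 9, 9, 9, 9]
static = makespan(times, 2, [4, 4], overhead=0.5)        # static chunking: N/P each
self_sched = makespan(times, 2, [1] * 8, overhead=0.5)   # self-scheduling: one by one
print(static, self_sched)
```

In this example static chunking pays the overhead only twice but leaves one processor with all the expensive iterations (makespan 36.5), whereas self-scheduling pays the overhead eight times and still finishes earlier (makespan 22.0).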
Table 2.2
Glossary of DLS notation

Notation    Explanation
h           scheduling overhead
R           number of remaining tasks
w_i         weight of processor p_i
µ_i         mean iteration execution time on processor p_i
σ_i         standard deviation of iteration execution time on processor p_i

2.3.1

Fixed size chunking

Kruskal and Weiss argued that the optimal allocation strategy lies in between the two
extremes (static chunking and self-scheduling), and proposed fixed-size chunking (FSC)
[51]. The system model considered is a queuing theory model, where the queue consists of
n_t tasks and n_p processors to service them. Each task has a service time that follows a certain
cumulative distribution function. The processors select a fixed number of tasks from the
queue, which incurs a certain overhead h. Using this model, the formula for estimating the
fixed chunk size, which is a function of n_t, n_p, h, and σ, is given by Equation (2.4).
\[ \text{FSC chunk size} = \left( \frac{\sqrt{2} \cdot n_t \cdot h}{\sigma \cdot n_p \cdot \sqrt{\log n_p}} \right)^{2/3} \tag{2.4} \]

Fixed-size chunking works well when the task service times follow a certain distribution,
such as uniform, normal, or exponential, and when h is constant.
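
As a rough illustration, Equation (2.4) can be evaluated as in the sketch below. The function name `fsc_chunk_size`, the use of the natural logarithm, the rounding up to a whole iteration, and the clamping to the range [1, n_t] are assumptions made here for illustration and are not stated in the text.

```python
import math

def fsc_chunk_size(n_tasks, n_procs, overhead, sigma):
    """Fixed chunk size following the form of Equation (2.4).

    Assumptions: natural logarithm, result rounded up and clamped to [1, n_tasks].
    """
    if n_procs < 2:
        raise ValueError("log(n_procs) must be positive, so n_procs >= 2")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    numerator = math.sqrt(2.0) * n_tasks * overhead
    denominator = sigma * n_procs * math.sqrt(math.log(n_procs))
    raw = (numerator / denominator) ** (2.0 / 3.0)
    return max(1, min(n_tasks, math.ceil(raw)))

# Hypothetical values: 10,000 tasks on 8 processors.
base = fsc_chunk_size(10000, 8, overhead=0.5, sigma=1.0)
cheaper_scheduling = fsc_chunk_size(10000, 8, overhead=0.1, sigma=1.0)
noisier_tasks = fsc_chunk_size(10000, 8, overhead=0.5, sigma=4.0)
print(base, cheaper_scheduling, noisier_tasks)
```

The sketch reflects the behavior implied by the formula: a larger overhead h favors larger chunks, while a larger σ favors smaller chunks.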

2.3.2

Guided self-scheduling

Guided self-scheduling (GSS) [57] was proposed to augment self-scheduling and reduce the scheduling overhead by guiding the amount of work allocated to the processors.
The fundamental principle by which chunks are allocated to a processor is to consider that
the remaining processors will also be scheduled at the same time. GSS schedules ⌈R/n_p⌉
tasks during each scheduling operation. Though this technique achieves an optimal allocation in certain cases, it does not work well when most of the work is concentrated in the first iterations,
because the first few scheduling operations then assign the bulk of the workload,
thereby overloading the respective processors. The formula to calculate the GSS chunk size is
given by Equation (2.5).
\[ \text{GSS chunk size} = \lceil R / n_p \rceil \tag{2.5} \]

2.3.3

Factoring

Factoring (FAC) [45], like GSS, is a decreasing-size chunk scheme and was proposed
to handle variable-length iterations. Factoring schedules chunks in batches of n_p equal-sized
chunks, and the number of iterations per batch is always a fixed factor of the remaining iterations. The batch size is estimated such that the batch has a high probability of finishing
within the optimal time. This rule implies that a batch size should be at most half of the
remaining iterations. The batch size is a function of n_p, R, µ, and σ. In practice, it may
not always be feasible to estimate µ and σ; thus, the batch size is set to half of the
remaining iterations. The formulas to calculate the FAC batch size and the FAC chunk size are given
by Equation (2.6) and Equation (2.7), respectively.

\[ \text{FAC batch size} = 0.5 \cdot R \tag{2.6} \]

\[ \text{FAC chunk size} = \lceil \text{FAC batch size} / n_p \rceil \tag{2.7} \]

2.3.4

Factoring variants

Weighted Factoring (WF) [44] is a variant of Factoring and was specifically designed
for heterogeneous environments. The batch size is estimated according to Factoring rules,
while the processors are assigned chunk sizes proportional to their processing speed. Adaptive Weighted Factoring (AWF) [14] is an evolution of Weighted Factoring and targets
time-stepping applications. During each time step, the weights of the processors are adjusted to reflect not only the performance of the current time step, but also their cumulative
performance up to that time step. Variants of AWF [27] were also developed that replace the
time-stepping requirement for updating the processor weights
with a different requirement. Batched AWF (AWF-B) schedules tasks in batches just
as AWF does, but uses timings from previous batches to update the processor weights, unlike
AWF, which uses timings from previous time steps. In chunked AWF (AWF-C), a new
chunk size is computed every time a processor requests work. This strategy allocates more
work to faster processors from all remaining iterations, unlike Factoring, WF, and AWF-B,
where faster processors are allocated work from the remainder of the current batch. The
formula to calculate the WF chunk size is given by Equation (2.8).
\[ \text{WF chunk size} = w_i \cdot (\text{FAC chunk size}) \tag{2.8} \]

where the FAC chunk size is given by Equation (2.7).
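
The decreasing chunk sequences produced by Equations (2.5) to (2.8) can be sketched as follows. This is a minimal illustration rather than the implementation used in the cited works: the function names are hypothetical, the last chunk of a batch is truncated to the iterations that remain, and the WF weights are assumed to be normalized so that they average to one.

```python
import math

def gss_chunks(n_tasks, n_procs):
    """Chunk sequence from Equation (2.5): ceil(R / n_p) at every request."""
    remaining, chunks = n_tasks, []
    while remaining > 0:
        size = math.ceil(remaining / n_procs)
        chunks.append(size)
        remaining -= size
    return chunks

def fac_chunks(n_tasks, n_procs):
    """Chunk sequence from Equations (2.6) and (2.7): batches of n_p equal chunks."""
    remaining, chunks = n_tasks, []
    while remaining > 0:
        batch = 0.5 * remaining
        size = max(1, math.ceil(batch / n_procs))
        for _ in range(n_procs):
            if remaining == 0:
                break
            take = min(size, remaining)   # assumption: truncate the final chunk
            chunks.append(take)
            remaining -= take
    return chunks

def wf_chunk(fac_chunk, weight):
    """Equation (2.8); rounding up to a whole iteration is an assumption."""
    return max(1, math.ceil(weight * fac_chunk))

print(gss_chunks(100, 4))
print(fac_chunks(100, 4))
print(wf_chunk(13, 1.5), wf_chunk(13, 0.5))
```

For 100 tasks on 4 processors, GSS starts with chunks of 25, 19, and 14 iterations, whereas Factoring starts with a batch of four chunks of 13 iterations followed by a batch of four chunks of 6, so its first allocations are smaller and less likely to overload a processor.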

2.3.5

Adaptive factoring

Adaptive Factoring (AF) [14] is an evolution of the Factoring technique where the
assumption of having the same values for µ and σ of the iteration execution times for all
the processors is relaxed, and they are dynamically estimated after each chunk’s execution.
The estimates are updated after scheduling each iteration to reflect the current state of the
system and are used to calculate the chunk sizes. The theoretical model used in AF is
far more realistic than those used in the other techniques, and hence, in general, AF is expected to
provide better performance than other loop scheduling algorithms, especially in the presence of large algorithmic and systemic variances, which is also confirmed by the reported
experimental results. The formula to calculate the AF chunk size is given by Equation (2.9).
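\[ \text{AF chunk size}_{p_i} = \frac{D + 2 \cdot T \cdot R - \sqrt{D^2 + 4 \cdot D \cdot T \cdot R}}{2 \mu_{p_i}} \tag{2.9} \]

where \( D = \sum_{i=1}^{n_p} \sigma_{p_i}^2 / \mu_{p_i} \) and \( T = 1 / \left( \sum_{i=1}^{n_p} 1 / \mu_{p_i} \right) \).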
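
A minimal sketch of the chunk-size rule in Equation (2.9) is given below. It assumes that the per-processor estimates of µ and σ are already available as lists; the function name `af_chunk_size` and the rounding down and clamping to the range [1, R] are illustrative assumptions, not part of the formula.

```python
import math

def af_chunk_size(i, mu, sigma, remaining):
    """Chunk size for processor i following the form of Equation (2.9).

    mu[k] and sigma[k] are the current estimates for processor k.
    Assumption: the result is rounded down and clamped to [1, remaining].
    """
    d = sum(s * s / m for s, m in zip(sigma, mu))      # D in Equation (2.9)
    t = 1.0 / sum(1.0 / m for m in mu)                 # T in Equation (2.9)
    root = math.sqrt(d * d + 4.0 * d * t * remaining)
    raw = (d + 2.0 * t * remaining - root) / (2.0 * mu[i])
    return max(1, min(remaining, math.floor(raw)))

# Two identical processors without variance: each is offered half of the work.
print(af_chunk_size(0, [1.0, 1.0], [0.0, 0.0], 100))
# Hypothetical heterogeneous case: processor 0 is twice as fast as processor 1.
print(af_chunk_size(0, [1.0, 2.0], [0.5, 0.5], 100),
      af_chunk_size(1, [1.0, 2.0], [0.5, 0.5], 100))
```

With zero variance the rule reduces to T · R / µ_pi, that is, a share proportional to processor speed; a non-zero variance shrinks the chunks so that more work is held back for later rebalancing.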

2.4

Robustness of scheduling algorithms

High-performance parallel and distributed systems operate in uncertain environments, characterized by
unexpected variations in the workload, resource failures, and other disturbances. However, such systems are expected to satisfy a certain quality of service (QoS) despite operating
in such an environment. In other words, such systems are expected to be robust with respect to variations in the performance-affecting parameters. Several definitions exist for
a robust system. According to [47], a robust system guarantees a certain level of performance
despite fluctuations in the operating environment. The study of robustness is an
active area of research. In this section, we survey the progress in the field of robustness
with respect to robust schedule generation, robustness metrics and frameworks to study the
robustness of resource allocation design, and the application of the robustness framework
to study the robustness of DLS algorithms.

2.4.1

Static robust schedule

A slack-based approach that assigns some slack to schedulable entities so
that they can absorb some level of uncertainty without the need for rescheduling is investigated in [33]. Two time-based techniques were proposed, namely, the time window slack
(TWS) and the focused time window slack (FTWS). In TWS, each scheduling entity always has a certain amount of slack. In FTWS, the slack of a scheduling entity is based on
when it is scheduled, and entities that are scheduled later have more slack. A static heuristic for scheduling applications on multiple machines that meets certain quality of service
(QoS) constraints is presented in [9]. The proposed heuristic, called the Duplex heuristic, makes use of a generalized robustness metric proposed in [8] and considers two constraints, namely, the latency constraint and the throughput constraint. The Duplex heuristic
simply executes two existing heuristics, namely, the Most Critical Task First (MCTF) and
the Most Critical Path First (MCPF) heuristics, and selects the schedule that has the higher robustness.
The MCTF heuristic is expected to perform well when the throughput constraints are more
stringent than the latency constraints, and the MCPF heuristic is expected to perform well
otherwise. A comparative study of greedy heuristics versus iterative algorithms for scheduling
applications on multiple machines that meets certain quality of service (QoS) constraints
is reported in [7]. Two iterative algorithms, namely, a genetic algorithm and simulated
annealing, were considered, and the iterative algorithms were found to perform better than the
greedy heuristics.
Static heuristics for scheduling periodic applications in a shipboard environment are presented in [61]. Different heuristics, such as mapping computationally intensive applications
first, allocating resources based on the worth of the sequence in which the applications
are executed, and others, were proposed. The proposed heuristics consider two constraints,
namely, the throughput constraint and the latency constraint. Any schedule in which the overall utilization of the computation and the communication resources is below the system’s
capacity, and which satisfies the QoS constraints, is called a feasible schedule. The use of greedy
approaches to generate statically robust schedules for periodic sensor-driven distributed
systems is presented in [63]. The use of evolutionary algorithms, such as steady-state genetic algorithms, ant colony optimization, and simulated annealing, to generate statically
robust schedules for periodic sensor-driven systems is presented in [62]. The problem of
robust resource allocation in weather data processing systems is studied in [55]. The goal
of a schedule is to maximize the robustness of the system by minimizing the makespan
of the high-priority tasks and maximizing the overall worth of the medium- and low-priority
tasks. The worth of a task is defined as the product of the priority of the task
and the likelihood of its completing before the arrival of the next data set. Several heuristics,
such as minimum execution time, minimum completion time, and two-phase heuristics such as
Min-Min and Max-Min based on the completion time of the tasks, were considered.
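
For readers unfamiliar with two-phase heuristics, the sketch below shows the textbook form of Min-Min on a matrix of estimated execution times. It is not necessarily the variant evaluated in [55]; the function name `min_min`, the matrix layout, and the tie-breaking by task index are assumptions made for illustration.

```python
def min_min(etc):
    """Textbook Min-Min mapping.

    etc[t][m] is the estimated execution time of task t on machine m.
    Returns (assignment, makespan), where assignment maps task -> machine.
    """
    n_machines = len(etc[0])
    ready = [0.0] * n_machines            # time at which each machine is free
    unmapped = set(range(len(etc)))
    assignment = {}
    while unmapped:
        best = None
        for t in sorted(unmapped):        # sorted only to make ties deterministic
            # Phase 1: machine giving task t its minimum completion time.
            m = min(range(n_machines), key=lambda j: ready[j] + etc[t][j])
            completion = ready[m] + etc[t][m]
            # Phase 2: keep the task whose minimum completion time is smallest.
            if best is None or completion < best[0]:
                best = (completion, t, m)
        completion, t, m = best
        assignment[t] = m
        ready[m] = completion
        unmapped.remove(t)
    return assignment, max(ready)

# Hypothetical execution times for three tasks on two machines.
assignment, span = min_min([[3, 5], [2, 4], [6, 1]])
print(assignment, span)
```

Max-Min differs only in the second phase, where the task with the largest minimum completion time is mapped first.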

A method for measuring the robustness of resource allocations in a distributed, heterogeneous platform servicing a high volume of web requests is presented in [66]. The stochastic
robustness metric (SRM) is defined as the probability that all pending and currently executing
tasks at time t will meet their deadlines. The dynamic SRM, which is defined as the average
of the instantaneous SRM values, is used to compare the robustness of resource allocation
schemes. The objective of a resource allocation scheme is to minimize the overall cost,
where the cost is a penalty for not processing a task within a specified time. Three heuristics, namely, the two-phase greedy, segmented two-phase greedy, and negotiation heuristics,
were compared using the dynamic SRM. A heuristic for robust resource allocation in a cluster-based
imaging system is presented in [68]. The resource allocation scheme follows the
minimum completion time heuristic and was shown to perform better than two commonly
used heuristics, namely, the round-robin heuristic and the random assignment heuristic.
A decentralized market-based resource allocation scheme for a heterogeneous computing system is presented in [67]. The system model consists of enterprise service bus (ESB) components, service requesters, and service providers. The ESB architectural model is used
in the design of applications that rely on the service-oriented architecture (SOA) model. A primary function of an ESB component is to route a service request to a corresponding service
provider. The ESB components advertise the price associated with each component along
with the price associated with a link that connects a component to a service provider. Each
service request has a priority associated with it, and each service requester requests services at a certain rate. Each processed service request also has a worth associated with
it. Each service requester attempts to maximize the worth of its requests by taking into
account the cost associated with them.

2.4.2

Robustness metric

Probabilistic guarantees for fault-tolerant real-time systems are addressed in [25]. The
authors derive a model that identifies the maximum fault frequency the system can handle
without violating any real-time constraint. The constructed model is used to derive the
probability that the system will not experience faults at a rate greater than the maximum
fault frequency. The fault model can handle both software and hardware transient failures.
Deriving a metric for the robustness of the makespan scheduling problem is addressed
in [24]. The authors propose integrating the robustness analysis into the design of the
scheduler. The scheduler constructs a set of iso-schedules, which are schedules with the
same cost. Some schedules within the set of iso-schedules are less sensitive to variations
(hazards) than others and are called shifted-schedules. A shifted-schedule is considered a
robust schedule. An alternative measure of the robustness of a schedule is the number of critical
components in the schedule, with a lower number indicating a more robust schedule. An
entropy is associated with a schedule that defines the probability of the schedule becoming a
critical one. Computing the entropy of a schedule is non-trivial in the general case.
A general methodology for deriving a robustness metric of an allocation scheme is presented in [8]. The method, named the FePIA procedure (for performance features,
perturbation parameters, impact, and analysis), is a four-step procedure: (i) the performance features that are of interest are described quantitatively, (ii) the perturbation parameters
that impact the performance features are identified, (iii) the impact of the perturbation
parameters on the performance features is identified, and (iv) the smallest variation in the
perturbation parameters that will violate the robustness requirement is determined. The
robustness metric of an allocation scheme is the minimum of all the smallest variations in
the perturbation parameters that violate the robustness requirement.
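
Steps (iv) and the final minimum can be sketched numerically as follows. This is a deliberately simplified illustration and not the formulation of [8]: it assumes that each perturbation parameter is a single scalar that can only increase, that the requirement is an upper limit on one performance feature, and that a coarse step search is acceptable. The function names and the example values are hypothetical.

```python
def robustness_radius(feature, nominal, limit, step=0.01, max_steps=100000):
    """Smallest increase of a scalar perturbation parameter (to within `step`)
    that pushes feature(parameter) above `limit`.

    Assumptions: scalar parameter, increases only, simple step search.
    """
    for k in range(max_steps + 1):
        delta = k * step
        if feature(nominal + delta) > limit:
            return delta
    return float("inf")   # requirement never violated within the search range

def robustness_metric(radii):
    """Minimum over all per-parameter radii."""
    return min(radii)

# Hypothetical example: the finishing time of each machine must stay <= 30.
# Machine 1 runs 10 tasks of nominally 2.0 time units each,
# machine 2 runs 4 tasks of nominally 5.0 time units each.
radius_m1 = robustness_radius(lambda t: 10 * t, nominal=2.0, limit=30.0)
radius_m2 = robustness_radius(lambda t: 4 * t, nominal=5.0, limit=30.0)
metric = robustness_metric([radius_m1, radius_m2])
print(radius_m1, radius_m2, metric)
```

In this example the allocation tolerates an increase of roughly one time unit per task on machine 1 before the limit is exceeded, and this smaller radius determines the robustness metric.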

2.5

Conclusions

Scheduling of arbitrarily divisible workloads and robustness evaluation of scheduling
algorithms are active research areas and form the basis for this research. In this chapter,
an in-depth survey of the related areas of research was presented. Upon a careful review of
the related literature, we identified the following topics for further investigation:
1. scalability of DLT/DLS algorithms - a useful and necessary study, especially in
the context of large-scale problem and system sizes
2. robustness of DLT algorithms - necessary to understand the flexibility of DLT algorithms with respect to variations in the performance-impacting factors
3. comparative study of DLS and DLT - a comparative study will help to identify
the relative strengths and weaknesses of both approaches, and the results could
potentially be applied towards building a hybrid approach that combines the
strengths of both.

CHAPTER 3
A FRAMEWORK FOR PERFORMANCE EVALUATION

In this chapter, we first discuss the various evaluation techniques available for performance evaluation and then present the criteria for choosing simulations as our evaluation
technique. We then discuss the various evaluation environments available and elucidate the
reasons for choosing the SimGrid [29] simulation framework. We conclude this chapter
with the design of a simulator based on SimGrid that will be employed in this research.
The simulator we develop is one of the main contributions of this research.

3.1

Performance evaluation

Performance evaluation refers to the process of systematically evaluating the performance
of a concept under study in order to gain insight into the concept and to
investigate where further improvements are possible. Three commonly employed performance evaluation techniques are: (1) measurement, (2) simulation, and (3) analytical
modeling [46]. Measurement involves running experiments on a real computing platform
and, in principle, can demonstrate the feasibility of the concept under study. A simulation
model can provide better insights by allowing the concept to be studied under a wider
variety of workloads, and an analytical model can provide the best insight when the effects
of different parameters and their interactions are required [46]. Choosing an appropriate
evaluation technique is a key step in the performance evaluation process, and the choice depends
on the needs of the study. The following criteria are essential to this research study and are
evaluated to choose the relevant evaluation technique.

1. Access to different computing platforms - an important criterion. Easy access to
a wide variety of computing platforms is essential to have high confidence in the validity and applicability of the proposed concept. Obtaining access to platforms with
different interconnection topologies, such as star networks, hypercubes, clusters, 3D
tori, and others, is difficult. However, it is relatively simple to simulate different
platforms, provided the simulation framework allows for it.
2. Control over system parameters - an important criterion that can affect the accuracy of the results obtained. Comparing two techniques that operate in different
systemic environments can lead to misleading results. On a real computing platform,
it is difficult, if not impossible, to have precise control over system parameters
such as system load, network traffic, and others. A simulated system, on the other
hand, provides tighter control over system parameters, such that no unexpected workload can enter the simulated system.
3. Repeatability of experiments - another important criterion that can increase the
confidence in the accuracy of the results obtained. Having control over the system
parameters leads to the repeatability of the experiments. Conducting experiments at
different times with the same set of system parameters should yield the same results.
This is easier to achieve in simulation.
4. Time required - a useful criterion. The time to conduct experiments on a real computing platform can be large, especially when a large number of experiments is required. We believe that simulation experiments can run faster than experiments
on a real computing platform.

Based on the above criteria and the advantages simulations have over measurements
with respect to them, we choose simulations as our evaluation technique. We will also
leverage the use of the linear and the deterministic model offered by DLT and compare the
simulation results with the results predicted by the analytical model.

3.2

Performance evaluation environments

In this study, an evaluation environment refers to a framework that allows for the
modeling of the host CPU, operating system components, storage components, network
components, and others. A number of frameworks are available that allow for easy prototyping
of scheduling algorithms and their evaluation on a variety of computing platforms.
A number of criteria exist, such as the ease of use, the simulation speed and accuracy of the
framework, the simulation scalability, and others, for choosing a framework that caters well to
the needs of the study. In this section, we discuss four commonly available discrete-event
frameworks, namely, Bricks [75], MicroGrid [69], GridSim [26], and SimGrid [29].

3.2.1

Bricks

Bricks allows performance evaluation of computing systems with an emphasis on
network and scheduling algorithms. Resources are modeled using queuing theory.
Bricks allows easy expression of the system, such as the network topology, the communication model, and others, via scripts. Bricks is a Java-based system and provides interfaces
for incorporating external global computing systems. Bricks has been validated by incorporating the Network Weather Service (NWS).

3.2.2

MicroGrid

MicroGrid is an emulator executing real applications on a virtual computational grid
rather than on a simulated computational grid. All the components of a grid, such as, CPU,
network and disks are virtualized. Virtualization is achieved by intercepting platform and
library calls. MicroGrid targets only Globus-compatible applications. Given that MicroGrid is an emulator, the ratio of the simulated time to the simulation time can be high.
Hence, running a large number of experiments for evaluating scheduling techniques can be
time intensive.

3.2.3

GridSim

GridSim is a Java-based simulation toolkit, developed under the Gridbus project and
built on top of the SimJava discrete-event infrastructure. It allows for the simulation of
resources and schedulers for the design and evaluation of scheduling algorithms. It supports
different classes of heterogeneous resources for solving large data-intensive scientific
applications. GridSim supports only the native threading used in current JVMs and does not
support TCP flow management mechanisms [42]. The current threading model places
a limit on the scalability when the number of communicating entities increases beyond
10,000 [48].

3.2.4

SimGrid

SimGrid provides core functionality for the study and evaluation of scheduling algorithms in heterogeneous distributed computing environments. SimGrid offers four different
components for user interaction that provide APIs to simulate various types of applications.
The physical computing platform is expressed through the use of an XML specification file.
Following a careful consideration of various performance evaluation environments, we
choose SimGrid for the following reasons: (1) it provides the ability to rapidly prototype
and evaluate scheduling algorithms, (2) it provides an adequate level of abstraction, and (3)
it offers simulation scalability. A more detailed description of SimGrid is given in the next section.

3.3

Overview of SimGrid

SimGrid provides a simulation framework for studying and analyzing algorithms and
heuristics in large-scale distributed computing environments. The initial goal of the project
was to study the performance of scheduling algorithms in heterogeneous environments.
Over the years, the underlying architecture has been refined and many new features have
been added, resulting in a more modular, scalable, and faster simulation framework.
SimGrid allows the analysis of the execution of parallel applications on various platforms according to certain scheduling algorithms. The resources of the target platform are modeled
using discrete-event simulation, where the operation of the system is represented as a sequence of events in chronological order. The computation model is simplistic: the
time required to compute a task is given by the ratio of the task's computational requirement to the processing capacity of the processing resource. The communication model is
based on an analytical model of TCP where network flows are represented as flows in pipes
[54]. It is also possible to configure packet-level network simulators for the communication
model instead of an analytical model.
The architecture of SimGrid is highly modular. At its core is SURF, the simulation kernel module. SURF provides features to simulate a virtual platform. During each simulation
cycle, SURF interacts with the different types of resources needed at that point in simulation
time to determine by how much the simulated time needs to be advanced. After the completion of each simulated action (computation or communication), the users are notified
and have a chance to execute their code. The SURF simulation kernel module is designed to
be extensible. Therefore, it is easy to plug in a different resource model. SURF deals with
very low-level operations, such as simulation details, and is not intended to be interacted
with directly.
SimGrid provides four modules for users to interact with as shown in Figure 3.1:

1. SimDag - useful for studying algorithms that deal with directed acyclic graphs (hence
the suffix Dag) of tasks,
2. MetaSimGrid (MSG) - for studying scheduling algorithms, though it is also useful
in other contexts,
3. GRAS - useful for developing real-world applications within the simulator and allowing for seamless deployment onto real platforms, and
4. SMPI - useful for studying existing MPI applications.

Figure 3.1
Modular architecture of SimGrid

These modules are built on top of SURF. The interfaces provided by SimDag and MSG
are, in general, intended to be used by researchers, while the interfaces provided by
GRAS and SMPI target application developers. Processes of the target platform are simulated either via the use of a thread library (usually pthreads) or via the use of UNIX98
contexts (ucontexts), also known as fibers. Simulation scalability becomes an issue when
using pthreads, while the use of ucontexts provides increased scalability. Simulating systems with tens of thousands (or more) of processors usually involves tuning a number of
system parameters. The memory consumption of running the simulation also has a direct
impact on the simulation scalability on a specific platform. The bottommost layer of SimGrid
is the core toolbox layer, also known as XBT. The XBT layer provides support for
portability, logging, data structures, and others.
The MSG interface was originally developed for studying scheduling algorithms. It
offers a number of features, such as tasks, data types, simulation functions, platform management functions, and others, which make it a natural choice for our purpose. In the
MSG interface, a task is modeled by two parameters: (1) the amount of computation,
in FLOPs (floating-point operations), required to compute the task, and (2) the communication volume (in bytes) required for transferring the task between two communication end
points. Two optional attributes can also be associated with a task: a name and user-level
data. Tasks can be transferred between hosts using synchronous or asynchronous communication modes. In the synchronous communication mode, the caller is blocked until the
transmission is completed, whereas in the asynchronous mode, the call returns after the
initiation of the transmission, and the caller is immediately available to perform other tasks. It is the
responsibility of the caller to verify the completion of the communication. Computing a
task, however, is always a blocking call, and the caller must wait until the task is computed.
SimGrid requires two input files, namely, the platform file and the deployment file, in
order to run a simulation. The topology of the platform is described in the platform file,
and the deployment details are described in the deployment file. A typical platform file for
describing a cluster of workstations is as follows:
<cluster id="mycluster" radical="0-4095" power="1.0E9"
         bw="2E9" lat="2E-6" bb_bw="2E9" bb_lat="1E-5"/>

This platform file describes a cluster of 4096 workstations, with each workstation capable
of delivering 10^9 FLOPS (floating-point operations per second). The bandwidth and the latency of the private link connecting a host to the backbone network are 2 x 10^9 bytes/second
and 2 x 10^-6 seconds, respectively. Similarly, bb_bw and bb_lat represent the bandwidth
and latency of the backbone network, respectively. The backbone network is used to connect
any pair of hosts and, by default, uses the fatpipe sharing policy, wherein each flow going
through the link gets the entire bandwidth of the link. Other fields of the cluster tag, such as
name, etc., are not shown. A typical structure of the deployment file is as follows:
<!-- The master process with some arguments -->
<process host="Master" function="master">
  <argument value="-N16384"/>
  <argument value="-P128"/>
</process>
<!-- The worker processes with no argument -->
<process host="worker1" function="worker"/>
<process host="worker2" function="worker"/>
The deployment file lets the user map a function onto a host. In this example, the function
master will be executed on the host Master, and the function worker will be executed on
the hosts worker1 and worker2. Master, worker1, and worker2 are host names, which
must match the names specified in the platform file. Arguments to the function can
also be specified in the deployment file, as shown above. The prototype of the function
must match the prototype of the main function in the C programming language, as in int main(int
argc, char **argv).
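
The host-to-function mapping expressed by such a deployment file can be inspected with a few lines of Python, as sketched below. The enclosing platform element is an assumption added here so that the fragment parses as a single XML document; real SimGrid deployment files carry additional header material that is not shown, and the helper name `read_deployment` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, wrapped in an assumed root element.
DEPLOYMENT = """
<platform>
  <process host="Master" function="master">
    <argument value="-N16384"/>
    <argument value="-P128"/>
  </process>
  <process host="worker1" function="worker"/>
  <process host="worker2" function="worker"/>
</platform>
"""

def read_deployment(xml_text):
    """Return a list of (host, function, arguments) tuples, one per process."""
    root = ET.fromstring(xml_text)
    return [
        (p.get("host"), p.get("function"),
         [a.get("value") for a in p.findall("argument")])
        for p in root.findall("process")
    ]

processes = read_deployment(DEPLOYMENT)
for host, function, args in processes:
    print(host, function, args)
```

The argument list of each process corresponds to the argv strings that follow the program name in the C-style prototype.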

3.3.1

Computation model

The computation model of SimGrid is very simplistic. The time required to process
a task is the ratio of the computation requirement of the task to the processor rating
of the processor that executes the task. The computation requirement of a task is defined
as the number of floating-point operations needed to compute the task (FLOPs), and the
processor rating is defined as the number of floating-point operations delivered per second
(FLOPS).
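
As a one-line illustration of this ratio, the sketch below computes the processing time of a hypothetical task. The function name `compute_time` and the task size are invented for the example; the processor rating is the 10^9 FLOPS of the cluster described earlier.

```python
def compute_time(task_flops, host_flops_per_second):
    """Processing time in seconds: task requirement divided by processor rating."""
    if host_flops_per_second <= 0:
        raise ValueError("processor rating must be positive")
    return task_flops / host_flops_per_second

# A hypothetical task of 5 x 10^9 floating-point operations on a 10^9 FLOPS host.
seconds = compute_time(5.0e9, 1.0e9)
print(seconds)
```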

3.3.2

Communication model

Modeling the full complexities of TCP/IP networks in large-scale environments is
often not feasible due to poor scalability, both in terms of the number of network entities
involved and in terms of the simulation time. As a result, framework designers focus on
incorporating flow-based network models into their simulation frameworks in an attempt
to improve the scalability and the simulation speed. Flow-based network models closely
approximate the steady-state TCP/IP network behavior. In the steady state, the problem is
to determine the amount of bandwidth allocated to each flow. The default network model
in SimGrid is a validated, flow-based analytical model [77].
In a flow-based model, a communication is modeled as a single entity between two
end points of communication rather than as packets across individual communication links
along the route between those two end points. If S represents the size of the message, L
represents the end-to-end latency, and B represents the end-to-end bandwidth, then the transfer
time (T) of the message between the two end points is given by:
\[ T = L + \frac{S}{B} \tag{3.1} \]

The end-to-end latency is defined as the sum of the latencies of all the network links along
the path between the two end points. Similarly, the end-to-end bandwidth is defined as the
minimum of the bandwidths of all the network links along the path between the two end
points. When the two end points of communication are connected by a direct link, the
end-to-end latency and the end-to-end bandwidth are equal to the latency and bandwidth of
the link.
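
The transfer time T = L + S/B, with the end-to-end latency and bandwidth defined as above, can be sketched as follows. The route used in the example (private link, backbone, private link) is an assumption based on the cluster described earlier, and the function name `transfer_time` is hypothetical; bandwidth sharing between concurrent flows is deliberately ignored.

```python
def transfer_time(size_bytes, links):
    """Transfer time over a route.

    links is a list of (latency_seconds, bandwidth_bytes_per_second) pairs,
    one per network link along the route. Contention is not modeled.
    """
    latency = sum(lat for lat, _ in links)       # end-to-end latency: sum
    bandwidth = min(bw for _, bw in links)       # end-to-end bandwidth: minimum
    return latency + size_bytes / bandwidth

# Assumed route between two cluster hosts: private link, backbone, private link.
route = [(2e-6, 2e9), (1e-5, 2e9), (2e-6, 2e9)]
t = transfer_time(2e9, route)
print(t)
```

For a message of 2 x 10^9 bytes, the bandwidth term (one second) dominates the accumulated latency of 1.4 x 10^-5 seconds; for very small messages the latency term dominates instead.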
When multiple flows share a network link at any given time,
the bandwidth of the link is shared between the flows. However, as shown in [32], the
bandwidth sharing does not exactly follow the proportional fairness principle,
where the bandwidth allocated to each flow is proportional to its size. Based on [32], the
throughput (or the bandwidth) for a TCP flow can be approximated as follows:
\[ B = \frac{c}{RTT \cdot \sqrt{q}} \tag{3.2} \]

where RTT represents the round-trip time of a flow, defined as the time between sending a packet and receiving an acknowledgment, q represents the fraction of TCP packets lost, and c is some constant. If q is assumed to be a constant, then the bandwidth for a flow is inversely proportional to its RTT. The RTT of a flow is a function of the network link latency and, based on [30], the relation is given by:

$$RTT = d \cdot 2 \cdot L \tag{3.3}$$

where d is some constant and L is the end-to-end latency. The multiplicative factor 2 accounts for the round trip. Based on Equations (3.2) and (3.3), the bandwidth of a flow is inversely proportional to the network link latency, so higher latency results in lower bandwidth.
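Substituting Equation (3.3) into Equation (3.2) makes the dependence on latency explicit. The following worked step is a sketch that assumes, as in the text, that c, d, and q are constants:

```latex
B = \frac{c}{RTT \cdot \sqrt{q}}
  = \frac{c}{2 \cdot d \cdot L \cdot \sqrt{q}}
  \;\propto\; \frac{1}{L}
```

Under these assumptions, doubling the end-to-end latency halves the bandwidth available to the flow.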

Figure 3.2
Actual latency as a function of expected latency

Figure 3.3
Actual bandwidth as a function of expected latency and expected bandwidth

Figure 3.2 and Figure 3.3 show the actual latency (L) and the actual bandwidth (B) experienced by a flow on a single network link, obtained by sending ping-pong messages between the two end points of the communication. The expected latency (L̂) and the expected bandwidth (B̂) are the values specified in the platform file for the network link characteristics. Based on these figures, it can be seen that the actual latency is a simple function of the expected latency, while the actual bandwidth is a function of both the expected latency and the expected bandwidth. Specifically, the actual latency is approximately one tenth of the expected latency. The actual bandwidth is unaffected at low latency values, whereas for higher latency values the actual bandwidth drops significantly. For instance, the actual bandwidth is approximately one hundredth of the expected bandwidth when the expected bandwidth is $10^{9}$ bytes/sec and the expected latency is $10^{-3}$ seconds. The relations between the actual and the expected network latency and bandwidth values can be summarized as follows:
$$L = \beta \cdot \hat{L} \tag{3.4}$$

$$B = \min\left(\gamma \cdot \hat{B},\ \frac{\delta}{\hat{L}}\right) \tag{3.5}$$

where β = 0.1, γ = 1, and δ = $10^{4}$ are empirically determined constants. The difference between the expected and the actual values is attributed to how TCP functions. For instance, since TCP offers congestion control, at most W packets can be transmitted without the receipt of an acknowledgment, where W is the congestion window size. For a more detailed description of SimGrid's network model, the reader may refer to the SimGrid documentation [30].
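Equations (3.4) and (3.5) can be checked numerically with a small sketch. The constants are the empirically determined values quoted above (β = 0.1, γ = 1, δ = 10^4); the function names are hypothetical and the code is not taken from SimGrid.

```python
BETA = 0.1    # latency factor, Equation (3.4)
GAMMA = 1.0   # bandwidth factor, Equation (3.5)
DELTA = 1e4   # constant of the latency-dependent bandwidth cap, Equation (3.5)


def actual_latency(expected_latency):
    """Equation (3.4): L = beta * L_hat."""
    return BETA * expected_latency


def actual_bandwidth(expected_bandwidth, expected_latency):
    """Equation (3.5): B = min(gamma * B_hat, delta / L_hat)."""
    return min(GAMMA * expected_bandwidth, DELTA / expected_latency)


# Example from the text: B_hat = 1e9 bytes/sec and L_hat = 1e-3 s.
ratio = actual_bandwidth(1e9, 1e-3) / 1e9
print(ratio)  # about one hundredth of the expected bandwidth
```

For small expected latencies the first argument of `min` dominates and the bandwidth is unaffected, which is the behavior described for low latency values.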

3.4

Simulator design

The simulation study performed in this dissertation is based on a simulator built on top of the SimGrid simulation framework. The central feature of the simulator is the workload scheduler built on top of the APIs provided by the MSG interface. The workload scheduler is an entity responsible for scheduling and mapping of tasks onto available resources, such that an objective function (or a set of objective functions) is satisfied. The objective of the scheduler is to minimize the overall processing time of the task queue, also known as the makespan minimization problem.
The simulation is launched by SimGrid upon reading two mandatory files: the platform file and the deployment file. The platform file contains the description of the target simulated platform. Details such as the hosts, their processing power, the interconnection topology, and the link speeds are described in this file. Thus, SimGrid provides the flexibility to simulate different platforms. The deployment details are specified in the deployment file, which maps a function to execute onto a host (simulated processor or node). Arguments can also be passed to the simulated processes that run on different hosts. The platform and the deployment files are described in XML format.

One of the simulated processes is a special process called the master process, which is responsible for scheduling; the scheduler runs in the context of the master process. The scheduler maintains a task queue, which is simply a collection of tasks. Tasks are modeled using the task creation API provided by the MSG interface. Modeling a task requires two parameters: the computational effort required to compute the data (measured in FLOPs), and the communication size (measured in bytes) to transfer the data associated with the task between two communication end points. The numerical values for the task parameters are based on the modeling of real applications and are described in detail in Chapter 4.
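A minimal sketch of this task abstraction is given below. It does not use the MSG API; the `Task` class, its field names, and the example values are hypothetical and serve only to show how the two task parameters combine with a processor rating.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    flops: float       # computational effort, in floating point operations
    size_bytes: float  # data transferred with the task, in bytes


def compute_time(task, rating_flops_per_s):
    """Time to compute a task on a processor with the given rating (FLOPS)."""
    return task.flops / rating_flops_per_s


# A task queue is simply a collection of tasks held by the master process.
task_queue = deque(Task(flops=1.72e9, size_bytes=8e6) for _ in range(4))
print(compute_time(task_queue[0], 1.72e9))
```

The transfer cost of `size_bytes` would be obtained separately from the communication model of Section 3.3.2.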

3.4.1

DLT simulation sequence flow

Figure 3.4
Execution flow when the scheduler employs the DLT algorithm

In the traditional divisible load scheduling problem, it is assumed that the workload
for processing resides at a single source and is partitioned according to the optimality
principle and distributed to all available processing resources. This model of divisible load
scheduling is commonly referred to as single-source scheduling. In this work, we study
single-source scheduling. The single source that contains the workload is referred to as
the master process p0 and the rest of the processes that are involved in the processing
of the workload are called the worker processes. The master process is also involved in
processing the workload. Using a linear cost model for computation and an affine cost
model for communication, the runtime of a processor $p_i$ when the workload is partitioned
by applying the divisible load theory is given by:

$$T_i = \begin{cases} \alpha_0 \cdot T_{cp_0} & \text{if } i = 0 \\ \displaystyle\sum_{\forall j \in r(p_0, p_i)} \left(L_j + \alpha_i \cdot T_{cm_j}\right) + \alpha_i \cdot T_{cp_i} & \text{if } i \neq 0 \end{cases} \tag{3.6}$$

$$\sum_{i=0}^{n_p - 1} \alpha_i = n_t \tag{3.7}$$

where $r(p_{src}, p_{dst})$ is an ordered list of links that represents the physical route from processor $p_{src}$ to processor $p_{dst}$. It is possible to use different cost models for computation and communication. The linear and the affine cost models for computation and communication, respectively, are, however, the most widely used models in DLT. Equation (3.6) represents the expected finish time of a processor $p_i$, and Equation (3.7) states that the sum of tasks distributed to all the processors equals the total number of tasks. It must be noted that since DLT partitions the workload statically, the actual route $r(p_{src}, p_{dst})$ between two processors $p_{src}$ and $p_{dst}$ must be known at the time of scheduling.

Figure 3.4 shows the sequence flow during simulation when the scheduler employs the DLT algorithm. The scheduler, which runs in the context of the master process, solves the linear system of equations represented by Equations (3.6) and (3.7) by applying the optimality principle of DLT, which states that for optimal load distribution, all processors must finish computing at the same instant. Once the load distribution α is identified, the master process distributes the load fractions to the worker processes and also computes its own share of tasks. A worker process, upon computing its tasks, reports back to the master. The simulation ends when all the worker processes have reported back to the master process upon computing their tasks. It must be noted that, unlike in the case of DLS techniques, the worker processes receive their tasks in one chunk since DLT identifies the load distribution in a static fashion.

3.4.2

System of linear equations

The load distribution α is an ordered $n_p$-tuple $(\alpha_0, \alpha_1, \cdots, \alpha_{n_p - 1})$ and is obtained by solving Equations (3.6) and (3.7) under the optimality principle of DLT. One approach to obtain α is to represent the linear system of equations as a matrix equation of the form Ax = b and solve for x. A is a square matrix of size $n_p + 1$ containing known quantities, x is a column vector of $n_p + 1$ unknown quantities, and b is a column vector of $n_p + 1$ known quantities. The known quantities are $n_t$, $L_j$, $B_j$, $T_{cm_j}$, and $T_{cp_i}$, and the unknown quantities are the load fractions α and the parallel runtime $T_{par}$. Through the process of LU factorization, the matrix A is decomposed into the product of a lower triangular matrix L and an upper triangular matrix U, such that L(Ux) = b. First, the matrix equation La = b is solved for a, and then the matrix equation Ux = a is solved to obtain the final solution. The Boost [4] library is used for solving the linear system of equations, since it provides easy access to uBLAS, a C++ template class library that supports vector and matrix storage and operations.
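The construction and solution of the system can be sketched in a few lines. The sketch below is written in pure Python rather than with Boost uBLAS, so it is a stand-in for, not a copy of, the simulator's solver. The function names and the pre-summed per-route inputs (`route_latency`, `route_t_cm`) are assumptions made for illustration, and partial pivoting is added for numerical safety.

```python
def lu_solve(A, b):
    """Solve A x = b by LU factorization with partial pivoting."""
    n = len(A)
    A = [row[:] for row in A]  # work on copies
    b = b[:]
    for k in range(n):
        # Bring the largest remaining entry of column k to the pivot row.
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        if A[p][k] == 0.0:
            raise ValueError("matrix is singular")
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            A[r][k] /= A[k][k]  # multiplier, stored in the L part
            for c in range(k + 1, n):
                A[r][c] -= A[r][k] * A[k][c]
    # Forward substitution: L a = b (L has an implicit unit diagonal).
    a = b[:]
    for r in range(n):
        a[r] -= sum(A[r][c] * a[c] for c in range(r))
    # Back substitution: U x = a.
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (a[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x


def dlt_system(n_tasks, t_cp, route_latency, route_t_cm):
    """Build A and b for the unknowns (alpha_0, ..., alpha_{np-1}, T_par).

    t_cp[i]          -- time for processor i to compute one task
    route_latency[i] -- sum of L_j over the route from p0 to p_i (entry 0 unused)
    route_t_cm[i]    -- sum of T_cm_j over the same route (entry 0 unused)
    """
    n_p = len(t_cp)
    A = [[0.0] * (n_p + 1) for _ in range(n_p + 1)]
    b = [0.0] * (n_p + 1)
    for i in range(n_p):
        comm = 0.0 if i == 0 else route_t_cm[i]
        A[i][i] = comm + t_cp[i]   # coefficient of alpha_i
        A[i][n_p] = -1.0           # all processors finish at T_par
        b[i] = 0.0 if i == 0 else -route_latency[i]
    A[n_p][:n_p] = [1.0] * n_p     # load fractions sum to n_t
    b[n_p] = float(n_tasks)
    return A, b


# Two processors: the master p0 and one worker reached over one link.
A, b = dlt_system(10, t_cp=[1.0, 1.0], route_latency=[0.0, 2.0], route_t_cm=[0.0, 1.0])
*alpha, t_par = lu_solve(A, b)
print(alpha, t_par)
```

The load fractions returned here are real-valued; how fractional tasks are rounded to whole tasks is not addressed by this sketch.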

3.4.3

DLS simulation sequence flow

Figure 3.5
Execution flow when the scheduler employs DLS techniques

Figure 3.5 shows the sequence flow during a simulation when the scheduler employs DLS techniques. The SimGrid simulation framework launches the simulation upon reading the platform and the deployment files. The master process, which is responsible for scheduling the workload, is in a continual listen state once launched, where it listens for work requests from the worker processes. The worker processes, which are responsible for processing the workload, are in a request state once launched, where they request work from the master process. Upon receiving a work request, and if the work queue is not empty, the master process allocates work to the requesting worker process based on the rules of the scheduling policy. The scheduling techniques the scheduler implements are based on probabilistic analyses and are described in Chapter 2. Upon receiving the work, the worker process transitions to the working state, where the workload is processed. When finished, the worker process transitions back to the request state and sends a request for more work. When the work queue is exhausted, the master process responds to further work requests by sending a 'work queue empty' message. Upon receiving this message, the worker process returns the flow of control to the caller (MSG API). The master process also returns the flow of control to the caller after notifying all the worker processes about the empty work queue.
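The request and allocation loop described above can be mimicked with a small event-driven sketch. It is not the simulator's code: communication costs are ignored, the function and parameter names are hypothetical, and the `chunk_size` callable is only a placeholder for whichever scheduling policy from Chapter 2 is in use.

```python
import heapq


def simulate_self_scheduling(n_tasks, task_time, chunk_size):
    """Simulate the master/worker request loop and return the makespan.

    task_time[w] -- seconds worker w needs per task (hypothetical input)
    chunk_size   -- callable(remaining, n_workers) -> number of tasks to hand
                    out; stands in for the scheduling policy
    """
    n_workers = len(task_time)
    # Every worker starts in the request state at time 0.
    requests = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(requests)
    remaining = n_tasks
    makespan = 0.0
    while requests:
        now, w = heapq.heappop(requests)  # master serves the earliest request
        if remaining == 0:
            # 'work queue empty' reply: the worker returns control and exits.
            makespan = max(makespan, now)
            continue
        chunk = max(1, min(remaining, chunk_size(remaining, n_workers)))
        remaining -= chunk
        # Working state, followed by a new request once the chunk is done.
        heapq.heappush(requests, (now + chunk * task_time[w], w))
    return makespan


# One task per request, two equally fast workers, four tasks.
print(simulate_self_scheduling(4, [1.0, 1.0], lambda remaining, p: 1))
```

A decreasing-chunk rule such as `lambda remaining, p: -(-remaining // p)` can be passed in the same way; it is shown only as an example of the interface, not as one of the techniques evaluated in this work.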

3.5

Conclusions

In this chapter, we presented our rationale for preferring simulations over measurements as our evaluation technique. We then reviewed the various evaluation environments available, and discussed the reasons for choosing SimGrid as our evaluation environment. We gave a brief introduction to SimGrid, followed by the design of the simulator, which is one of the main contributions of this research work. Through the use of sequence diagrams, we also explained the working of the scheduler.

CHAPTER 4
APPLICATION AND PLATFORM MODELING

In Chapter 3, we explained the reasons for choosing simulation as the performance
evaluation technique. A common risk in simulation is the extent of the validity and the
applicability of the results when unvalidated models are used or when simulations do not
capture realism. We address the concern of unvalidated models in the simulation by employing a validated simulation framework. In this chapter, we describe the modeling of
the applications and the platforms used in this study with the goal of conducting realistic
simulations.

4.1

Applications

Scientific applications have different characteristics and not all applications are suitable for our study. The most suited applications are those that follow the Bag-of-Tasks (BoT) model, where an application consists of tasks without dependencies that can be processed in any order. A large class of applications, such as N-body simulations, Monte Carlo simulations, pattern matching, file compression, database join operations, graph coloring, and others, can be categorized as BoT applications. Considering there are $n_t$ tasks in an application, the following system of linear equations can be formulated when the application is partitioned by applying the DLT.

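$$T_i = \begin{cases} \alpha_0 \cdot T_{cp_0} & \text{if } i = 0 \\ \displaystyle\sum_{\forall j \in r(p_0, p_i)} \left(L_j + \alpha_i \cdot T_{cm_j}\right) + \alpha_i \cdot T_{cp_i} & \text{if } i \neq 0 \end{cases} \tag{4.1}$$

$$\sum_{i=0}^{n_p - 1} \alpha_i = n_t \tag{4.2}$$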

The relevant notations are given in Table 2.1. The simulations become more realistic when the characteristics of the simulated tasks and the simulated platforms are modeled after real tasks and real platforms, respectively. The simulated applications capture the computational and communication characteristics of two applications from the NAS parallel benchmark suite, embarrassingly parallel (EP) and integer sort (IS), and of three other applications with different communication-to-computation ratios (CCRs).

4.1.1

The embarrassingly parallel (EP) NAS benchmark

The embarrassingly parallel benchmark represents a class of applications without any significant inter-processor communication. One such application available in the NAS parallel benchmark suite addresses the problem of generating pairs of Gaussian random deviates, which is characteristic of many Monte Carlo simulations. This benchmark is also referred to as a pure computation application.

4.1.1.1

Algorithm

For a given problem size g, the algorithm generates $2 \cdot g$ real-valued pseudorandom numbers $r_i$, $1 \le i \le 2 \cdot g$, within the interval (0,1). Then, g pairs $(x_j, y_j)$, $1 \le j \le g$, are generated such that $x_j = 2 \cdot r_{2j-1} - 1$ and $y_j = 2 \cdot r_{2j} - 1$, and are uniformly distributed within the interval (-1,1). For each pair $(x_j, y_j)$ that satisfies the inequality $t_j = x_j^2 + y_j^2 \le 1$, $\left(x_j \cdot \sqrt{(-2 \cdot \log t_j)/t_j},\ y_j \cdot \sqrt{(-2 \cdot \log t_j)/t_j}\right)$ represents a pair of Gaussian random deviates. This algorithm generates approximately $g \cdot \frac{\pi}{4}$ pairs of Gaussian random deviates.
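The acceptance test and the transformation can be sketched as follows. The sketch uses Python's `random` module as a stand-in for the benchmark's own pseudorandom number generator, so it reproduces the method but not the benchmark's exact number stream; the function name and the `seed` parameter are hypothetical.

```python
import math
import random


def gaussian_pairs(g, seed=0):
    """Return the Gaussian pairs accepted out of g candidate pairs."""
    rng = random.Random(seed)  # stand-in for the benchmark's generator
    pairs = []
    for _ in range(g):
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:  # t == 0 is skipped so that log(t) stays defined
            scale = math.sqrt(-2.0 * math.log(t) / t)
            pairs.append((x * scale, y * scale))
    return pairs


g = 100_000
accepted = gaussian_pairs(g)
print(len(accepted) / g, math.pi / 4)  # the two values should be close
```

The accepted fraction approaches π/4 because the candidate points are uniform over a square of area 4 and only those inside the unit circle of area π are kept.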

4.1.1.2

Implementation

The algorithm implementation consists of a doubly nested for loop. The outer loop generates pseudorandom numbers which are consumed by the inner loop. The inner loop generates the pairs of Gaussian random deviates. The size of the outer loop is dependent on the problem size g, while the size of the inner loop is kept fixed at $s_i = 2^{16}$. For instance, for class C problems, $g = 2^{32}$ and the size of the outer loop is $s_o = g/s_i = 2^{16}$. Each iteration of the outer loop generates $2 \cdot s_i = 2^{17}$ pseudorandom numbers. Together, the outer and the inner loops generate $2 \cdot g$ real-valued pseudorandom numbers and approximately $g \cdot \frac{\pi}{4}$ pairs of Gaussian random deviates. Each iteration of the outer loop can be processed independently.

Figure 4.1
Sequential runtime of the real vs. the simulated EP NAS benchmark for different problem
sizes

In Figure 4.1, the sequential runtimes of the real and the simulated version of the EP benchmark are plotted for different problem sizes ranging from class S to class D. The sequential runtime of the EP benchmark was obtained by executing the benchmark on a single core of an Intel® Core™ i7-2670QM CPU @ 2.20GHz system with 8 CPUs. The simulated sequential runtime for the EP benchmark was obtained by running the simulated application model using SimGrid on a target platform that represented the characteristics of the real Intel-based platform mentioned above.
The computational requirements of the EP benchmark were modeled via profiling its sequential execution on the Intel-based platform. Based on the timings obtained by executing the benchmark, the computational effort required per iteration of the outer loop was estimated (via a counter in the source code on the number of operations executed) to be approximately 9,662,032 floating point operations (FLOPs) to generate $2^{16}$ pairs of Gaussian random deviates, and approximately 147 FLOPs to generate a single pair of Gaussian random deviates (a single iteration of the inner loop). The processing speed (or processor rating) of the simulated target platform was set to 1.72 GFLOPS (giga, or $10^{9}$, floating point operations per second). Using the above values in the simulation, the percentage difference between the sequential runtime of the real EP benchmark and the sequential runtime of the simulated EP benchmark was in the range [0.99%, 1.23%], suggesting a close fit of the estimated computational requirement with the actual computational requirement of this benchmark.
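The way these profiled values translate into a simulated sequential runtime can be sketched as follows. The constants are the ones quoted above; the function name is hypothetical, and the result is the estimate of this simple linear model, not a measured or reported runtime.

```python
FLOPS_PER_OUTER_ITERATION = 9_662_032  # profiled effort per outer-loop iteration
INNER_LOOP_SIZE = 2 ** 16              # s_i, pairs handled per outer iteration
PROCESSOR_RATING = 1.72e9              # FLOPS of the simulated processor


def simulated_ep_runtime(g):
    """Model estimate of the sequential runtime (seconds) for problem size g."""
    outer_iterations = g // INNER_LOOP_SIZE  # s_o = g / s_i
    return outer_iterations * FLOPS_PER_OUTER_ITERATION / PROCESSOR_RATING


# Effort per inner-loop iteration, close to the 147 FLOPs quoted in the text.
print(FLOPS_PER_OUTER_ITERATION / INNER_LOOP_SIZE)
# Class C (g = 2^32): roughly 368 seconds under this model.
print(simulated_ep_runtime(2 ** 32))
```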

4.1.2

The integer sort (IS) NAS benchmark

The integer sort benchmark performs a parallel bucket sort on a sequence of integers,
representative of many particle method-based simulations. In contrast to the EP benchmark, this benchmark involves non-negligible communication.

4.1.2.1

The sorting problem

Given a sequence of keys, $\{k_i \mid i = 0, \ldots, n-1\}$, rearrange the keys in a sequence
$\{s_0, s_1, \ldots, s_{n-1}\}$, such that $k_{s_0} \le k_{s_1} \le \cdots \le k_{s_{n-1}}$.

4.1.2.2

Key generation algorithm

The keys for sorting are sequentially generated by a key generation algorithm. The generated keys lie in the range $[0, k_{max})$, where $k_{max}$ is fixed and depends upon the problem size. For instance, for class A problems $n = 2^{23}$ and $k_{max} = 2^{19}$, and for class B problems $n = 2^{25}$ and $k_{max} = 2^{21}$. Four real-valued random numbers r uniformly distributed over [0, 1] are used to generate a key $k_i$, where $k_i \leftarrow \lfloor (k_{max}/4) \cdot (r_{a_i} + r_{b_i} + r_{c_i} + r_{d_i}) \rfloor$, $\forall\, i = 0, 1, \ldots, n-1$.
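A minimal sketch of the key generation rule is shown below. Python's `random` module is used as a stand-in for the benchmark's generator, and the function name and `seed` parameter are hypothetical.

```python
import math
import random


def generate_keys(n, k_max, seed=0):
    """Generate n keys, each from the sum of four uniform random numbers."""
    rng = random.Random(seed)  # stand-in for the benchmark's generator
    return [
        math.floor((k_max / 4) * sum(rng.random() for _ in range(4)))
        for _ in range(n)
    ]


keys = generate_keys(1000, 2 ** 19)
print(min(keys), max(keys))
```

Because each key is built from the sum of four uniform numbers, the keys cluster around $k_{max}/2$ rather than being uniformly spread over the key range.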

4.1.2.3

Implementation

The code for sorting the keys makes use of buckets. The number of buckets, M, is kept fixed at 1024 for all problem classes, except for class S, for which it is set to 512. The overall problem of sorting the keys consists of three phases of computation. In computation phase 1, each parallel process places its keys into the corresponding buckets. Within a bucket the keys are unsorted, but for any two successive buckets $m_i$ and $m_{i+1}$, where $i = 0, 1, \ldots, M-2$, the minimum value of a key in bucket $m_{i+1}$ is always greater than or equal to the maximum value of a key in bucket $m_i$. In computation phase 2, using the bucket size totals for the entire problem, each process determines the redistribution of the keys. Specifically, each process determines how many keys it needs to distribute to every other process. In the third and final computation phase, each process sorts its share of the keys, and all the keys are sorted at the end of this phase.
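The three computation phases can be illustrated with a sequential sketch. This is not the NAS implementation: the processes are emulated by plain lists, no message passing takes place, and the balancing rule used in phase 2 (contiguous bucket ranges with roughly equal key counts) as well as all names are assumptions made for illustration.

```python
def bucket_sort(keys, k_max, n_buckets=1024, n_procs=4):
    """Sequential emulation of the three computation phases."""
    # Phase 1: place each key into its bucket. Bucket b holds keys in
    # [b * width, (b + 1) * width), so buckets are ordered among themselves.
    width = -(-k_max // n_buckets)  # ceiling division
    buckets = [[] for _ in range(n_buckets)]
    for k in keys:
        buckets[k // width].append(k)
    # Phase 2: use the bucket sizes to assign contiguous bucket ranges to
    # processes so that each receives roughly len(keys) / n_procs keys.
    target = len(keys) / n_procs
    shares = [[] for _ in range(n_procs)]
    proc, assigned = 0, 0
    for bucket in buckets:
        if proc < n_procs - 1 and assigned >= target * (proc + 1):
            proc += 1
        shares[proc].extend(bucket)
        assigned += len(bucket)
    # Phase 3: each process sorts its share; the concatenation is sorted.
    return [k for share in shares for k in sorted(share)]


keys = [5, 3, 9, 0, 7, 3, 1]
print(bucket_sort(keys, k_max=10, n_buckets=4, n_procs=2))
```

Because every process receives a contiguous range of buckets, sorting each share locally is enough to obtain a globally sorted sequence.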

Figure 4.2
Sequential runtime of the real vs. the simulated IS NAS benchmark for different problem
sizes

Figure 4.2 illustrates the sequential runtimes of the real and simulated versions of the IS benchmark plotted for different problem sizes, ranging from class S to class C. Note that no communication time is contained in the values plotted in Figure 4.2, since IS was executed sequentially. The real benchmark was executed on the same Intel-based system as the EP benchmark, while the simulated runtime was obtained by running the IS application model using SimGrid on a target platform that represents the characteristics of the real Intel-based system. The application model for the IS benchmark is obtained by profiling the execution of the IS benchmark to capture its computational requirements. Based on the timings obtained by executing the IS benchmark, the computational requirements of the three computation phases were estimated to be approximately 140.55, 67, and 57.4 operations, respectively. The computation phases 1 and 3 each require O(n/p) operations, while the computation phase 2 requires O(M) operations. The processor rating of the simulated platform was set to 1.72 GFLOPS. Using the above values in the simulated sequential execution of IS, the percentage difference between the sequential computational times of the real application and the simulated application was found to be in the range [1.75%, 4.64%], suggesting a close fit of the estimated computational requirements with the actual computational requirements of this benchmark.

4.1.3

Applications with different CCRs

Three other applications that exhibit different communication-to-computation ratios (CCRs), ranging from computation-bound to communication-bound, are also considered. They are: 1) a computation-bound application where each task performs multiplication of two dense square matrices of size 3500 containing real numbers; 2) an intermediate application where each task sorts a list of one million real numbers; and 3) a communication-bound application where each task performs addition of two dense square matrices of size 3500 containing real numbers. Matrix multiplication, matrix addition, and sorting operations are frequently encountered in many scientific applications.

Table 4.1
Task characteristics

Application type        zt (MB)    wt (MFLOPs)    CCR (bytes/FLOP)
pure computation        0          9.66           0
computation-bound       196        85,737.75      0.002
intermediate            8          13.81          0.579
communication-bound     196        12.25          16.0

The task characteristics for the four applications are given in Table 4.1. The IS benchmark does not fit into the BoT model and hence it is not represented in the table. A task is defined by two attributes, namely, the communication requirement, represented as $z_t$, and the computation requirement, represented as $w_t$. The communication requirement represents the amount of data required to transfer a task from one processor to another processor in order to compute the task at the receiving processor. The communication requirement is expressed in MB (million bytes). The computation requirement represents the computational effort required to compute a task and is expressed in MFLOPs (million floating point operations). The CCR of a task, which is derived from its attributes, is also given in Table 4.1 and is expressed in bytes per FLOP. The CCR represents the granularity of tasks. Tasks with low CCRs, such as the ones that represent the pure computation and computation-bound applications, are considered fine grained, and tasks with high CCRs, such as the one that represents the communication-bound application, are considered coarse grained. Fine grained tasks are crucial to obtain high execution efficiency from parallel processing. The task characteristics of the computation-bound, intermediate, and communication-bound applications are based on the work reported in [20].
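The CCR column of Table 4.1 follows directly from the other two columns, as the short sketch below shows. The dictionary layout and names are hypothetical; the numbers are those listed in the table.

```python
def ccr(z_t_mb, w_t_mflops):
    """Communication-to-computation ratio in bytes per FLOP."""
    return z_t_mb / w_t_mflops  # the factors of 10^6 cancel


# (z_t in MB, w_t in MFLOPs) as listed in Table 4.1.
tasks = {
    "pure computation": (0.0, 9.66),
    "computation-bound": (196.0, 85_737.75),
    "intermediate": (8.0, 13.81),
    "communication-bound": (196.0, 12.25),
}
ratios = {name: ccr(z, w) for name, (z, w) in tasks.items()}
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

As a side observation, the two matrix entries are consistent with 3500 × 3500 matrices of 8-byte reals: two such matrices occupy 2 · 3500² · 8 = 196 · 10^6 bytes, adding them takes 3500² = 12.25 · 10^6 operations, and multiplying them takes 2 · 3500³ − 3500² = 85,737.75 · 10^6 operations. The 8-byte element size is an assumption; the source attributes the values to [20].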

4.2

Platforms

A high performance parallel and distributed computing platform is a collection of processors designed to process parallel tasks of an application in a concurrent fashion, leading to a reduced processing time of the application. The network topology of a platform refers to the manner in which the processors are linked together. Based on linking, networks can be classified as direct or indirect. In a direct network, a direct connection exists between processors, and in an indirect network, no direct connection exists between processors. An indirect network typically involves switches. In this work, we make use of both direct and indirect networks with the following interconnection topologies.

4.2.1

Star

In a star network, the root node is directly connected to all its children nodes. More formally, a system of $n_p$ processors $(p_0, p_1, \ldots, p_{n_p - 1})$ and $n_p - 1$ links $(l_1, l_2, \ldots, l_{n_p - 1})$ is said to be interconnected in a star fashion if and only if a communication link $l_i$ exists between the nodes $p_0$ and $p_i$, where $0 < i < n_p$.
A star topology has a number of advantages, such as a very small network diameter (the maximum distance between any two compute nodes is only 2 hops) and high fault tolerance (the failure of a child node or a link has no impact on the rest of the network). However, it also has a central point of failure and a high root-node degree ($n_p - 1$), which can limit the size of the network. The motivation for choosing this topology is based on the advantages it offers, along with the number of research works that utilize this topology.

Figure 4.3
Illustration of a star topology with 4 compute nodes and 3 links

4.2.2

Cluster

A computer cluster represents a collection of computing nodes interconnected using fast interconnection technologies, such as Ethernet, Gigabit Ethernet, InfiniBand, and others, and is viewed as a single computing platform. The main advantage of a cluster is its cost effectiveness, which makes it an attractive choice as a high performance computing platform.
Figure 4.4 shows the representation of a cluster within the SimGrid simulation framework. Each processing node is connected via a dedicated link to the backbone network. The backbone network is used to connect any pair of processing nodes and typically has a higher bandwidth than the individual dedicated links. A sharing policy can also be specified on the backbone network. By default, the sharing policy is set as fatpipe, wherein each flow going through the backbone network gets the entire bandwidth.

Figure 4.4
Illustration of a cluster with 4 compute nodes and 4 links

4.2.3

3D torus

A 3D torus is one of the widely used platforms in the scientific community. Some of the well known supercomputers that employ a 3D torus are Blue Gene/L [5], Blue Gene/P [6], and Cray XK7 [1]. Such supercomputers are found in renowned research laboratories; for example, the Titan supercomputer at the Oak Ridge National Laboratory is a Cray XK7 system.

Figure 4.5
Illustration of a 4 × 4 × 4 3D torus with 64 compute nodes and 192 links

A system of $n = n_x \cdot n_y \cdot n_z$ nodes is said to be connected in a 3D torus fashion if each node is directly connected to exactly six neighboring nodes, two nodes along each of the x, y, and z axes, where $n_x$, $n_y$, and $n_z$ represent the number of nodes along the x, y, and z axes, respectively. A 4 × 4 × 4 3D torus is illustrated in Figure 4.5. Each node $p_i$, $1 \le i \le n$, has a unique representation $(p_{i_x}, p_{i_y}, p_{i_z})$, where $0 \le p_{i_x} < n_x$, $0 \le p_{i_y} < n_y$, and $0 \le p_{i_z} < n_z$. Given the unique representation of two nodes $p_i$ and $p_j$, they are connected by a direct link along the x axis if and only if the following conditions are satisfied: $p_{i_y} = p_{j_y}$, $p_{i_z} = p_{j_z}$, and $|p_{i_x} - p_{j_x}| = 1$ or $|p_{i_x} - p_{j_x}| = n_x - 1$. Similarly, along the y axis the following conditions must be satisfied for a direct link between $p_i$ and $p_j$: $p_{i_x} = p_{j_x}$, $p_{i_z} = p_{j_z}$, and $|p_{i_y} - p_{j_y}| = 1$ or $|p_{i_y} - p_{j_y}| = n_y - 1$. Similar conditions must be satisfied for a direct link to exist between two nodes $p_i$ and $p_j$ along the z axis. There are $3 \cdot n$ links in this topology.
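The adjacency conditions can be checked with a short sketch, which also counts the links of a small torus and computes a hop count that takes the shorter way around each ring, as dimension order routing in a torus does. The function names are hypothetical, and the brute-force link count is meant only for small examples.

```python
from itertools import product


def ring_distance(a, b, n):
    """Shortest distance between positions a and b on a ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)


def are_neighbors(p, q, dims):
    """True if p and q differ by one ring step along exactly one axis."""
    dists = [ring_distance(a, b, n) for a, b, n in zip(p, q, dims)]
    return sorted(dists) == [0, 0, 1]


def hops(p, q, dims):
    """Hops between p and q when each axis is traversed the shorter way."""
    return sum(ring_distance(a, b, n) for a, b, n in zip(p, q, dims))


def count_links(dims):
    """Count direct links by testing every pair of nodes (small tori only)."""
    nodes = list(product(*(range(n) for n in dims)))
    return sum(
        are_neighbors(p, q, dims)
        for i, p in enumerate(nodes)
        for q in nodes[i + 1:]
    )


dims = (4, 4, 4)
print(count_links(dims))                  # 3 * n links for n = 64 nodes
print(hops((0, 0, 0), (2, 2, 2), dims))   # farthest pair in a 4 x 4 x 4 torus
```

The wraparound links are what make nodes such as (0, 0, 0) and (3, 0, 0) direct neighbors in the 4 × 4 × 4 case.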
Dimension order routing, also known as XYZ routing, is usually employed to route a message between two nodes $p_i$ and $p_j$. In dimension order routing, the message is routed first along the x axis until $p_{i_x} = p_{j_x}$, then along the y axis until $p_{i_y} = p_{j_y}$, and finally along the z axis until $p_{i_z} = p_{j_z}$. Since a node is connected to two neighbors along each axis, the shortest path is chosen when routing a message along an axis. The number of hops traveled by a message when routed in this fashion is given by $d_x + d_y + d_z$, where $d_x = \min(|p_{i_x} - p_{j_x}|,\ n_x - |p_{i_x} - p_{j_x}|)$ and $d_y$ and $d_z$ are defined analogously, and the maximum number of hops traveled by a message is given by:

$$\frac{n_x + n_y + n_z}{2}$$

4.2.4

Fat-tree

A fat-tree network is an example of an indirect network where switches are used to
connect a large number of processors. One of the distinctive features of this topology is that
the tree becomes fatter when traversing bottom-up from the leaf nodes.

Figure 4.6
Illustration of a 2-level fat-tree network with 4 compute nodes

Figure 4.6 shows a 2-level fat-tree network with 4 compute nodes. The compute nodes are represented by circles and the switches are represented by squares. Fat-tree networks can be found in renowned research laboratories; for example, the Stampede system at TACC (Texas Advanced Computing Center) is a 2-level fat-tree network. The Stampede system consists of 8 top level switches and 320 leaf level switches. Each leaf level switch is connected to 20 compute nodes. Thus, the Stampede system supports 6400 compute nodes. The fat-tree topology used in the simulations is modeled after the Stampede system, and hence it is a 2-level fat-tree with no more than 20 nodes connected to a leaf level switch.
SimGrid simulates fat-tree networks based on the work in [82]. The topological parameters of a fat-tree network can be described as follows: $h;\ l_h, \cdots, l_1;\ u_h, \cdots, u_1;\ p_h, \cdots, p_1$, where h is the height of the fat-tree, $l_i$ is the number of lower level nodes connected to a node at level i, $u_i$ is the number of upper level nodes connected to a node at level $i - 1$, and $p_i$ is the number of parallel links connecting two nodes between level i and level $i - 1$. The compute nodes are at level 0, and all levels above that consist of switches. The number of compute nodes in the network is given by the product $\prod_{i=1}^{h} l_i$. Based on this method, the fat-tree network shown in Figure 4.6 can be represented by 2; 2, 2; 2, 1; 1, 2.
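The parameter string can be parsed and the node count computed with a brief sketch. The parser below is hypothetical and is not SimGrid's own; it only follows the ordering of the parameters described above.

```python
from math import prod


def parse_fat_tree(spec):
    """Parse 'h; l_h,...,l_1; u_h,...,u_1; p_h,...,p_1' into its parts."""
    height, lower, upper, parallel = (part.strip() for part in spec.split(";"))

    def to_ints(text):
        return [int(value) for value in text.split(",")]

    h, l, u, p = int(height), to_ints(lower), to_ints(upper), to_ints(parallel)
    if not (len(l) == len(u) == len(p) == h):
        raise ValueError("each parameter list needs one entry per level")
    return h, l, u, p


def compute_nodes(spec):
    """Number of compute nodes: the product of the l_i values."""
    _, lower, _, _ = parse_fat_tree(spec)
    return prod(lower)


print(compute_nodes("2; 2, 2; 2, 1; 1, 2"))  # the network of Figure 4.6
```

The same product reproduces the 6400 compute nodes quoted for Stampede (320 · 20), under the assumption that each top level switch reaches all 320 leaf level switches.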
Routing in a fat-tree network is based on the destination-mod-k routing scheme. Routing
consists of two phases, namely traversing up the tree until the source and the destination
nodes are in the same sub-tree, and traversing down from the sub-tree to the destination.
This routing scheme is described in detail in [82].

4.3

Conclusions

In this chapter, we described the modeling of the applications and the platforms used in this study. Two applications were chosen from the NAS benchmark suite, namely the EP and the IS benchmarks, and three other applications were chosen that exhibit different CCRs ranging from computation-bound to communication-bound. The simulated platforms are modeled after star, cluster, 3D torus, and fat-tree topologies.

CHAPTER 5
A SCALABILITY STUDY OF DLT AND DLS ALGORITHMS

The scalability of an algorithm is defined as a measure of its ability to provide performance proportional to resource usage. Scalability is an important attribute of an algorithm, especially in
the context of large scale systems. In this chapter, we present a scalability study of DLT
and DLS algorithms which were reported in [10] and [11], respectively.

5.1

Why scalability?

Modern high-end supercomputing facilities offer resources that provide a peak performance of $10^{15}$ FLOPS (floating point operations per second), or petaflops [2]. Several initiatives have already begun with the goal of achieving exascale (or $10^{18}$ FLOPS) performance towards the end of the current decade. Such systems will enable further progress in scientific areas such as material science, earth science, fundamental science, biology, medicine, and others [49]. As the scientific applications and the system sizes continue to increase, it is important that application scheduling algorithms scale well to leverage the processing capabilities of the high performance computing systems. The scalability of an algorithm can be defined as a measure of its ability to provide performance proportional to resource usage, and it is an important attribute of an algorithm, especially in the context of large scale computing systems. Efficient and scalable scheduling algorithms are needed, given that the sheer growth in problem sizes can quickly overwhelm even the most powerful computing systems.
In Chapter 3, we presented our rationale for using simulations in this work. Moreover, calls for compute time allocation on large supercomputers usually require preliminary studies that demonstrate the scalability of the program to very high processor numbers, and proof of very good scalability at a lower number of processors (usually the size of a rack, such as 4096 processor cores). Therefore, simulation-based approaches provide significant help in conducting such preliminary scalability studies.

5.2 Strong scaling versus weak scaling

A scalability analysis can be performed via strong scaling, weak scaling, or both. In strong scaling, the problem size is kept fixed while the number of processors used to solve the problem is scaled up. This kind of analysis helps to quantify how different performance metrics, such as the parallel runtime and the efficiency of an algorithm, vary with the increase in the number of processors. Strong scaling can be used to answer questions such as the maximum possible speedup achievable on a given system, or the maximum number of processors that can be used while maintaining a certain level of efficiency. In weak scaling, both the problem size and the number of processors are scaled up, but the problem size per processor is kept fixed. The isoefficiency metric is often used in this kind of scalability analysis; it dictates the rate of growth of the problem size required to maintain a constant efficiency as the number of processors increases.
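As a minimal sketch of the strong-scaling quantities named above, the following Python snippet computes speedup and efficiency from a sequential runtime and a set of parallel runtimes. The function names and the sample runtimes are illustrative assumptions and are not measurements from this work.

```python
def speedup(t_seq, t_par):
    """Speedup: sequential runtime divided by parallel runtime."""
    return t_seq / t_par


def efficiency(t_seq, t_par, n_p):
    """Efficiency: speedup per processor, i.e. T_seq / (n_p * T_par)."""
    return speedup(t_seq, t_par) / n_p


# Hypothetical strong-scaling runtimes (seconds) for one fixed problem size.
t_seq = 100.0
runtimes = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}

for n_p, t_par in runtimes.items():
    print(n_p, round(speedup(t_seq, t_par), 2), round(efficiency(t_seq, t_par, n_p), 2))
```

In a strong-scaling study the efficiency column typically decreases as processors are added, which is the behavior the later sections quantify for the EP and IS benchmarks.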


5.3 DLT algorithms

In this section, the scalability analysis and evaluation of the two NAS benchmark applications (embarrassingly parallel and integer sort) when DLT is applied to partition the workload are presented. Two questions related to their scalability analysis are addressed: (a) What is the fastest time to execute the application? and (b) How should the application size be scaled in relation to the system size in order to maintain a constant efficiency? The first question is related to strong scaling, and the second question, which is related to isoefficiency analysis, arises in weak scaling. The design of the simulation experiments for the empirical scalability study is shown in Table 6.6. For the EP benchmark, problem classes D and E refer to an input size of $2^{36}$ and $2^{40}$, respectively. For the IS benchmark, problem class D refers to an input size of $2^{31}$. The IS benchmark is not as computationally intensive as the EP benchmark, and hence the number of processors used to solve the two problems is different. The design of experiments to study the scalability of DLT algorithms is provided in Table 5.1.

5.3.1 EP benchmark

The EP benchmark is described in detail in Section 4.1.1. The following linear equations can be formulated when DLT is applied to schedule the EP benchmark.

$$T_i = \alpha_i \cdot T_{cp_i}, \quad \text{and} \tag{5.1}$$

$$\sum_{i=0}^{n_p-1} \alpha_i = n_t. \tag{5.2}$$

Equation (5.1) represents the computation time incurred by processor $p_i$ in processing a load fraction of $\alpha_i$ tasks. The sum of tasks assigned to all processors equals the total number of tasks $n_t$ (Equation (5.2)). The relevant notations are given in Table 2.1. The parallel algorithm to generate pairs of Gaussian deviates has negligible communication requirements. Specifically, 10 floating point values are transferred at the end of the computation among all the processors for verification. Thus, this communication is not modeled in the simulated application.

Table 5.1
Design of experiments to study the scalability of DLT algorithms

Application under study               EP                    IS
Problem class                         D, E                  D
Number of divisible tasks ($n_t$)     $2^{20}$, $2^{24}$    $2^{31}$
Computational effort/task ($w_t$)     147                   140.55 ($ph_1$), 67 ($ph_2$), 57.4 ($ph_3$)
System size ($n_p$)                   256-8192              64-1024
System topology                       3D torus              3D torus
Node power rating (GFLOPS)            1.72                  1.72
Link latency L (µseconds) [6]         8                     8
Link bandwidth B (GB/s) [6]           1.5                   1.5
The parallel runtime is the time required to solve the EP problem using divisible load scheduling and is given by:

$$T_{par}(n_t, n_p) = O_{EP}(n_p) + T_{solve}(n_t, n_p) \tag{5.3}$$

where $O_{EP}(n_p)$ is the overhead involved in obtaining the solution according to DLT. Divisible load theory identifies the load distribution $\alpha$, an $(n_p+1)$ ordered tuple given by $(\alpha_0, \alpha_1, \cdots, \alpha_{n_p-1}, T_{par})$, by solving a system of $n_p+1$ linear equations. For performance prediction, the overhead of obtaining the load distribution $\alpha$ must be included in the overall cost of solving a problem using the DLT. By measuring the actual time required to solve the system of linear equations and fitting the data with a least squares quadratic approximation, the number of FLOPs required to obtain $\alpha$ was estimated as:

$$\begin{aligned} f_{EP}(n_p) &= 241.97 \cdot n_p^2 - 11519.5 \cdot n_p + 2104670 \\ O_{EP}(n_p) &= \frac{f_{EP}(n_p)}{R_0} \end{aligned} \tag{5.4}$$

The overhead cost is included in all the analyses.

5.3.1.1 Analytical modeling

The processor equivalence concept of DLT can be used to model the time required to generate pairs of Gaussian deviates. As explained in Section 8.1, all the processors in the system can be combined and replaced by an equivalent processor whose processing time of a task is given by:

$$T_{cp_{eq}}(n_p) = \frac{1}{\sum_{i=0}^{n_p-1} T_{cp_i}} \tag{5.5}$$

Equation (5.5) states that in a homogeneous network, the equivalent processor will process an EP task in $\frac{1}{n_p}$th of the time required by any processor in the non-combined system to process the same task. The time required to solve the EP problem is given by:

$$T_{solve}(n_t, n_p) = n_t \cdot T_{cp_{eq}}(n_p) \tag{5.6}$$

5.3.1.2 Efficiency analysis

The efficiency of the parallel solution is the ratio of the sequential cost to the parallel cost of solving the problem. Figure 5.1 shows a close match between the predicted and the simulated efficiency of the parallel solution of the EP benchmark for two problem classes, D and E. Problem class E is sixteen times larger than problem class D, and hence solving problem class E on a larger number of processors yields a higher efficiency than solving problem class D on the same number of processors. The simulated value is obtained via simulations with the solution obtained by solving Equations (5.1) and (5.2).

Figure 5.1
Predicted vs. simulated efficiency of parallel solution of two EP problem classes, D and E, on [256-8192] processors
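The predicted efficiency can be sketched directly from Equations (5.3), (5.4), and (5.6). The snippet below is a minimal illustration under assumptions of ours: $R_0$ is taken to be the node rating of 1.72 GFLOPS from Table 5.1, every processor is assumed to need $w_t / R_0$ seconds per task, and $n_t$ is taken to be the problem-class input size ($2^{36}$ for class D, $2^{40}$ for class E) with $w_t = 147$ FLOPs.

```python
R0 = 1.72e9   # node rating in FLOPS (Table 5.1); assumed to be the R_0 of Equation (5.4)
W_T = 147.0   # computational effort per task in FLOPs (Table 5.1)


def f_ep(n_p):
    """Fitted FLOP count for computing the DLT load distribution, Equation (5.4)."""
    return 241.97 * n_p**2 - 11519.5 * n_p + 2104670


def t_par(n_t, n_p):
    """Parallel runtime, Equation (5.3), for a homogeneous system."""
    overhead = f_ep(n_p) / R0
    solve = n_t * (W_T / R0) / n_p  # Equation (5.6), homogeneous case
    return overhead + solve


def efficiency(n_t, n_p):
    """Sequential cost divided by parallel cost (n_p * T_par)."""
    t_seq = n_t * W_T / R0
    return t_seq / (n_p * t_par(n_t, n_p))


for n_p in (256, 1024, 8192):
    print(n_p, round(efficiency(2**36, n_p), 3), round(efficiency(2**40, n_p), 3))
```

Under these assumptions the sketch shows the same qualitative behavior as the text: efficiency close to 1 on 256 processors, and a much lower efficiency for class D than for class E on 8192 processors.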
Figure 5.2 shows the predicted efficiency plot for all the problem classes for a wide range of system sizes. The darker shade in the figure indicates a lower efficiency and the lighter shade indicates a higher efficiency. For the largest system size used (8192 processors), the efficiency is in single digits on a scale of [0-100] for all problem classes except problem class E. For all the problem classes, the efficiency drops with the increase in the number of processors, although the rate at which it drops differs between problem classes.

Figure 5.2
Predicted efficiency plot for all EP problem classes on [256-8192] processors

5.3.1.3 Fastest parallel execution time

Equation (5.3) can be used to find the fastest time to solve the EP problem. The minimum value of $T_{par}(n_t, n_p)$ occurs when its first derivative with respect to $n_p$ is zero.

$$\frac{d}{dn_p} T_{par}(n_t, n_p) = 0 \implies \tag{5.7}$$

$$483.96 \cdot n_p^3 - 11519.5 \cdot n_p^2 - n_t \cdot w_t = 0 \tag{5.8}$$

As an example, for problem class D, the closest integer solution to Equation (5.8) is $n_p = 2764$. For this value of $n_p$, the parallel runtime as given by Equation (5.3) is 3.188 seconds. The simulation yields a parallel runtime of 3.189 seconds for $n_p = 2768$, which is the closest representable configuration in our simulation setting. The second derivative of $T_{par}(n_t, n_p)$ is positive at $n_p = 2764$, which satisfies the second-order sufficient condition for a local minimum.
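The location of the minimum can also be checked numerically by evaluating Equation (5.3) over the candidate system sizes. The sketch below assumes, as an illustration only, that $R_0 = 1.72$ GFLOPS, $w_t = 147$ FLOPs, and $n_t = 2^{36}$ for problem class D; with the rounded constants used here the result is close to, but not exactly, the values reported above.

```python
R0 = 1.72e9   # assumed processor rating in FLOPS (Table 5.1)
W_T = 147.0   # assumed computational effort per task in FLOPs (Table 5.1)


def t_par(n_t, n_p):
    """Equation (5.3): DLT overhead (Equation (5.4)) plus homogeneous solve time."""
    overhead = (241.97 * n_p**2 - 11519.5 * n_p + 2104670) / R0
    solve = n_t * W_T / (n_p * R0)
    return overhead + solve


def fastest_system_size(n_t, candidates):
    """Return the candidate n_p with the smallest predicted parallel runtime."""
    return min(candidates, key=lambda n_p: t_par(n_t, n_p))


n_opt = fastest_system_size(2**36, range(256, 8193))
print(n_opt, round(t_par(2**36, n_opt), 3))
```

An exhaustive scan is used instead of solving the cubic because the search space is small and the scan makes no assumption about the shape of the runtime curve.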

Figure 5.3
Isoefficiency contours for all EP problem classes on [256-16384] processors

5.3.1.4 Isoefficiency analysis

Figure 5.3 shows four predicted isoefficiency contours for all the problem classes for a wide range of system sizes. The linear nature of the contour curves suggests that the parallel system is highly scalable. The sequential cost of solving the problem is $\Theta(n_t)$. From Equation (5.3), the parallel runtime is given by:

$$T_{par}(n_t, n_p) = \Theta\left(n_p^2 + \frac{n_t}{n_p}\right) \tag{5.9}$$

and its cost is $\Theta(n_p^3 + n_t)$. As long as $n_t = \Omega(n_p^3)$, the parallel cost is $\Theta(n_t)$, which is the same as the sequential cost, and the parallel system is cost-optimal. For example, the parallel system remains cost-optimal at an efficiency of 0.8 when the input size is in the range $[2^{28}, 2^{40}]$ and the system size is in the range $[345, 5500]$. The relationship between $n_t$ and $n_p$ is given by $n_t \approx 6.55 \cdot n_p^3$.
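The cubic relationship can be traced from Equations (5.3), (5.4), and (5.6). The following worked derivation is a sketch under the assumption of a homogeneous system in which every processor needs $w_t / R_0$ seconds per task; that identification is ours and is not stated explicitly above.

```latex
\begin{align*}
E &= \frac{T_{seq}}{n_p \, T_{par}(n_t, n_p)}
   = \frac{n_t w_t / R_0}
          {n_p \left( \dfrac{f_{EP}(n_p)}{R_0} + \dfrac{n_t w_t}{n_p R_0} \right)}
   = \frac{n_t w_t}{n_p \, f_{EP}(n_p) + n_t w_t}, \\[4pt]
n_t &= \frac{E}{1 - E} \cdot \frac{n_p \, f_{EP}(n_p)}{w_t}.
\end{align*}
```

With $E = 0.8$ and $w_t = 147$, the leading term is $4 \cdot 241.97 / 147 \cdot n_p^3 \approx 6.58 \cdot n_p^3$, which is close to the reported $n_t \approx 6.55 \cdot n_p^3$. The negative linear term of $f_{EP}$ slightly lowers the effective constant for the system sizes considered.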

5.3.2 IS benchmark

The IS benchmark, which is described in detail in Section 4.1.2, consists of three computation phases interleaved by three communication phases. The three communication phases include one one-to-all communication and two all-to-all communications. In the communication phases, the presence of multiple flows over a network link at the same time leads to congestion in that network link. If $\gamma_{ab}$ represents the congestion factor between the two end points of communication, namely, processors $p_a$ and $p_b$, then the following linear equations can be formulated when DLT is applied to schedule this benchmark.

$$\begin{aligned} T_i = {} & (L_{0i} + \gamma_{0i} \cdot \alpha_i \cdot T_{cm_{0i}}) + \alpha_i \cdot T_{cp_{ph1}} \\ & + \sum_{j=1, j \neq i}^{n_p-1} (L_{ij} + \gamma_{ij} \cdot M_{sz} \cdot T_{cm_{ij}}) + M \cdot T_{cp_{ph2}} \\ & + \sum_{j=1, j \neq i}^{n_p-1} \left(L_{ij} + \gamma_{ij} \cdot \frac{\alpha_i}{n_p} \cdot T_{cm_{ij}}\right) + \frac{n_t}{n_p} \cdot T_{cp_{ph3}}, \end{aligned} \tag{5.10}$$

$$\text{and} \quad \sum_{i=1}^{n_p-1} \alpha_i = n_t \tag{5.11}$$

Equations (5.10) and (5.11) have the same meaning as Equations (5.1) and (5.2), respectively. The IS benchmark assumes that all the integer keys to be sorted reside in the memory of a single processor (say $p_0$). In communication phase 1, there is a one-to-all communication and all the processors receive their share of data from processor $p_0$. In communication phase 2, there is an all-to-all communication where each processor distributes the count of the number of keys in each bucket to every other processor ($M_{sz}$ is the total amount of data transferred by each processor, which is $O(M)$ in size). In communication phase 3, there is an all-to-all communication where each processor distributes approximately $O(n_t/n_p^2)$ keys to every other processor. The congestion factor $\gamma_{ab}$ can be estimated from the number of flows over a link at a given time using the routing information. In a 3D torus network, messages are routed using the dimension order routing scheme, also known as the XYZ routing scheme. The presence of congestion leads to higher communication times.

The parallel runtime is the time required to solve the IS problem using divisible load scheduling and is given by:

$$T_{par}(n_t, n_p) = O_{IS}(n_p) + T_{solve}(n_t, n_p) \tag{5.12}$$

where $O_{IS}(n_p)$ is the overhead involved in obtaining the solution according to DLT. Using the same methodology as in the case of the EP benchmark, the number of FLOPs required to obtain $\alpha$ was estimated as:

$$\begin{aligned} f_{IS}(n_p) &= 22315.3 \cdot n_p^2 - 5314880 \cdot n_p + 506075000 \\ O_{IS}(n_p) &= \frac{f_{IS}(n_p)}{R_0} \end{aligned} \tag{5.13}$$

The overhead cost is included in all the analyses.

5.3.2.1 Analytical modeling

The time required to sort the integer keys in parallel can be modeled as follows. If $oneToAll(messageSize)$ and $allToAll(messageSize)$ represent the one-to-all and all-to-all communication costs, respectively, then the sorting time is given by:

$$\begin{aligned} T_{solve}(n_t, n_p) = {} & oneToAll(n_t) + \beta \cdot T_{cp_{a,ph1}} \\ & + allToAll(M_{sz}) + M \cdot T_{cp_{a,ph2}} \\ & + allToAll\left(\frac{\alpha}{n_p}\right) + \frac{n_t}{n_p} \cdot T_{cp_{a,ph3}}. \end{aligned} \tag{5.14}$$

In the one-to-all communication phase, the integer keys to be sorted are partitioned and distributed to all the processors. In a 3D torus network, each processor is connected to six other processors via six communication links. If all the links are utilized fully, then approximately $\frac{1}{6}$th of the data will traverse each communication link. Using this model, the one-to-all communication cost can be modeled as $n_t/(6 \cdot B)$, where $B$ is the bandwidth of a communication link. If $x$ represents the volume of data to be transferred in an all-to-all communication by a processor to every other processor, then the total volume of data transferred is given by $x \cdot n_p \cdot n_p$. The total bandwidth of all the links in a 3D torus network is given by $3 \cdot n_p \cdot B$, where $3 \cdot n_p$ represents the number of communication links in a 3D torus network. The communication time of an all-to-all communication is equal to $(C \cdot x \cdot n_p \cdot n_p)/(3 \cdot n_p \cdot B)$, where $C$ is some constant. From experiments, we found the value of the constant to be 1 in the case of small sized messages, and ≈ 32 for large sized messages. For this problem, $M_{sz}$ is considered a small sized message since its size is only 4096 bytes, and $\alpha/n_p$ is considered a large sized message. The constant factor $C$ can be thought of as a congestion related factor, and an intuitive explanation for its presence is the XYZ routing scheme. In the XYZ routing scheme, all the messages are routed along the x axis first, the y axis next, and then the z axis. A more load-balanced routing (or a better utilization of the network links) may be possible if the axis along which a message is routed is not kept fixed, and the network links along another axis are used instead in case of congestion. For estimating $\beta$, we consider that the tasks are distributed in proportion to the congestion factor, such that, if $\gamma_{mean}$ and $\gamma_{max}$ represent the mean and the maximum value of the congestion factor, then $\beta = (n_t \cdot \gamma_{mean})/(n_p \cdot \gamma_{max})$. The intuition behind this is that the higher the communication cost associated with a processor, the lower the load fraction of that processor. The ratio $\gamma_{mean}/\gamma_{max}$ is ≈ 0.75.
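The two communication cost expressions in this model can be written as short functions. The sketch below is illustrative only: the function and parameter names are ours, data volumes are assumed to be expressed in bytes, and the congestion constant is left as a parameter `c` rather than fixed to a measured value.

```python
def one_to_all_time(n_bytes, bandwidth):
    """One-to-all cost when the root spreads the data evenly over its six torus links."""
    return n_bytes / (6.0 * bandwidth)


def all_to_all_time(x_bytes, n_p, bandwidth, c=1.0):
    """All-to-all cost: total volume x * n_p * n_p over aggregate bandwidth 3 * n_p * B.

    c plays the role of the empirical congestion constant C in the text.
    """
    return (c * x_bytes * n_p * n_p) / (3.0 * n_p * bandwidth)


B = 1.5e9     # link bandwidth in bytes/second (1.5 GB/s, Table 5.1)
M_SZ = 4096   # size in bytes of the bucket-count message

for n_p in (64, 256, 1024):
    print(n_p, all_to_all_time(M_SZ, n_p, B))
```

Because the aggregate bandwidth grows only linearly with $n_p$ while the total volume grows quadratically, the modeled all-to-all time increases linearly with the number of processors for a fixed per-pair message size.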

5.3.2.2 Efficiency analysis

Figure 5.4 shows the predicted vs. the simulated efficiency of the parallel solution of IS problem class D. The simulated values are obtained via simulation with the load fractions obtained by solving Equations (5.10) and (5.11). The predicted values are obtained using the analytical model. Unlike the EP benchmark, where the maximum efficiency was ≈ 99% for both problem classes D and E on 256 processors, the maximum efficiency is only ≈ 16% for problem class D on 64 processors. This is due to the fact that the IS benchmark is not as computationally intensive as the EP benchmark, and hence increasing the number of processors to solve the problem does not result in improved performance.

Figure 5.4
Predicted vs. simulated efficiency of parallel solution of IS problem class D, on [64-1024] processors

Figure 5.5
Predicted efficiency plot for all IS problem classes on [64-1024] processors

Figure 5.5 shows the predicted efficiency for all the IS problem classes for a wide range of system sizes. The darker shade indicates a lower efficiency and the lighter shade indicates a higher efficiency. The maximum observed efficiency was ≈ 20% for problem classes A, B, C, and D, when $n_p = 64$ processors. With the increase in the system size, the efficiency drops precipitously even for the largest problem class D, suggesting that the problem size ($2^{31}$ integers) is not big enough to justify the increase in the system size.

5.3.2.3 Fastest parallel execution time

Equation (5.12) can be used to find the fastest time to solve the IS problem for a given problem size. Equation (5.14) is used for the $T_{solve}(n_t, n_p)$ term in Equation (5.12). The minimum value of $T_{par}(n_t, n_p)$ occurs when its first derivative with respect to $n_p$ is zero. For instance, for problem class D, $n_t = 2^{31}$. After substituting the known values and simplifying, we have:

$$\frac{d}{dn_p} T_{par}(n_t, n_p) = 0 \implies \tag{5.15}$$

$$n_p^3 - 119.05 \cdot n_p^2 - 3137492.3 = 0 \tag{5.16}$$

The closest integer solution to Equation (5.16) is $n_p = 198$. For this value of $n_p$, the parallel runtime as given by Equation (5.12) is 1.67 seconds. The simulation yields a parallel runtime of 1.80 seconds for $n_p = 192$, which is the closest representable configuration in our simulation setting. The second derivative of $T_{par}(n_t, n_p)$ is positive at $n_p = 198$, which satisfies the second-order sufficient condition for a local minimum.
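The real root of Equation (5.16) can be located with a simple bisection, as sketched below. The helper name `bisect_root` and the search bracket [64, 1024] (the system sizes studied for IS) are our choices for illustration.

```python
def g(n_p):
    """Left-hand side of Equation (5.16)."""
    return n_p**3 - 119.05 * n_p**2 - 3137492.3


def bisect_root(func, lo, hi, tol=1e-9):
    """Find a root of func in [lo, hi], assuming func(lo) and func(hi) differ in sign."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)


root = bisect_root(g, 64.0, 1024.0)
print(round(root, 2))
```

The root found this way lies between 198 and 199, in line with the system size of about 198 processors reported above.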

Figure 5.6
Isoefficiency contours for all IS problem classes on [64-1024] processors

5.3.2.4 Isoefficiency analysis

Figure 5.6 shows four predicted isoefficiency contours for all the IS problem classes for a wide range of system sizes. From Equation (5.12), the parallel runtime is given by:

$$T_{par}(n_t, n_p) = \Theta\left(n_p^2 + \frac{n_t}{n_p}\right) \tag{5.17}$$

and its cost is $\Theta(n_p^3 + n_t)$. As long as $n_t = \Omega(n_p^3)$, the parallel cost is $\Theta(n_t)$, which is the same as the sequential cost, and the parallel system is cost-optimal. For example, the parallel system remains cost-optimal at an efficiency of 7.5% when the input size $n_t$ is in the range $[2^{27}, 2^{31}]$ and the system size is in the range $[64, 192]$. The relationship between $n_t$ and $n_p$ is given by $n_t \approx C \cdot n_p^3$, where $C$ is a numerical constant. It must be noted that both the EP and the IS benchmarks have the same asymptotic growth as a function of the input size for cost-optimality. However, the efficiency of the parallel solution of the IS benchmark is lower than that of the EP benchmark due to its higher communication overhead.

5.4 DLS algorithms

Earlier studies demonstrating the effectiveness of DLS algorithms used fewer processors and smaller problem sizes. In this work, we study the scalability of the DLS algorithms and evaluate their performance for large scale problems and systems. Scalability results of non-adaptive DLS algorithms (algorithms that do not adapt to runtime load imbalances) have recently been reported [70]. In this work, we report on the scalability of both non-adaptive and adaptive DLS algorithms at a larger scale. The platform used in the simulations is a cluster of homogeneous workstations, each capable of delivering $10^9$ FLOPS. The network link latency and bandwidth are $2 \cdot 10^{-6}$ seconds and $2 \cdot 10^9$ bytes/second, respectively.
Experiments were designed to study the scalability of the DLS algorithms in addressing
various sources of load imbalances, such as, algorithmic and systemic variances. Many
possibilities exist for combining the different types of variances. For instance, the ordered
pair (algorithmic variance, systemic variance) can be (Gaussian, exponential) where the
algorithmic variance follows a Gaussian distribution, and the systemic variance follows an
exponential distribution, or (constant, exponential) in which case, there is no algorithmic
variance but the systemic variance follows an exponential distribution.
Systemic variances can be modeled in SimGrid by varying the availability of each processor in the system. The availability of a processor is the fraction of its computing power available towards processing a task. It is represented by a real-valued quantity in the range [0, 1], where 0 denotes that a processor is completely unavailable (0% available) to process a task, and 1 denotes that a processor is completely available (100% available) to process a task. In the constant-availability model, the availability of each processor does not vary with time and remains constant during the course of the simulation. In the varying-availability model, the availability of the processors varies over time. In the absence of systemic variance, all the processors are completely available towards processing their tasks. Figure 5.7 and Figure 5.8 are examples of a constant-availability model, where the availability of the processors follows a uniform and an exponential distribution, respectively. The parameters of the uniform distribution are [0.1, 1.0]. Exponential values are generated using 4 different exponential generators with mean values [0.3, 0.7, 0.45, 0.25]. Table 5.2 shows the number of processors available (as a fraction of the total number of processors) for every 10% increase in processor availability starting from 10%. As the numbers indicate, the exponential distribution creates more load imbalance in the system.

Table 5.2
Processor availability in numbers

Availability range    Fraction of processors
                      uniform    exponential
0.1-0.2               0.116      0.413
0.2-0.3               0.110      0.126
0.3-0.4               0.111      0.098
0.4-0.5               0.115      0.077
0.5-0.6               0.111      0.056
0.6-0.7               0.106      0.044
0.7-0.8               0.114      0.139
0.8-0.9               0.109      0.026
0.9-1.0               0.105      0.017
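A constant-availability profile of this kind can be sketched as follows. This is not the generator used in the simulations: how the four exponential generators are assigned to processors and how values are kept within [0.1, 1.0] is not specified above, so the round-robin assignment of means and the rejection of out-of-range samples are assumptions made purely for illustration.

```python
import random


def sample_availability(rng, mean=None, lo=0.1, hi=1.0):
    """Draw one availability in [lo, hi]: uniform if mean is None, else exponential."""
    if mean is None:
        return rng.uniform(lo, hi)
    while True:  # assumed bounding strategy: reject samples outside [lo, hi]
        value = rng.expovariate(1.0 / mean)
        if lo <= value <= hi:
            return value


def histogram(values, lo=0.1, hi=1.0, n_bins=9):
    """Fraction of values falling in each of n_bins equal-width availability ranges."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


rng = random.Random(0)
means = [0.3, 0.7, 0.45, 0.25]  # means of the four exponential generators
n_proc = 4096
uniform = [sample_availability(rng) for _ in range(n_proc)]
exponential = [sample_availability(rng, means[k % 4]) for k in range(n_proc)]

print([round(f, 3) for f in histogram(uniform)])
print([round(f, 3) for f in histogram(exponential)])
```

The two printed rows play the role of the uniform and exponential columns of Table 5.2; the exact fractions depend on the assumptions listed above and on the random seed.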

Figure 5.7
Constant processor availability - uniform distribution

Figure 5.8
Constant processor availability - exponential distribution

The variation in the processor availability over time is shown in Figure 5.9 and Figure 5.10. In both figures, the y-axes represent one period, where a period is the amount of time after which the variation in processor availability repeats itself. In the uniform variation model, the availability of each processor changes every 20 seconds, and after 100 seconds, the variation pattern is repeated. In the exponential variation model, the availability of each processor changes at a different rate, and each processor has a different period. In this case, the y-axis represents the largest period of a processor in the system, and it is possible that, within this time, other processors in the system undergo variations in their availability following multiple different periods. The average availability of each processor was chosen according to an exponential distribution of mean 0.6, bounded within the interval [0.1, 1.0], and is used in generating the variation in processor availability over time. The exponential variation model has more load imbalance than the uniform variation model. In both availability models, the 4k and 8k processors are subsets of the 16k processors, and hence their systemic variance is a subset of that of the 16k processors.
Figure 5.9
Variable processor availability - uniform distribution

Figure 5.10
Variable processor availability - exponential distribution

Algorithmic variance translates into a variable amount of computational effort required to process each loop iteration and can be modeled using a probability distribution. Only two models exist: a constant execution time model, where each loop iteration requires the same computational effort, and a variable execution time model, where each loop iteration requires a different amount of computational effort. However, several possibilities exist for varying the iteration execution time, and two cases were considered: a Gaussian distribution with a mean of $2 \times 10^9$ and a standard deviation of $6 \times 10^8$, and an exponential distribution with a mean of $3 \times 10^9$. The mean values represent (in FLOPs) the average amount of computational effort required to process a loop iteration.
In general, STATIC is known to incur the least scheduling cost and in the absence of
runtime load imbalance is expected to perform marginally better than other DLS algorithms. Similar performance of non-adaptive algorithms (FSC, GSS, FAC) in the absence
of load imbalance was also observed in [70] for fewer processors. As soon as there is some
load imbalance in the system, the DLS algorithms outperform STATIC. GSS allocates
large chunks in the beginning, and if those chunks are time consuming, the performance of
GSS is only comparable to STATIC. This highlights the disadvantage of using decreasing
size chunks with time consuming iterations near the beginning of the loop. FSC allocates
chunks of equal size, and is a compromise between STATIC and self-scheduling, which
allocates chunks of unit size. If an optimal fxed chunk size can be found, the performance
of FSC can be on par with the adaptive DLS algorithms. Intuitively, the fxed size chunk
must be small enough to not overburden any processor(s), but big enough to not suffer
from large communication overhead that affects self-scheduling. The initial chunk size of
FAC is half the size of the largest GSS chunk and hence does not suffer as much as GSS
when more time consuming loop iterations are at the beginning of the loop. Hence, FAC
performs much better than GSS. A similar observation was also reported in [70] for fewer
processors.
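The difference between the chunk sequences of GSS and FAC can be illustrated with a short sketch. The GSS rule below assigns a chunk equal to the remaining iterations divided by the number of processors. For FAC, the sketch swaps in the commonly used practical rule that schedules half of the remaining iterations in each batch of equal-sized chunks; the general FAC rule, which uses the mean and standard deviation of the iteration execution times, is not reproduced here. The function names are ours.

```python
import math


def gss_chunks(n_iters, n_p):
    """Guided self-scheduling: each chunk is ceil(remaining / n_p)."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        size = math.ceil(remaining / n_p)
        chunks.append(size)
        remaining -= size
    return chunks


def fac2_chunks(n_iters, n_p):
    """Factoring, half-the-remaining-work rule: batches of n_p equal-sized chunks."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        size = math.ceil(remaining / (2 * n_p))
        for _ in range(n_p):
            if remaining == 0:
                break
            take = min(size, remaining)
            chunks.append(take)
            remaining -= take
    return chunks


print(gss_chunks(1000, 4)[:5])
print(fac2_chunks(1000, 4)[:5])
```

For 1000 iterations on 4 processors, the first GSS chunk holds 250 iterations while the first FAC chunks hold 125 each, which is why expensive iterations at the start of the loop penalize GSS more.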


Figure 5.11
Performance of the DLS algorithms with constant iterations execution times and constant
processor availability (uniform distribution)

Figure 5.12
Performance of the DLS algorithms with constant iterations execution times and variable
processor availability (uniform distribution)


Figure 5.13
Performance of the DLS algorithms with Gaussian iterations execution times and constant
processor availability (uniform distribution)

Figure 5.14
Performance of the DLS algorithms with Gaussian iterations execution times and variable
processor availability (uniform distribution)


The variants of FAC (WF, AWF-B, AWF-C, and AF) have been designed to address heterogeneity, and hence, in general, perform better than FAC. When the processor availability varies during the execution of the loop iterations, the adaptive variants (AWF-B and AWF-C) outperform WF (Figure 5.12 and Figure 5.14). In the absence of such variation, the performance of WF, AWF-B, AWF-C, and AF is comparable (Figure 5.11 and Figure 5.13). FAC assumes that the mean and standard deviation of the iteration execution times are known a priori, and are the same on all the processors. This assumption is relaxed in AF, which dynamically estimates these statistics, and hence, in general, AF is expected to perform better than FAC, AWF-B, and AWF-C. However, Figure 5.12 and Figure 5.14 show the opposite. This may be attributed to the absence of a large degree of load imbalance during runtime. Even though the availability of the processors varies, it does so in a uniform fashion, and hence there is no rapid load variation in the system. To verify this hypothesis, both algorithmic and systemic variances were varied in an exponential fashion (a non-uniform variation). When the load variation is increased, an equilibrium point is reached, where AF, AWF-B, and AWF-C have similar performance (Figure 5.15). On further increase in load variation, AF starts to perform better than AWF-B and AWF-C (Figure 5.16). For the highest unpredictable variation in the system or the application, AF outperforms AWF-B and AWF-C (Figure 5.17). For these cases, the performance of FAC is shown as a basis for comparison. The best performance improvement over FAC (≈50%) was achieved by AF for 8192 processors (Figure 5.17).


Figure 5.15
Performance of the factoring based DLS algorithms with exponential iterations execution
times and constant processor availability (exponential distribution)

Figure 5.16
Performance of the factoring based DLS algorithms with exponential iterations execution
times and variable processor availability (exponential distribution)


Figure 5.17
Performance of the factoring based DLS algorithms with exponential iterations execution
times and variable processor availability (exponential distribution)

5.5 Conclusions

In this chapter, a scalability analysis of the DLT and the DLS algorithms was presented. The DLT algorithms were employed to schedule two NAS benchmarks, namely, the EP and the IS benchmarks. The deterministic nature of the DLT was leveraged for performance prediction, and two questions related to the scalability analysis of the two NAS benchmarks were addressed: (1) What is the fastest time to execute the application? and (2) How should the application size be scaled in relation to the system size in order to maintain a constant efficiency? For larger system sizes, the overhead associated with DLT, namely, the time required to identify the load fractions, can be high in comparison to the time required to actually solve the problem.

The effectiveness of DLS algorithms in addressing the different sources of load imbalance, such as the ones generated by algorithmic and systemic variance, has been demonstrated in prior works, albeit on a smaller scale. The goal of this work was to study the scalability of DLS algorithms in the context of larger scale problems and systems. A scalability analysis was performed on 4096, 8192, and 16384 processors with problem sizes of 16 million, 64 million, and 256 million tasks. The adaptive algorithms are expected to perform better than the non-adaptive algorithms, and this is substantiated by the simulation results. Among the adaptive algorithms, when there is more unpredictable variation in the application or in the system, the AF technique, which has a stronger theoretical foundation, performed better than other adaptive variants, such as AWF-B and AWF-C. Another interesting observation was that the performance of FSC, which is a non-adaptive technique, can be on par with the adaptive algorithms when an optimal fixed chunk size can be found.


CHAPTER 6
ROBUSTNESS ANALYSIS OF DLT ALGORITHMS

High performance parallel and distributed computing systems that compute arbitrarily
divisible workloads operate in an environment characterized by unpredictable variations
(or perturbations) in system load, unexpected resource failures, and others. In this chapter, we address the problem of predicting and evaluating the robustness of divisible load
scheduling with respect to perturbations in various performance impacting factors.

6.1 Why Robustness?

The motivation for the robustness study stems from the fact that high performance parallel and distributed computing systems may operate in an environment characterized by unpredictable variations (or perturbations) in system load, network link latency and bandwidth, unexpected resource failures, and others. Such perturbations can degrade certain system performance features [40]. Traditionally used performance metrics such as execution time, cost, speedup, and efficiency can provide valuable insights into the workings of an algorithm. In this chapter, we present the robustness study of DLT algorithms, for which the traditional performance metrics are, however, not suitable and which requires a different type of metric. Several definitions of robustness exist. We understand a robust system as a system capable of delivering a certain level of performance despite the fluctuations in the operating environment [47].

6.2 FePIA procedure

In this section, we present an overview of the FePIA procedure proposed in [8]. The FePIA procedure is named after the four steps that comprise it and is a commonly used procedure for evaluating the robustness of scheduling algorithms. The performance feature and perturbation parameters considered in analyzing the robustness of the DLT algorithms are also discussed. The relevant notation is given in Table 6.1.

The FePIA procedure comprises the following steps:

1. Identify the performance features that make the system robust. The performance features are aspects of interest to the scheduling algorithm designer, system user, and others. The performance features are impacted by any variation in the runtime characteristics of the operating environment. They are represented by the set $\Phi$, and each element $\phi_i \in \Phi$ denotes a performance feature. The tolerable bounds of a performance feature $\phi_i \in \Phi$ are defined by the 2-tuple $\langle \beta_{\phi_i}^{min}, \beta_{\phi_i}^{max} \rangle$.

2. Identify all the parameters whose perturbations affect the performance features identified in the previous step. These parameters are called perturbation parameters and constitute the elements of the set $\Pi$. Each element $\pi_j \in \Pi$ denotes a perturbation parameter.

3. Identify the impact of the perturbation parameters on the performance features. Formally, in this step, $\forall \phi_i \in \Phi$, the relation $\phi_i = f_{ij}(\pi_j)$ is identified, if such a relation exists.

4. Analyze the impact of the perturbation parameters on the performance features to determine the robustness of the system. The boundary values of $\pi_j$ that satisfy the relations $\beta_{\phi_i}^{min} = f_{ij}(\pi_j)$ and $\beta_{\phi_i}^{max} = f_{ij}(\pi_j)$ for every $\phi_i \in \Phi$ and $\pi_j \in \Pi$ must be determined in order to analyze this impact.
If $\pi_j^{orig}$ represents the original value of the perturbation parameter $\pi_j$ at which the system was assumed to operate at its best, then the robustness radius of a workload scheduling scheme $\mu$, with respect to the performance feature $\phi_i$ against the perturbation parameter $\pi_j$, is defined as the smallest deviation in $\pi_j$ that would cause $\phi_i$ to exceed the tolerable bounds $\langle \beta_{\phi_i}^{min}, \beta_{\phi_i}^{max} \rangle$. Mathematically, the robustness radius can be expressed as:

$$r_\mu(\phi_i, \pi_j) = \min_{\pi_j : (f_{ij}(\pi_j) = \beta_{\phi_i}^{max}) \vee (f_{ij}(\pi_j) = \beta_{\phi_i}^{min})} \|\pi_j - \pi_j^{orig}\|_1. \tag{6.1}$$

The robustness metric of a workload scheduling scheme $\mu$, with respect to the performance feature set $\Phi$ against the perturbation parameter $\pi_j$, is defined as the minimum of all robustness radii. Mathematically, the robustness metric can be expressed as:

$$\rho_\mu(\Phi, \pi_j) = \min_{\phi_i \in \Phi} \left( r_\mu(\phi_i, \pi_j) \right). \tag{6.2}$$

6.2.1 Performance feature and perturbation parameters

Table 6.1: Glossary of robustness notation

Notation                Explanation
Φ                       set of performance features
φi                      a performance feature, φi ∈ Φ
<β^min_φi, β^max_φi>    tolerable bounds of φi
Π                       set of perturbation parameters
πj                      a perturbation parameter, πj ∈ Π
µ                       workload scheduling scheme
rµ(φi, πj)              robustness radius of µ
ρµ(Φ, πj)               robustness metric of µ
γπj                     ratio of φi{πj} to φi

A DLT algorithm identifies a schedule based on the optimality principle, such that the cost of the schedule (the time required to solve the problem under consideration) is minimal. In this work, we study the robustness of the DLT solution and therefore, an intuitive choice for a performance feature is the overall execution time of the parallel application, denoted as Tpar. In general, Tpar is affected by perturbations in processor availability and network link characteristics. In this work, we consider the case where the perturbation parameters vary individually as well as in combination during the execution of the application. Therefore, the set of perturbation parameters includes processor availability, network link characteristics, and their combinations. Combinations of perturbation parameters are considered by concatenating them into a single parameter, which is then used in the analysis as discussed in Section 6.2. According to step 3 of the FEPIA procedure, for every perturbation parameter πj there is a functional relation between πj and Tpar, such that, ∀ πj ∈ Π, Tpar = f(πj). The functional relation between πj and Tpar is problem-specific and will be discussed in detail in Section 6.4.
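
The definitions in Equations (6.1) and (6.2) can be illustrated with a short sketch for a single scalar perturbation parameter. This is only an illustration of the definitions, not the procedure used in this work: the function names, the grid of candidate values, and the example feature (a slowdown of 1/x for an availability x, with tolerable bounds <1.0, 1.5>) are assumptions made for the example, and the boundary of Equation (6.1) is approximated by the first grid point at which the feature leaves the tolerable interval.

```python
import math
from typing import Callable, Iterable, Mapping, Tuple

Feature = Callable[[float], float]  # phi_i = f_ij(pi_j) for one scalar parameter
Bounds = Tuple[float, float]        # (beta_min, beta_max)


def robustness_radius(f: Feature, pi_orig: float, bounds: Bounds,
                      candidates: Iterable[float]) -> float:
    """Grid approximation of Equation (6.1): the smallest |pi - pi_orig| among
    the candidate values at which f leaves the tolerable interval."""
    beta_min, beta_max = bounds
    violating = [abs(pi - pi_orig) for pi in candidates
                 if not (beta_min <= f(pi) <= beta_max)]
    return min(violating, default=math.inf)


def robustness_metric(features: Mapping[str, Tuple[Feature, Bounds]],
                      pi_orig: float, candidates: Iterable[float]) -> float:
    """Equation (6.2): the minimum robustness radius over all features."""
    grid = list(candidates)
    return min(robustness_radius(f, pi_orig, b, grid)
               for f, b in features.values())


# Hypothetical example: availability x drops from 1.0 to 0.5 in steps of 0.001,
# and the performance feature degrades as 1/x.
grid = [1.0 - k / 1000 for k in range(501)]
radius = robustness_radius(lambda x: 1.0 / x, 1.0, (1.0, 1.5), grid)
metric = robustness_metric({"T_par": (lambda x: 1.0 / x, (1.0, 1.5))}, 1.0, grid)
print(radius, metric)
```

With a single performance feature, the metric coincides with the radius, which is the situation encountered later in this chapter where Φ = {Tpar}.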


6.3 Modeling perturbations

In Section 6.2.1 we identified the processor availability (R) and network link characteristics (B) as perturbation parameters. A methodology is needed to generate runtime variations in the perturbation parameters, and there are several approaches for injecting variations. In this work, we generate variations using trigonometric functions, such as sin() and cos(), for the following reasons: (1) the results of the trigonometric functions are naturally bounded within the continuous interval [-1, 1]; by controlling the input value x, the results can be restricted to the interval [0, 1], which allows the variation in a perturbation parameter to be expressed as a fraction (percentage) of the maximum value of that parameter, and (2) the continuous nature of the trigonometric functions allows as many samples as needed to be drawn from their output.
Figure 6.1 shows the results of the trigonometric expression sin(x²) · cos(y²) for x ∈ [0.8, 1] and y ∈ [0.15, 0.25] as a color coded heat map with values ranging as indicated by the heat bar on the right hand side of the figure. The (x, y) coordinates correspond to the (res_id, t) coordinates, respectively, where res_id denotes the system resource (processor or link) where perturbation πj ∈ Π occurs, and t represents the time instant when the perturbation is observed. Since the sin() and cos() functions are continuous in nature, the continuous intervals that contain (x, y) are discretized into finite intervals for (res_id, t) that express the quantities of interest to us, which in fact represent a subset of the continuous values produced by the trigonometric expression. As an illustrative example, in the case of perturbation in processor availabilities R, the continuous range [0.8, 1.0] of the x-axis is discretized (or sampled) into the discrete range [1, n], where n is the number of processors.

Figure 6.1: Left skewed variation generated with sin(x²) · cos(y²)

Similarly, the continuous range [0.15, 0.25] of the y-axis is discretized into a discrete time interval [0, time] corresponding to the time interval during which the perturbation is observed. Each of the output values of sin(res_id²) · cos(t²), for any (res_id, t) point in the discretized domain [1, n] × [0, time], is used to express the processor availability R for the processor indicated by res_id at the particular point in time indicated by t as a fraction of its maximum (100%) availability.
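
The sampling described above can be sketched as follows. The linear mapping from (res_id, t) onto the continuous ranges is an assumption made for this example (the text does not prescribe a particular sampling rule), and the function name and the 8-processor system are hypothetical.

```python
import math


def left_skewed_availability(res_id: int, t: float, n: int, time: float) -> float:
    """Availability (fraction of nominal) of resource res_id in [1, n] at
    instant t in [0, time] for the left skewed pattern sin(x^2) * cos(y^2).

    Assumes a linear mapping of [1, n] onto [0.8, 1.0] and of [0, time]
    onto [0.15, 0.25].
    """
    if n < 2 or time <= 0:
        raise ValueError("need at least two resources and a positive time span")
    x = 0.8 + (res_id - 1) / (n - 1) * 0.2
    y = 0.15 + t / time * 0.1
    return math.sin(x * x) * math.cos(y * y)


# Availability of every processor of a hypothetical 8-processor system at t = 0.
snapshot = [left_skewed_availability(p, 0.0, n=8, time=100.0) for p in range(1, 9)]
print([round(v, 3) for v in snapshot])
```

Under this mapping the lowest availability (the strongest perturbation) falls on the resource with the smallest identifier, and the availability grows with res_id.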
The average value of the expression f(x, y) = sin(x²) · cos(y²) is given by:

\[ \mathrm{avg}(f(x, y)) = \frac{1}{A(S)} \iint_S \sin(x^2) \cdot \cos(y^2) \, dA, \tag{6.3} \]

where S is the rectangle of dimensions 0.2 (x ∈ [0.8, 1.0]) and 0.1 (y ∈ [0.15, 0.25]) of the left skewed variation, and A(S) represents the area of the region S. The resulting value, avg(f(x, y)) = 0.722, represents the average value (as a fraction of the maximum) of the perturbation parameter πj at any point in time within [0, time] seconds during the simulation. For example, assuming the perturbations in processor availabilities follow the left skewed pattern, the average availability of any processor is 72.2%, signifying that only 72.2% of its nominal processing power is available for processing a task. Variations in the processor availability are considered to simulate conditions such as OS jitter, sharing of resources, and others, which can have an impact on the processor’s computing power.
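
The value in Equation (6.3) can be checked numerically. The sketch below uses a plain midpoint rule (a choice made for this example, not necessarily how the value was originally computed) and also evaluates the right skewed expression cos(x²) · cos(y²) on the same rectangle.

```python
import math


def mean_over_rectangle(f, x_range, y_range, steps=200):
    """Midpoint-rule estimate of (1 / A(S)) * integral of f over the rectangle S."""
    (x0, x1), (y0, y1) = x_range, y_range
    hx, hy = (x1 - x0) / steps, (y1 - y0) / steps
    total = 0.0
    for i in range(steps):
        x = x0 + (i + 0.5) * hx
        for j in range(steps):
            y = y0 + (j + 0.5) * hy
            total += f(x, y)
    return total / (steps * steps)


left_mean = mean_over_rectangle(
    lambda x, y: math.sin(x * x) * math.cos(y * y), (0.8, 1.0), (0.15, 0.25))
right_mean = mean_over_rectangle(
    lambda x, y: math.cos(x * x) * math.cos(y * y), (0.8, 1.0), (0.15, 0.25))
print(round(left_mean, 4), round(right_mean, 4))
```

Both estimates agree with the reported means of the left and right skewed patterns (0.722 and 0.682) to within about 0.001.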

Figure 6.2: Right skewed variation generated with cos(x²) · cos(y²)

The extrema of all average values of the expression sin(x²) · cos(y²) are given by:

\[ \frac{z}{b - a} \int_a^b \cos(y^2) \, dy, \tag{6.4} \]

where [a, b] denotes the range [0.15, 0.25] of the y-axis, and z is the extremum of the function sin(x²) for x ∈ [0.8, 1.0]. Since sin(x²) is strictly increasing within that interval, the minimum and the maximum values of sin(x²) are given by evaluating sin(x²) at x = 0.8 and x = 1.0, respectively. Hence, the extrema of all the average values of the function sin(x²) · cos(y²) on the domain [0.8, 1.0] × [0.15, 0.25] are 0.596 (minimum) and 0.840 (maximum), respectively. These values represent the bounds on the average value of a perturbation parameter following the left skewed pattern.
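
As a worked check of Equation (6.4), the three ingredients can be evaluated separately (values rounded to four decimals; the numerical evaluation is ours and is given only as a sanity check):

\[ \frac{1}{0.25 - 0.15} \int_{0.15}^{0.25} \cos(y^2) \, dy \approx 0.9991, \qquad \sin(0.8^2) \approx 0.5972, \qquad \sin(1.0^2) \approx 0.8415, \]

\[ 0.5972 \times 0.9991 \approx 0.5967, \qquad 0.8415 \times 0.9991 \approx 0.8407. \]

These products agree with the reported extrema of 0.596 and 0.840 to within 0.001.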

Figure 6.3: Diagonal skewed variation generated with sin(x²) · sin(y²)

Figure 6.4: Non skewed variation generated with sin(x²) · cos(y²) and cos(x²) · cos(y²)

6.4 Robustness prediction

In the traditional divisible load scheduling model, it is assumed that the workload for processing resides at a single source and is partitioned according to the optimality principle and distributed to all available processing resources. This model of divisible load scheduling is commonly referred to as single-source scheduling. In this work, we study the robustness of single-source scheduling. The single source that contains the workload is referred to as the master processor (p0) and the rest of the processors involved in the processing of the workload are called the slave processors. The master processor is also involved in processing the workload. Using a linear cost model for computation and an affine cost model for communication, the execution time of a processor pi when the workload is partitioned by applying the divisible load theory is given by:

\[ T_i = \begin{cases} \alpha_0 \cdot T^{cp}_0 & \text{if } i = 0 \\ \sum\limits_{\forall j \text{ in } r(p_0, p_i)} (L_j + \alpha_i \cdot T^{cm}_j) + \alpha_i \cdot T^{cp}_i & \text{otherwise} \end{cases} \tag{6.5} \]

\[ \sum_{i=0}^{n_p - 1} \alpha_i = n_t, \tag{6.6} \]

Table 6.2: Variations: bounds, average values, and extrema

Skewness      Bounds for x and y axes                  Mean     Extrema
left          [0.80 ≤ x ≤ 1.0], [0.15 ≤ y ≤ 0.25]      0.722    0.596, 0.840
right         [0.80 ≤ x ≤ 1.0], [0.15 ≤ y ≤ 0.25]      0.682    0.539, 0.801
diagonal      [0.90 ≤ x ≤ 1.0], [0.90 ≤ y ≤ 1.00]      0.614    0.567, 0.659
non skewed    [0.80 ≤ x ≤ 1.0], [0.15 ≤ y ≤ 0.25]      0.748    0.688, 0.801

where r(psrc, pdst) is an ordered list of links that represents the physical route from processor psrc to processor pdst. It is possible to use different cost models for computation and communication. The linear and the affine cost models for computation and communication, respectively, are, however, the most widely used models in DLT. The glossary of notation is provided in Table 2.1 and Table 6.3.

Table 6.3: Glossary of DLT notation for 3D torus topology

Notation    Explanation
h           distance between two processors in hops
hmax        maximum number of hops
ph          a processor that is h hops away from processor p0
nh          number of processors that are h hops away from processor p0
H           set of valid hops for a given topology
αh          load fraction of a processor that is h hops away from processor p0
Th          execution time of a processor that is h hops away from processor p0
α           load distribution: an (np) ordered tuple (α0, α1, · · · , αnp−1)

The simulated platform in this work models a homogeneous 3D torus topology. The homogeneous nature of the platform implies that the processors and the communication links are indistinguishable with regard to processing and transferring a task, respectively. Hence, from hereon, the notations R, $T^{cp}$, L, B, and $T^{cm}$ are used to represent the processor rating, the processing time of a task on a processor, the latency of a link, the bandwidth of a link, and the transfer time of a task over a link, respectively. The platform homogeneity in the absence of congestion in the communication links leads to the following theorem.

6.4.1 Theorem 1

Two processors pa and pb that are each h hops away from the master processor p0 will
receive an equal fraction of the load for processing.
Proof: The execution times of the two processors pa and pb are given by:

\[ T_a = h \cdot (L + \alpha_a \cdot T^{cm}) + \alpha_a \cdot T^{cp}, \text{ and} \tag{6.7} \]

\[ T_b = h \cdot (L + \alpha_b \cdot T^{cm}) + \alpha_b \cdot T^{cp}. \tag{6.8} \]

According to the optimality principle of DLT, given an optimal load distribution, all processors must finish computing at the same time. This implies Ta = Tb. If αa ≠ αb, it would imply that processors pa and pb spend different amounts of time in receiving and processing the workload and yet have the same execution time. However, since the platform topology is homogeneous and the processors are equidistant from the master processor p0, it is not possible that they spend different amounts of time in communication and computation and that Ta = Tb still holds. Hence, it must be the case that αa = αb.
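
The same conclusion follows from a one-line calculation, writing $T^{cm}$ and $T^{cp}$ for the per-task transfer and processing times used in Equations (6.7) and (6.8). Subtracting the two execution times gives

\[ T_a - T_b = (\alpha_a - \alpha_b) \cdot (h \cdot T^{cm} + T^{cp}) = 0, \]

and since $h \cdot T^{cm} + T^{cp} > 0$, the only solution is $\alpha_a = \alpha_b$.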

Applying Theorem 1, Equation (6.6) can be rewritten as:

\[ \alpha_0 + \sum_{h=1}^{|H|} n_h \cdot \alpha_h = n_t. \tag{6.9} \]

Equation (6.9) expresses the same constraint as Equation (6.6) but has far fewer summation terms. The size of the set H is platform dependent. For a 3D torus platform, |H| is given by (nx + ny + nz)/2, where nx, ny, and nz represent the number of processors along the x, y, and z axes, respectively.

6.4.2 Theorem 2

Let {ha , hb } ⊂ H such that ha < hb . Let pa and pb be two processors in the system
that are ha and hb hops away from the master processor p0 , respectively. If αa and αb
represent the load fractions of processors pa and pb , respectively, then αb < αa .
Proof: The execution times of the two processors pa and pb are given by:

\[ T_a = h_a \cdot (L + \alpha_a \cdot T^{cm}) + \alpha_a \cdot T^{cp}, \text{ and} \tag{6.10} \]

\[ T_b = h_b \cdot (L + \alpha_b \cdot T^{cm}) + \alpha_b \cdot T^{cp}. \tag{6.11} \]

According to the optimality principle of DLT, given an optimal load distribution, all processors must finish computing at the same time. This implies Ta = Tb. If αb ≥ αa, then Tb > Ta, and the optimality principle is violated. Any partition of the workload that violates the optimality principle is not in accordance with the divisible load theory. Therefore, if the load allocation is optimal, then αb < αa. Thus, the sequence αa, αb is strictly decreasing.
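
The inequality can also be read off directly by setting $T_a = T_b$ in Equations (6.10) and (6.11) and solving for $\alpha_b$. The sketch assumes $\alpha_a > 0$, $L \geq 0$, and $T^{cm} \geq 0$, with $L$ and $T^{cm}$ not both zero:

\[ \alpha_b = \frac{\alpha_a \cdot (h_a \cdot T^{cm} + T^{cp}) - (h_b - h_a) \cdot L}{h_b \cdot T^{cm} + T^{cp}}. \]

Because $h_b > h_a$, the numerator is at most $\alpha_a \cdot (h_a \cdot T^{cm} + T^{cp})$ and the denominator is at least $h_a \cdot T^{cm} + T^{cp}$, with at least one of the two comparisons strict under the stated assumptions. Hence $\alpha_b < \alpha_a$.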

Table 6.4: Platform characteristics

Attribute                             Value
number of processors                  4,096
number of communication links         12,288
processor power rating (GFLOPS)       2.0
link latency L (µseconds) [6]         8.0
link bandwidth B (GB/s) [6]           1.5
topology                              3D torus
dimensions (nx, ny, nz)               16, 16, 16

According to Theorem 2, the farther a processor is from the master processor p0, the smaller the load fraction it receives for processing. Hence, the master processor p0 receives the largest load fraction and spends more time in computation than any other processor in the system. A processor phmax that is hmax hops away from the master processor p0 receives the smallest load fraction and spends more time in communication than any other processor in the system.
Equation (6.9) contains 1 + |H| unknowns but can be further simplified by applying the optimality principle. Since every processor finishes at the same time as p0, that is, $h \cdot (L + \alpha_h \cdot T^{cm}) + \alpha_h \cdot T^{cp} = \alpha_0 \cdot T^{cp}$ for every h ∈ H, the relationship between αh and α0 is given by:

\[ \alpha_h = \frac{\alpha_0 \cdot T^{cp}}{h \cdot T^{cm} + T^{cp}} - \frac{h \cdot L}{h \cdot T^{cm} + T^{cp}}. \tag{6.12} \]

Following the substitution of αh, Equation (6.9) can be rewritten as:

\[ \alpha_0 + \sum_{h=1}^{|H|} n_h \cdot \left( \frac{\alpha_0 \cdot T^{cp}}{h \cdot T^{cm} + T^{cp}} - \frac{h \cdot L}{h \cdot T^{cm} + T^{cp}} \right) = n_t. \tag{6.13} \]

Equation (6.13) can be solved for the unknown quantity α0 since the rest of the quantities L, $T^{cm}$, $T^{cp}$, nh, and nt are known. The predicted values of α0, αhmax, and Tpar for all application types are shown in Table 6.5. The predicted value of αhmax is obtained using Equation (6.12). Since the load is partitioned based on the optimality principle, all the processors will finish computing at the same time instant. Hence, the parallel execution time Tpar is equal to the execution time of any processor in the system and is computed as $\alpha_0 \cdot T^{cp}$, which is the execution time of processor p0. For theoretical analysis, we use the load fractions as floating point values, whereas, in simulation, these values are rounded to integral values since a task is the smallest unit of execution and cannot be further subdivided. The simulated platform characteristics are given in Table 6.4.
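
Because Equation (6.13) is linear in α0, it can be solved in closed form. The sketch below is a minimal illustration rather than the implementation used in this work: the function names are hypothetical, the hop counts nh are computed under the assumption of shortest-path distances on a torus with wrap-around links, and the example treats the pure computation application as having a zero transfer time and infers the per-task processing time from Table 6.5 as Tpar / α0.

```python
def ring_hop_counts(n):
    """Number of nodes at each hop distance from a fixed node on a ring of n nodes."""
    counts = [0] * (n // 2 + 1)
    for k in range(n):
        counts[min(k, n - k)] += 1
    return counts


def torus_hop_counts(nx, ny, nz):
    """n_h for a 3D torus, assuming shortest-path hop distances (index h = 0 is p0)."""
    counts = [1]
    for n in (nx, ny, nz):
        ring = ring_hop_counts(n)
        merged = [0] * (len(counts) + len(ring) - 1)
        for a, ca in enumerate(counts):
            for b, cb in enumerate(ring):
                merged[a + b] += ca * cb
        counts = merged
    return counts


def solve_alpha0(n_h, n_t, L, T_cm, T_cp):
    """Solve Equation (6.13) for alpha_0 by collecting the terms in alpha_0."""
    coeff, const = 1.0, 0.0
    for h in range(1, len(n_h)):
        denom = h * T_cm + T_cp
        coeff += n_h[h] * T_cp / denom
        const += n_h[h] * h * L / denom
    return (n_t + const) / coeff


def load_fraction(alpha0, h, L, T_cm, T_cp):
    """Equation (6.12): load of a processor that is h hops away from p0."""
    return (alpha0 * T_cp - h * L) / (h * T_cm + T_cp)


n_h = torus_hop_counts(16, 16, 16)   # 4,096 processors, h_max = 24
L = 8.0e-6                           # link latency from Table 6.4, in seconds
T_cp = 19.78 / 4096.01               # assumption: inferred from Table 6.5
alpha0 = solve_alpha0(n_h, n_t=2**24, L=L, T_cm=0.0, T_cp=T_cp)
alpha_hmax = load_fraction(alpha0, len(n_h) - 1, L, 0.0, T_cp)
print(round(alpha0, 2), round(alpha_hmax, 2))
```

Under these assumptions the sketch yields α0 ≈ 4,096.02 and αhmax ≈ 4,095.98, which is close to the pure computation row of Table 6.5.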

Table 6.5: Predicted values of load fractions and parallel execution time on a 3D torus topology

Application type        α0 (# of tasks)    αhmax (# of tasks)    Tpar (sec)
pure computation        4,096.01           4,095.98              19.78
computation-bound       253.03             235.78                10,847.29
intermediate            2,213.41           113.29                15.28
communication-bound     50,361.33          98.17                 308.46

From the application of Theorem 2, we identified that processor p0 is most sensitive to perturbations in processor availability (R) and processor phmax is most sensitive to perturbations in network link characteristics (B). Their execution times are given by:

\[ T_0(R) = \alpha_0 \cdot T^{cp}, \text{ and} \tag{6.14} \]

\[ T_{h_{max}}(R, B) = h_{max} \cdot (L + \alpha_{h_{max}} \cdot T^{cm}) + \alpha_{h_{max}} \cdot T^{cp}. \tag{6.15} \]

$T^{cp}$, which is the time required to compute a task on a processor, is defined as $w_t / R$, where wt is the number of operations required to compute the task, and R is the rating of the processor. Similarly, $T^{cm}$, which is the time required to transfer a task over a network link, is defined as $z_t / B$, where zt is the size of the task and B is the bandwidth of the link. In the presence of perturbations in system characteristics such as R and B, their actual values are different from the nominal values that were used to calculate the load distribution, α. Hence, the actual value of the performance feature Tpar in the presence of perturbations will be different from the estimated value, and is given by:

\[ T_{par}(R, B) = \max(T_0(R), T_{h_{max}}(R, B)). \tag{6.16} \]

Note that Equation (6.15) is also a function of the network link latency L. However, perturbations in L add only a negligible increase to Tpar, whereas the effects of perturbations in R and B are significant and are amplified by the presence of the multiplicative factor (the load fraction). For this reason, we only study the effect of perturbations in R and B.
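
Equations (6.14)-(6.16) can be evaluated directly once the load fractions are fixed. In the sketch below, the perturbations are modeled by scaling the nominal R and B with availability fractions x and y (an assumption consistent with how availability is expressed in Section 6.3); the function name and all numeric values are hypothetical and were chosen only so that both processors finish at the same time in the unperturbed state.

```python
def perturbed_t_par(alpha0, alpha_hmax, h_max, L, w_t, z_t, R, B, x=1.0, y=1.0):
    """Equation (6.16), with processor availability x and bandwidth
    availability y given as fractions of the nominal R and B."""
    t_cp = w_t / (R * x)   # per-task processing time under perturbation
    t_cm = z_t / (B * y)   # per-task transfer time under perturbation
    t0 = alpha0 * t_cp                                             # Equation (6.14)
    t_hmax = h_max * (L + alpha_hmax * t_cm) + alpha_hmax * t_cp   # Equation (6.15)
    return max(t0, t_hmax)


# Hypothetical, balanced configuration: T_0 = T_hmax = 100 when unperturbed.
cfg = dict(alpha0=100.0, alpha_hmax=50.0, h_max=2, L=0.0,
           w_t=1.0, z_t=0.5, R=1.0, B=1.0)
nominal = perturbed_t_par(**cfg)
slow_cpu = perturbed_t_par(**cfg, x=0.5)    # 50% processor availability
slow_link = perturbed_t_par(**cfg, y=0.5)   # 50% bandwidth availability
print(slow_cpu / nominal, slow_link / nominal)
```

In this toy configuration, halving the processor availability doubles the execution time, whereas halving the bandwidth increases it by 50%.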
Figures 6.5-6.8 show the impact of perturbations in processor availability and bandwidth on Tpar as filled contours for the four different application types considered. The x axis represents perturbations in the processor availability and the y axis represents perturbations in the bandwidth availability. The bounds of both axes are in the decreasing range [1.0, 0.5]. The value of 1 represents the unperturbed state in which there is no perturbation in the perturbation parameter (100% of the nominal value is available) and the value of 0.5 represents a perturbed state in which the value of the perturbation parameter is 50% of its unperturbed value (only 50% of the nominal value is available). The reason for choosing 50% as the lower bound for perturbations in both R and B is that increasing the perturbations beyond this bound would increase Tpar by more than 100%, which is not useful in practice. In the figures, γRB represents the ratio of Tpar in the presence of the perturbations to Tpar in the absence of any perturbations in the system. At any given point (x, y) in the figures, Equation (6.16) is used for drawing the contour curves. A number of observations from Figures 6.5-6.8 are discussed below:

Figure 6.5: Predicted values of γRB for the pure computation application

Observation 1: In Figures 6.5-6.8, the region enclosed by a contour curve, the x axis, and the y axis represents a 2-dimensional area within which γRB is less than the value of the contour curve. The bounds of the x and the y axes represent the extent to which the perturbation parameters can vary without the performance degradation exceeding the value of the contour curve. As an illustration, in Figure 6.8, the bounds of the region filled with the lighter shade of gray are x = 0.833 and y = 0.833, which implies that as long as the processor availability and the link bandwidth each remain above 83.3% of their unperturbed values, γRB will not exceed 1.2. As the perturbations in the processor availability and bandwidth increase, their impact on Tpar also increases, as shown by the filled contour regions of progressively darker shades of gray.

Figure 6.6: Predicted values of γRB for the computation-bound application

In the absence of perturbations in link bandwidth, the effect of perturbations in processor availability on all four application types considered is the same: a 50% reduction in the processor availability doubles the overall execution time, Tpar. Processor p0 is more sensitive to perturbations in processor availability than any other processor (based on Theorem 2) and spends no time in communication since the load originates from itself. As a result, perturbations in processor availability have the same relative impact on all four application types considered. However, the actual increase in Tpar is not the same for different applications. For the pure computation application, the actual increase in Tpar is 19.78 seconds and for the computation-bound application it is 10,847.29 seconds when there is a 50% reduction in the processor availability.

Figure 6.7: Predicted values of γRB for the intermediate application

Figure 6.8: Predicted values of γRB for the communication-bound application

Observation 2: The filled contour regions in Figure 6.5 are rectangular in shape. However, the contour curves are only parallel to the y axis. This signifies that perturbations in link bandwidth have no impact on the performance feature, Tpar, since there is no significant communication between the processors in the pure computation application.

Observation 3: For the computation-bound application, the CCR is extremely low (0.002) and therefore the application is more sensitive to perturbations in processor availability than to perturbations in bandwidth availability. As seen in Figure 6.6, a drop of 50% in the link bandwidth will cause Tpar to rise to a value in the range [Tpar, 1.2 · Tpar], whereas the same drop in the processor availability will cause Tpar to rise to a value in the range [1.8 · Tpar, 2.0 · Tpar]. For a given value of processor availability, increasing the perturbations in link bandwidth has no overall impact on Tpar up to a certain point. For instance, when the processor availability is 0.833, increasing the perturbation in link bandwidth until its value also reaches 0.833 has no overall impact on Tpar. The same observation is also true for other values of processor availability.
Observation 4: The filled contour regions are approximately rectangular in shape. From Figure 6.7, it is clear that there is more room to vary the perturbations along the y axis than along the x axis, suggesting that the intermediate application is marginally more sensitive to the perturbations in processor availability than to the perturbations in the bandwidth availability.
Observation 5: The filled contour regions in Figure 6.8 are square shaped. Equations (6.14) and (6.15) are sensitive to perturbations in processor availability and link bandwidth, respectively, and their average rates of change (ARC) over an identical interval are nearly equal. The ARC of a function f(x) over an interval [a, b] is given by $\frac{f(b) - f(a)}{b - a}$. After substituting the known values from Tables 6.4 and 6.5, the ARC of Equation (6.14) over an interval [1, a] along any line y = a is $-\frac{338.23}{a}$ and that of Equation (6.15) over the same interval along any line x = a is $-\frac{337.61}{a}$, where 0.5 ≤ a < 1.0. The nearly equal ARC values of Equation (6.14) and Equation (6.15) indicate the following: (1) The processor phmax spends almost its entire time in communication (waiting to receive the data for processing) and the time spent in computation is rather insignificant, such that any perturbation in the processor availability has no overall impact on Thmax. In the unperturbed state, phmax spends only 0.628 seconds in computation but 307.83 seconds in communication, that is, ≈ 99.79% of the time is spent in communication. (2) For a given level of perturbation a in one of the perturbation parameters, increasing the perturbation level on the other parameter has no overall impact on the execution time of the application until its perturbation level also reaches a.

Figure 6.9: Predicted and simulation values of γR, γB, γRB for the pure computation application

6.5 Simulation results

The design of simulation experiments is summarized in Table 6.6. For the pure computation application, which simulates the EP NAS parallel benchmark, the number of divisible tasks is $2^{24}$, which corresponds to the problem class E (the largest problem class for that benchmark). For the other types of applications, the number of divisible tasks is $10^{6}$. Figures 6.9 - 6.12 show γR, γB, and γRB for the four applications when the variation in the perturbation parameters follows the left, right, diagonal, and the non skewed variation patterns as described in Section 6.3. γRB is defined as the ratio of Tpar when perturbations are present in both the performance impacting factors (processor availability (R) and link bandwidth (B)) to the Tpar in the absence of perturbations. γR and γB are defined similarly when perturbations are present in only one of the performance impacting factors. Based on Equations (6.14)-(6.16), γR, γB, and γRB can be expressed as follows:

Table 6.6: Design of experiments to evaluate the robustness of DLT algorithms

Parameter                           Value
Divisible tasks nt (#)              pure computation: 2^24; other applications: 10^6
System size np (processors)         4,096
System topology                     3D torus
Performance features set Φ          {Tpar}
Perturbation parameters set Π       {R, B, RB}
Variations per perturbation (#)     4 (as shown in Figures 6.1-6.4)


Figure 6.10: Predicted and simulation values of γR, γB, γRB for the computation-bound application

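\[ \gamma_R = \frac{1}{x}, \tag{6.17} \]

\[ \gamma_B = 1 + h_{max} \cdot \frac{\alpha_{h_{max}}}{\alpha_0} \cdot CCR \cdot \frac{R}{B} \cdot \frac{(1 - y)}{y}, \text{ and} \tag{6.18} \]

\[ \gamma_{RB} = \max\left( \gamma_R, \; \gamma_B + \frac{\alpha_{h_{max}}}{\alpha_0} \cdot \frac{(1 - x)}{x} \right), \tag{6.19} \]

where x and y denote the processor availability and the link bandwidth availability, respectively, as fractions of their nominal values, with 0.5 ≤ x, y ≤ 1.0. In the absence of perturbations, γR, γB, and γRB are 1. Based on Equations (6.17) and (6.19), γRB ≥ γR. Based on Equations (6.18) and (6.19), γRB > γB.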
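
The three ratios follow from Equations (6.14)-(6.16) in a few steps. The sketch below assumes that an availability $x$ scales the processor rating to $x \cdot R$ and an availability $y$ scales the bandwidth to $y \cdot B$, so that $T^{cp}$ becomes $T^{cp}/x$ and $T^{cm}$ becomes $T^{cm}/y$, and that the unperturbed schedule is balanced, $T_{par} = \alpha_0 T^{cp} = T_{h_{max}}$:

\[ T_0(x) = \frac{\alpha_0 T^{cp}}{x}, \qquad T_{h_{max}}(x, y) = h_{max} \left( L + \frac{\alpha_{h_{max}} T^{cm}}{y} \right) + \frac{\alpha_{h_{max}} T^{cp}}{x}. \]

Dividing by the unperturbed $T_{par}$ gives

\[ \frac{T_0(x)}{\alpha_0 T^{cp}} = \frac{1}{x}, \qquad \frac{T_{h_{max}}(1, y)}{\alpha_0 T^{cp}} = 1 + h_{max} \cdot \frac{\alpha_{h_{max}}}{\alpha_0} \cdot \frac{T^{cm}}{T^{cp}} \cdot \frac{1 - y}{y}, \qquad \frac{T^{cm}}{T^{cp}} = \frac{z_t}{w_t} \cdot \frac{R}{B}, \]

\[ \frac{T_{h_{max}}(x, y)}{\alpha_0 T^{cp}} = \frac{T_{h_{max}}(1, y)}{\alpha_0 T^{cp}} + \frac{\alpha_{h_{max}}}{\alpha_0} \cdot \frac{1 - x}{x}. \]

Reading $z_t / w_t$ as the CCR (an assumption here, since the CCR is defined elsewhere in this document) recovers Equation (6.18), and taking the maximum of the first and the last expressions, as in Equation (6.16), recovers Equation (6.19).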

Based on Equations (6.17) and (6.18), the inequality γR > γB holds when the perturbation in processor availability matches the perturbation in link bandwidth (x = y), provided that the following condition is true:

\[ h_{max} \cdot \frac{\alpha_{h_{max}}}{\alpha_0} \cdot CCR \cdot \frac{R}{B} < 1. \tag{6.20} \]

The values for the notations in Equations (6.17)-(6.20) can be found in Tables 4.1, 6.4, and 6.5. Based on those values, the condition represented by Equation (6.20) is always true. This indicates that the effect of the perturbations in processor availability is more pronounced than the effect of the perturbations in network link bandwidth within the context of the applications, platform, and the scheduling methodology considered.

Figure 6.11: Predicted and simulation values of γR, γB, γRB for the intermediate application

Figure 6.12: Predicted and simulation values of γR, γB, γRB for the communication-bound application

In Figures 6.9 - 6.12, each bar represents the overlap of green, blue, and red bars and is obtained by layering the blue and the red bars over the green bar. The heights of the green and the red bars represent the predicted min and predicted max values, and the height of the blue bars represents the values obtained via simulation. Since the green, blue, and red bars overlap, only one of them is shown in the case when the bars have an equal value. When the predicted min value matches the simulation value, only the green bar is shown (Figure 6.10). Similarly, when the simulated value matches the predicted max value, only the blue bar is shown (Figure 6.10). When the predicted min, simulation, and predicted max values are equal, only the green bar is shown (Figure 6.9).

The predicted values are obtained from Figures 6.5 - 6.8. For instance, the predicted
values of γR in Figure 6.9 are obtained by using the minimum and maximum values from
Table 6.2 in Figure 6.5 along the line where bandwidth availability is 100% (that is along
the line y = 1.0). Other predicted values (green and red bars) are obtained in a similar
fashion. In Figures 6.9 - 6.12, the blue bars (values obtained via simulation) are either
within the analytically predicted bounds (green and red bars) or overlap one of them. This
indicates the applicability of DLT for predicting the performance even in situations where
there are perturbations in performance impacting factors. The analysis in this section is
based on the values obtained via simulation. The figures reveal a number of interesting
observations.
Observation 1: The ratio γB/γR is an indicator of the relative sensitivity of the DLT solution to perturbations in network link bandwidth with respect to perturbations in processor availability, and is a function of CCR. A value < 1 for this ratio indicates that the DLT solution is more sensitive to perturbations in processor availability. A value > 1 indicates that the DLT solution is more sensitive to perturbations in network link bandwidth. A value of 1 for this ratio indicates that the DLT solution is equally sensitive to perturbations in both performance impacting factors. From Figures 6.9 - 6.12, it is evident that with increasing CCR, the ratio γB/γR also increases and has the highest value for the communication-bound application. In general, the ratio γB/γR is < 1 (the middle blue bar is shorter than the left blue bar in each of the Figures 6.9 - 6.12 for all the variation patterns) except for the communication-bound application when the perturbations follow the right skewed variation pattern. DLT partitions the workload such that in the absence of perturbations, all processors finish computing at the same time. As a result, the sum of the computation and communication times is equal for all the processors. This also suggests that when the perturbation intensities are equal, the effect of perturbations in processor availability on processor p0 will be greater than the effect of the same perturbations in network link bandwidth on processor phmax. In all variation patterns except the right skewed pattern, the perturbation intensity is higher on processor p0 than on any other processor in the system and hence the ratio γB/γR is < 1. For the right skewed variation pattern, the perturbation intensity on processor p0 is the lowest, which in conjunction with the high CCR for the communication-bound application, causes the ratio γB/γR to be > 1. It must be noted that the ratio γB/γR is dependent on the CCR and not on the communication requirement of a task (zt). The computation-bound and the communication-bound applications have the same communication requirements, while the communication-bound application has a higher γB/γR ratio due to its higher CCR.
Observation 2: The ratio γR/γRB is an indicator of the sensitivity of the DLT solution to perturbations in network link bandwidth considering that there are perturbations in processor availability. The theoretical range for the value of this ratio is (0, 1] based on Equations (6.17) and (6.19). A value of 1 indicates that the DLT solution is not sensitive to perturbations in network link bandwidth since they have no impact on the overall execution time of the application. The lower the value of this ratio, the higher the sensitivity of the DLT solution to perturbations in network link bandwidth considering that there are simultaneous perturbations in processor availability. The ratio is ≈ 1 in all cases except in the case of the right skewed variation pattern for the intermediate and the communication-bound applications, when the ratio is < 1. Figures 6.5 - 6.8 can also be used to deduce the value of this ratio. From the figures, the values of the contour curves at points (a, 1) and (a, a) are equal when 0.5 ≤ a ≤ 1.0. A value < 1 of the ratio γR/γRB in the case of the right skewed variation is attributed to the same reason that causes the ratio γB/γR to exceed the value of 1, as discussed above.
Observation 3: The ratio γB/γRB is an indicator of the sensitivity of the DLT solution to perturbations in processor availability considering that there are perturbations in network link bandwidth. The theoretical range for the value of this ratio is (0, 1) based on Equations (6.18) and (6.19). In all cases, the value of this ratio is < 1, concurring with the theoretical prediction. Since γB is a function of CCR, this ratio increases with the increase in CCR.

Table 6.7: Robustness analysis

                     CCR=0    CCR=0.002    CCR=0.579    CCR=16.0
rDLT(Tpar, R)        0.748    0.748        0.748        0.748
rDLT(Tpar, B)        N/A      0.682        0.682        0.748
rDLT(Tpar, RB)       0.748    0.748        0.748        0.748
ρDLT(Tpar, R)        0.748    0.748        0.748        0.748
ρDLT(Tpar, B)        N/A      0.682        0.682        0.748
ρDLT(Tpar, RB)       0.748    0.748        0.748        0.748

6.5.1 Robustness analysis

Let us suppose the bounds $\langle \beta^{\min}_{\phi_i}, \beta^{\max}_{\phi_i} \rangle$ of the tolerable impact on the performance feature Tpar are <1.0, 1.5>. Then by Equation (6.1), there are three robustness radii for each application, one for each perturbation parameter R, B, and RB in Π. Similarly, by Equation (6.2), there are three robustness metrics for each application. The robustness radius of DLT with respect to φi = Tpar for each perturbation parameter represents the maximum deviation of the perturbation parameter from its unperturbed state without causing Tpar to exceed the tolerable bounds set above. The robustness metric of DLT with respect to Φ for each perturbation parameter in Π is the minimum of all robustness radii. Since Φ contains only one element, Tpar, the robustness metric is equal to the robustness radius.
The robustness radii and the robustness metrics for all the applications with respect to the variation patterns considered are given in Table 6.7. In most cases, their values rDLT(Tpar, πj) and ρDLT(Tpar, πj) correspond to the average value of the non skewed variation pattern (0.748) (see Table 6.2) and in some cases, correspond to the average value of the right skewed variation pattern (0.682). The left and the diagonal skewed variations cause Tpar to exceed the acceptable bounds, as indicated by the blue bars in Figures 6.9 - 6.12. The perturbation parameter, B, has no impact on the pure computation application, and hence rDLT(Tpar, B) and ρDLT(Tpar, B) are not defined for that application.
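
As a rough cross-check of the entries for R, the predicted bounds of γR for each variation pattern can be computed from the extrema in Table 6.2 with Equation (6.17) and compared against the upper tolerable bound of 1.5. This is a sketch under the assumption that the predicted minimum and maximum of γR are the reciprocals of the largest and smallest average availability, respectively (in line with how the predicted values are read along the line y = 1.0); the names below are hypothetical.

```python
# Extrema of the average availability per variation pattern, copied from Table 6.2.
EXTREMA = {
    "left": (0.596, 0.840),
    "right": (0.539, 0.801),
    "diagonal": (0.567, 0.659),
    "non skewed": (0.688, 0.801),
}
BETA_MAX = 1.5  # upper tolerable bound on the slowdown of T_par


def gamma_r_bounds(avail_min, avail_max):
    """Predicted (min, max) of gamma_R = 1 / x from Equation (6.17)."""
    return 1.0 / avail_max, 1.0 / avail_min


within = {name: gamma_r_bounds(lo, hi)[1] <= BETA_MAX
          for name, (lo, hi) in EXTREMA.items()}
for name, (lo, hi) in EXTREMA.items():
    g_min, g_max = gamma_r_bounds(lo, hi)
    print(f"{name:>10}: gamma_R in [{g_min:.3f}, {g_max:.3f}]  within bound: {within[name]}")
```

Under this reading, only the non skewed pattern keeps the predicted upper bound of γR below 1.5. A predicted upper bound above 1.5 does not by itself mean that the simulated value exceeds the tolerance, since the simulated value can lie anywhere between the predicted minimum and maximum.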
The use of four types of variation patterns reveals an interesting observation: the impact of a perturbation on Tpar is determined not only by its intensity but also by its concentration relative to the system resources. The average value of the left skewed variation pattern (0.722) is higher than the average value of the right skewed variation pattern (0.682), and yet, in certain cases, the left skewed variation pattern has a higher impact on Tpar. For instance, for the intermediate and the communication-bound applications, γR is higher for the left skewed variation pattern than it is for the right skewed variation pattern, as seen in Figures 6.11 and 6.12.

6.6

Conclusions
The goal of this study is to analytically predict and empirically evaluate the robustness

of divisible load theory (DLT) algorithms when applied to schedule arbitrarily divisible
workloads in an environment characterized by perturbations in the operating conditions.
DLT was employed to schedule applications that exhibit different CCRs ranging from pure
computation to communication-bound onto a target platform modeled as a 3D torus topology. The deterministic nature of the DLT was leveraged for the analytical prediction of
the robustness of the DLT algorithms. The results indicate that the robustness observed
via simulation is always within the analytically predicted range for the given applications,
system, and perturbations. The results emphasize the predictive power of the divisible load
theory numerical model even in environments where the actual values of the system parameters are different from the nominal values. This study can further be augmented by
studying the effects of congestion in network links, which was not considered herein.

CHAPTER 7
EFFECT OF TOPOLOGY ON SCHEDULING

In this chapter, the effect of the topology on the performance of DLT and AF algorithms
in the context of the master-workers paradigm is studied. A topology has several properties,
such as the diameter, the bisection width, the bisection bandwidth, and others. In this
study, our focus is on routing and the congestion that may result from it, which has an
impact on communication costs. The routing algorithm is dependent on the topology of
the network. The choice of the AF algorithm is motivated by its strong theoretical foundation
among all the DLS algorithms. Finally, some recommendations to alleviate congestion are
also presented.

7.1 Routing and congestion
The DLT and the AF algorithms assume that the work to be processed resides at a single
location and is hence amenable to a master-workers scheduling paradigm. A one-to-all
communication in this scheduling paradigm can lead to congestion in the network links
depending upon the volume of the data transfer, the platform topology, and the routing
scheme. Three topologies, namely, star, 3D torus, and fat-tree, are considered, and their
effects on the scheduling of the three applications described in Section 4.1.3 are studied.
For our evaluation purpose, we only consider the case where the congestion arises due to
scheduling within an application and not due to the presence of network packets that may
originate from other applications in the system.
SimGrid provides the ability to model the sharing policy of a network link as either
shared or fatpipe. When a network link is shared, the network flows (or network packets)
flowing through the link at the same time instant each receive a portion of that link's
bandwidth, whereas in the case of fatpipe, each network flow receives the entire bandwidth
of that link. In the case of the shared policy, the bandwidth of a flow is dependent on the
presence of other flows through the link at the same time instant, and it is independent of
them in the case of the fatpipe policy. The design of experiments to study the impact of
congestion on scheduling performance is presented in Table 7.1. In all the experiments, the
shared policy is used to simulate the presence of congestion and the fatpipe policy is used to
simulate the absence of congestion. Modeling a network link as a fatpipe provides a basis
for comparison.
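The difference between the two policies can be illustrated with a small model. The following is a minimal sketch, not SimGrid code: the function names are hypothetical, and the equal split of bandwidth among concurrent flows is an assumption made for illustration (the simulator's own sharing model may differ in detail).

```python
def flow_bandwidth(link_bandwidth, concurrent_flows, policy):
    """Bandwidth seen by one flow on a link under the two sharing policies."""
    if concurrent_flows < 1:
        raise ValueError("at least one flow must be present")
    if policy == "fatpipe":
        # Every flow receives the entire link bandwidth.
        return link_bandwidth
    if policy == "shared":
        # Assumption: concurrent flows split the bandwidth equally.
        return link_bandwidth / concurrent_flows
    raise ValueError(f"unknown policy: {policy}")


def transfer_time(size, latency, link_bandwidth, concurrent_flows, policy):
    """Latency plus size divided by the bandwidth available to the flow."""
    return latency + size / flow_bandwidth(link_bandwidth, concurrent_flows, policy)
```

With the link bandwidth of Table 7.1 (1.5 GB/s), 32 concurrent flows on a shared link would each see 1.5/32 GB/s under this model, while a fatpipe link would still offer 1.5 GB/s to each flow.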
Table 7.1
Design of experiments to study the impact of congestion

Parameter                        Value
Divisible tasks nt (#)           2.5·10^5, 5.0·10^5, 1.0·10^6
Applications                     computation-bound, intermediate, communication-bound
System size np (processors)      256, 512, 1024, 2048, 4096
System topology                  star, 3D torus, fat-tree
Node power rating (GFLOPS)       1.72
Link latency L (µseconds) [6]    8
Link bandwidth B (GB/s) [6]      1.5

7.2 Star network
In a star network, the root node is directly connected to all its children nodes. In this
topology, the natural choice for the master node is the root node, with the children nodes
being the worker nodes. The worker nodes are only one hop away from the master node.
Routing is straightforward, with the master node utilizing the dedicated network links to
communicate with the worker nodes. The network links are dedicated since only the traffic
intended for a worker node utilizes the network link that connects it with the
master node. The dedicated nature of the network links makes the network congestion free
when an application is scheduled using DLT or AF algorithms. The simulation results also
confirm that there are no effects of congestion on the performance of DLT and AF algorithms.

7.3 3D torus network
In a 3D torus network, each node is connected to six other nodes, two nodes along
each of the x, y, and z axes. In this topology, there is no natural choice for the master node,
and the node with id = 0 is used as the master node. One way to route the messages in
this topology is through the use of XYZ routing. Unlike in the case of the star network, the
network links are not dedicated and can transfer messages that are not intended for either
of the nodes at the end points of a network link. This can lead to congestion when
large volumes of data are transferred. The 3D torus topology is described in detail in
Section 4.2.3.

7.3.1 Dimension order routing
In a 3D torus network, each node is directly connected to six other neighboring nodes
and hence each node can communicate with its neighbors simultaneously. However, in order
for the network links to be utilized efficiently, the routing scheme must take advantage
of the topological properties of the network. The XYZ routing scheme can result in
load imbalance if messages are strictly routed along the x axis first, the y axis next, and
finally along the z axis.
Each node $p_i$ in the network has a unique representation $(p_{ix}, p_{iy}, p_{iz})$, such that
$0 \le p_{ix} < n_x$, $0 \le p_{iy} < n_y$, and $0 \le p_{iz} < n_z$, where $n_x$, $n_y$, and $n_z$
represent the number of nodes along the x, y, and z axes, respectively. The master node has
the representation (0, 0, 0). When a one-to-all communication is initiated from the master
node to distribute the tasks, a certain number of messages travel along each of the x, y, and
z axes, such that their sum is equal to np − 1. In the case of the XYZ routing scheme, the
number of messages that travel along the x axis first is equal to the number of worker nodes
$p_w$ such that $p_{wx} \neq 0$. The number of such worker nodes is given by
$(n_x - 1) \cdot n_y \cdot n_z$. Since there are two communication links along the x axis, the
maximum number of messages traveling in either direction of the x axis is given by
$\lceil (n_x - 1)/2 \rceil \cdot n_y \cdot n_z$. This number is the congestion factor cf and
represents the number of messages that compete for the network link bandwidth.
For instance, in a 3D torus network with 64 nodes, such that $n_x = n_y = n_z = 4$, the
congestion factor is $\lceil 3/2 \rceil \cdot 4 \cdot 4 = 32$, i.e., the bandwidth for each
communication traveling along the x axis will be reduced by a factor of 32 due to the
presence of other messages that travel simultaneously. The congestion factor cf is dependent
on the number of nodes in the network and hence increases with the increase in the system
size.
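The congestion factor of the XYZ routing scheme can be evaluated directly from the torus dimensions. The following minimal sketch implements the expression given above; the function name is illustrative and not part of the original study.

```python
import math


def xyz_congestion_factor(nx, ny, nz):
    """Maximum number of messages sharing one x-axis link of the master
    under XYZ routing: ceil((nx - 1) / 2) * ny * nz."""
    return math.ceil((nx - 1) / 2) * ny * nz
```

For the 64-node example in the text, `xyz_congestion_factor(4, 4, 4)` evaluates to 32.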
An interesting question is whether the location of the master node has an impact on
the scheduling performance when the XYZ routing scheme is employed. Let us suppose that
the master node has the representation $(p_{mx}, p_{my}, p_{mz})$. The number of worker nodes
$p_w$ whose $p_{wx} \neq p_{mx}$ is given by $(n_x - 1) \cdot n_y \cdot n_z$. Hence, the congestion
factor cf will be the same as in the case where the master node has the representation
(0, 0, 0). This implies that the nodes in the network are indistinguishable and the location
of the master node in the network has no impact on the scheduling performance when the
XYZ routing scheme is employed.

7.3.2 Impact of congestion
Figures 7.1 - 7.3 show the impact of congestion on the performance of DLT and AF
algorithms when used to schedule the computation-bound, intermediate, and
communication-bound applications described in Section 4.1.3. The x axis represents the
number of processors used to solve the application. The y axis represents the percentage
reduction in the efficiencies of DLT and AF algorithms due to the presence of congestion in
the network links. If $\eta_{nc}$ and $\eta_c$ represent the efficiency in the absence and
presence of congestion, respectively, then the percentage reduction in the efficiency can be
computed as $\left( \frac{\eta_{nc} - \eta_c}{\eta_{nc}} \right) \cdot 100$. Figures 7.1 - 7.3
reveal the following:

Figure 7.1
Performance degradation of DLT and AF algorithms due to congestion, on 3D torus
networks for the computation-bound application

1. Fine grained tasks are tasks with low CCR and are crucial to obtain high execution
efficiency from parallel processing. Conversely, coarse grained tasks can result in low
execution efficiency from parallel processing. As a result, the computation-bound
application is less impacted by congestion than the other types of applications, i.e.,
with the increase in the CCR, the impact of congestion also increases. This phenomenon
is true for both the algorithms.
2. In general, for a given system size, the increase in the problem size (nt) does not
have an impact on the reduction in the efficiency. This is due to the fact that the
congestion factor cf is only dependent on the system size and not on the problem
size. This phenomenon is true for both the algorithms and for all the applications.

Figure 7.2
Performance degradation of DLT and AF algorithms due to congestion, on 3D torus
networks for the intermediate application

3. With the increase in the number of processors (np), the impact of congestion also
increases due to the increase of the congestion factor cf. As a result, the percentage
efficiency reduction increases with the increase in the number of processors. This
phenomenon is true for both the algorithms and for all the applications.
4. The impact of congestion in general is higher for the DLT algorithm than it is for the
AF algorithm. This is due to the presence of one-to-all communication in the DLT
algorithm, where all tasks are distributed at once, leading to congestion in the network
links. The AF algorithm, on the other hand, distributes tasks in smaller chunks, which
in turn leads to less congestion in the network links.

Figure 7.3
Performance degradation of DLT and AF algorithms due to congestion, on 3D torus
networks for the communication-bound application

7.4 Fat-tree network
The fat-tree networks used in the simulations are 2-level fat-trees modeled after the
Stampede system at TACC. As in the case of the 3D torus topology, there is no natural choice
for the master node, and the node with id = 0 is considered as the master node. Similar
to the 3D torus topology, the network links are not dedicated, which can lead to congestion
when large volumes of data are transferred. The fat-tree topology is described in
detail in Section 4.2.4.

Figure 7.4
Performance degradation of DLT and AF algorithms due to congestion, on fat-tree
networks for the computation-bound application

7.4.1 Destination-mod-k routing
One way to route messages in this topology is through the use of the destination-mod-k
routing scheme, also known as D-mod-k routing. The D-mod-k routing scheme consists of
two phases. In phase 1, a message is routed up to the common ancestor node between the
source node s and the destination node d. In phase 2, the message is routed down from
the common ancestor node to the destination node d. At each level k of the tree during
phase 1 of the communication, the parent node is identified based on
$\left\lfloor d \,/\, \prod_{i=0}^{k-1} u_i \right\rfloor \bmod u_k$,
where $1 \le k \le h$ and $u_0 = 1$. In phase 2, at each level of the tree along the path from the
common ancestor node to the destination node d, the children nodes of the current node
are examined to see if the destination node d resides in the sub-tree of the child node.
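The phase 1 parent selection can be sketched as a one-line computation. This is a minimal illustration under stated assumptions: the function name is hypothetical, the list `u` holds u0 = 1 followed by u1, ..., uh, and each uk is interpreted here as the number of candidate parents at level k, which the text above does not spell out.

```python
from math import prod


def up_port(d, k, u):
    """Index of the parent chosen at level k for destination d:
    floor(d / (u[0] * ... * u[k-1])) mod u[k]."""
    if not 1 <= k < len(u):
        raise ValueError("level k must satisfy 1 <= k <= h")
    return (d // prod(u[:k])) % u[k]
```

For example, with the hypothetical values `u = [1, 2, 2]` and destination `d = 5`, the selection is `5 mod 2 = 1` at level 1 and `floor(5 / 2) mod 2 = 0` at level 2.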

Figure 7.5
Performance degradation of DLT and AF algorithms due to congestion, on fat-tree
networks for the intermediate application

Figure 7.6
Performance degradation of DLT and AF algorithms due to congestion, on fat-tree
networks for the communication-bound application

Figure 7.7
Bottleneck link in a fat-tree network

7.4.2 Impact of congestion
Figure 7.7 shows the bottleneck link (identified by the red color) in a simple 2-level
fat-tree network with 4 compute nodes. Every message originating from node 0, regardless
of the destination node, will use the link that connects node 0 to the first level switch,
and hence it is a bottleneck link. For a one-to-all communication, the congestion factor cf
for this topology is (np − 1). Thus, with the increase in the system size, the congestion
factor will also increase linearly. Figures 7.4 - 7.6 reveal the following:
1. The impact of congestion on the performance of the DLT and AF algorithms is higher
for the fat-tree topology than it is for the 3D torus topology for a given application,
problem size, and system size. This is due to the higher congestion factor cf for the
fat-tree topology: $(n_p - 1)$ for the fat-tree topology vs.
$\lceil (n_x - 1)/2 \rceil \cdot n_y \cdot n_z$ for the 3D torus topology. Note that
$n_p = n_x \cdot n_y \cdot n_z$.
2. As in the case of the 3D torus topology, the communication-bound application is
impacted more by the congestion in the network links than the other types of
applications. With the increase in the CCR, the impact of congestion also increases. This
phenomenon is true for both the algorithms.
3. In general, as in the case of the 3D torus topology, the increase in the problem
and the system sizes has a similar impact on the performance of DLT and AF
algorithms, for the same reasons.
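As a worked comparison of the two congestion factors, the following derivation assumes an even number of nodes along the x axis, so that the ceiling term simplifies; this restriction is made only for illustration. Under that assumption, the fat-tree congestion factor is almost twice that of the 3D torus.

```latex
\begin{align*}
c_f^{\text{torus}} &= \left\lceil \tfrac{n_x - 1}{2} \right\rceil \cdot n_y \cdot n_z
                    = \tfrac{n_x}{2} \cdot n_y \cdot n_z = \tfrac{n_p}{2}, \\
c_f^{\text{fat-tree}} &= n_p - 1, \\
\frac{c_f^{\text{fat-tree}}}{c_f^{\text{torus}}} &= \frac{n_p - 1}{n_p / 2} = 2 - \frac{2}{n_p}.
\end{align*}
% Example: n_x = n_y = n_z = 4, n_p = 64 gives c_f = 32 (torus) versus 63 (fat-tree).
```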

7.5 How to alleviate congestion?
Congestion arises in the network links when large volumes of data are transferred and
when multiple network flows travel over the network links at the same time instant.
Congestion in the network links reduces the bandwidth available to a network flow, which in
turn increases its travel time. In this section, the effectiveness of two approaches, namely,
the multi-round divisible load scheduling and the round robin (RR) dimension order routing,
in reducing the congestion in the network links is evaluated. The multi-round divisible
load scheduling reduces the amount of data transferred at a given instant by sending the
data in smaller portions (or chunks). The RR dimension order routing reduces the number
of network flows going over the network links at the same time instant by effectively utilizing
the topological properties of 3D torus networks.

7.5.1 Multi-round divisible load scheduling
Divisible load theory offers a multi-round scheduling scheme in which the data can be
distributed in multiple rounds in smaller chunks. Scheduling using multiple rounds can
reduce the congestion in the network links. Identifying the optimal number of rounds,
however, is a topic by itself, and the theory does not provide a mechanism to identify
the optimal number of rounds. Some of the works on multi-round divisible load scheduling
include [80] [39] [64]. We employ a simple heuristic to distribute the data in multiple
rounds. In the multi-round scheduling scheme, the master processor distributes the
data in chunks of $\alpha_{min} = \min(\alpha_0, \alpha_1, \ldots, \alpha_{n_p - 1})$ tasks.
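The chunking heuristic can be sketched as follows. This is one possible reading, not the implementation used in the study: the function name is hypothetical, `alphas[0]` is taken to be the master's own share, and sending at most one chunk to every worker per round until its single-round allocation is exhausted is an assumption, since the text only fixes the chunk size.

```python
def multi_round_chunks(alphas):
    """Split the single-round allocations alphas[1:] (in tasks) into rounds of
    chunks no larger than alpha_min = min(alphas).

    Returns a list of rounds; rounds[r][w] is the number of tasks sent to
    worker w + 1 in round r.
    """
    chunk = min(alphas)
    if chunk <= 0:
        raise ValueError("every processor must be assigned at least one task")
    remaining = list(alphas[1:])  # the master keeps alphas[0] for itself
    rounds = []
    while any(remaining):
        sent = [min(r, chunk) for r in remaining]
        remaining = [r - s for r, s in zip(remaining, sent)]
        rounds.append(sent)
    return rounds
```

For the hypothetical allocation `[8, 6, 4, 2]`, the chunk size is 2 and the three workers receive their 6, 4, and 2 tasks over three rounds.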

Figure 7.8
Comparative performance of single-round and multi-round DLT algorithms on 3D torus
and fat-tree networks for the computation-bound application

Figure 7.9
Comparative performance of single-round and multi-round DLT algorithms on 3D torus
and fat-tree networks for the intermediate application

Figures 7.8 - 7.10 show the percentage improvement in the efficiency of the DLT algorithm
on 3D torus and fat-tree networks when multi-round scheduling is employed over
single-round scheduling. The red, blue, and green bars represent the improvement in the
efficiency of the multi-round DLT algorithm for the three problem sizes considered on 3D
torus networks. Similarly, the cyan, purple, and yellow bars represent the same on fat-tree
networks. Each bar represents the overlap of two bars; for instance, the red and the cyan
bars overlap when the problem size is 2.5e5 tasks. Since the bars overlap, the visible color of
a bar may differ marginally from the six base colors shown in the legend of the figures,
depending upon which topology yields the better performance improvement. For instance, in
Figure 7.8, when np = 4096 and nt = 1.0e6, the performance improvement is higher on
the fat-tree network and hence the yellow bar is taller than the lighter green bar, which
represents the performance improvement on the 3D torus network. The lighter green bar is
due to the combination of the yellow and the green bars.

Figure 7.10
Comparative performance of single-round and multi-round DLT algorithms on 3D torus
and fat-tree networks for the communication-bound application

The improvements in the efficiency are in the range [−32.91% − 129.07%] and
[0.001% − 8.63%] for the 3D torus and fat-tree networks, respectively. Figures 7.8 - 7.10
reveal the following:
1. The congestion factor cf is independent of the problem size. As a result, for a
given system size, the effect of the congestion factor cf remains the same for all the
problem sizes. This results in very similar performance improvement for the DLT
multi-round algorithm on both the topologies, though the performance improvement
is insignificant on the fat-tree topology.
2. The congestion factor cf is directly proportional to the system size and hence
increases with the increase in the system size. For a given problem size, with the
increase in the system size, the performance improvement for the DLT multi-round
algorithm also increases. This is because, with the increase in the congestion factor
cf, the performance of the DLT single-round algorithm decreases.
3. In general, for all the applications, the DLT multi-round algorithm yields better
performance than the single-round algorithm on 3D torus networks. Since the master
node can communicate with six other worker nodes simultaneously, reducing the
amount of data transferred decreases the communication time associated with
transferring the data, which in turn leads to a shorter overall execution time.
4. The multi-round DLT algorithm on fat-tree networks provides only marginal
performance improvement: [0.001% − 8.63%] improvement in the efficiency. With the
increase in the CCR, the performance improvement drops. As shown in Figure 7.7,
the link that connects the master node to the lowest level switch is the bottleneck
link and the entire data travels over that link. Thus, a single master-workers paradigm
might not be the best approach on fat-tree networks. Having more than one master
node, for instance, treating all the nodes connected to a lowest level switch as
masters, might provide better performance on this topology and is worth exploring.
In a masters-workers paradigm, the coordination between the master nodes as well as
the data storage strategy need to be taken care of. The data storage strategy
determines whether the data is duplicated in all the master nodes or partitioned among
them in a certain fashion.

7.5.2 Round robin (RR) dimension order routing
In Section 7.3.1, we described the behavior of the XYZ routing scheme. This routing
scheme is straightforward but can suffer from load imbalance, specifically when a
collective communication, such as a one-to-all communication, is involved. The congestion
factor cf, which represents the maximum number of messages traveling simultaneously
along one of the six communication links connected to the master node, is given by
$\lceil (n_x - 1)/2 \rceil \cdot n_y \cdot n_z$. By utilizing all the six communication links
connected to the master node, the congestion factor cf can be reduced to
$\frac{n_x \cdot n_y \cdot n_z}{6}$. The reduction in the congestion
factor cf can be achieved by modifying the XYZ routing scheme to choose equally among the
x, y, and z dimensions as the dimension along which a message will be routed first. The
modified routing scheme is referred to as round-robin (RR) dimension order routing, and
messages are routed along the XYZ or YZX or ZXY dimensions in a cyclic fashion.
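The effect of rotating the dimension order can be illustrated by tallying, for a one-to-all communication, how many messages leave the master on each of its six links. The following is a minimal sketch under several assumptions that are not taken from the text: the master is at (0, 0, 0), only the first hop is counted, a message skips a dimension in which its offset is zero, a tie between the two directions of a ring goes to the '+' direction, and the three orders are assigned cyclically by the enumeration rank of the destination. The tally therefore need not reach the ideal value quoted above.

```python
from collections import Counter
from itertools import product

ORDERS = ("xyz", "yzx", "zxy")  # the cyclic rotations named in the text


def first_hop(dest, dims, order):
    """Master link (axis, direction) on which a message to dest first travels."""
    for axis in order:
        i = "xyz".index(axis)
        offset = dest[i] % dims[i]
        if offset != 0:
            # Shorter way around the ring; ties go to '+' (assumption).
            return (axis, "+") if offset <= dims[i] // 2 else (axis, "-")
    raise ValueError("destination equals the master node")


def master_link_loads(dims, round_robin):
    """Number of messages per master link for a one-to-all communication."""
    loads = Counter()
    workers = [p for p in product(*(range(n) for n in dims)) if p != (0, 0, 0)]
    for rank, dest in enumerate(workers):
        order = ORDERS[rank % 3] if round_robin else "xyz"
        loads[first_hop(dest, dims, order)] += 1
    return loads
```

For a 4 × 4 × 4 torus, `master_link_loads((4, 4, 4), round_robin=False)` places 32 of the 63 messages on the '+x' link, in agreement with the congestion factor of Section 7.3.1, whereas the round-robin variant spreads the same 63 messages more evenly over the six links.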
Figures 7.11 - 7.13 show the percentage improvement in the efficiency of DLT and AF
algorithms when the RR dimension order routing scheme is employed over the XYZ routing
scheme on 3D torus networks. The red, blue, and green bars represent the improvement
in the efficiency of the DLT algorithm for the three problem sizes considered. Similarly, the
cyan, purple, and yellow bars represent the same for the AF algorithm. Each bar represents
the overlap of two bars; for instance, the red and the cyan bars overlap when the problem
size is 2.5e5 tasks. Since the bars overlap, the visible color of a bar may differ marginally
from the six base colors shown in the legend of the figures, depending
upon which algorithm yields more performance improvement. For instance, in Figure 7.11,
when np = 4096 and nt = 1.0e6, the performance improvement is higher for the AF
algorithm and hence the yellow bar is taller than the lighter green bar, which represents the
performance improvement of the DLT algorithm. The lighter green bar is due to the
combination of the yellow and the green bars. Similarly, in Figure 7.11, when np = 4096 and
nt = 5.0e5, the performance improvement is higher for the DLT algorithm and hence the
blue bar is taller than the darker purple bar, which represents the performance improvement
of the AF algorithm. The darker purple bar is due to the combination of the blue and the
purple bars.

Figure 7.11
Comparative performance of RR dimension order and XYZ routing schemes on 3D torus
networks for the computation-bound application

Figure 7.12
Comparative performance of RR dimension order and XYZ routing schemes on 3D torus
networks for the intermediate application

Figure 7.13
Comparative performance of RR dimension order and XYZ routing schemes on 3D torus
networks for the communication-bound application

The improvements in the efficiency are in the range [16.27% − 173.56%] and
[−0.65% − 142.38%] for the DLT and AF algorithms, respectively. Figures 7.11 - 7.13
reveal the following:

1. The congestion factor cf is independent of the problem size. As a result, for a
given system size, the effect of the congestion factor cf remains the same for all the
problem sizes. This results in very similar performance improvement for the DLT
algorithm. For all applications and for all system sizes, the heights of the red, blue,
and green bars are very close to each other for all problem sizes. The AF algorithm
dynamically adapts to load imbalances and hence does not follow the same pattern.
2. The congestion factor cf is directly proportional to the system size and hence
increases with the increase in the system size. For a given problem size, with the
increase in the system size, the performance improvement for the DLT algorithm
also increases. This is because, with the increase in the congestion factor cf, the
performance of the DLT algorithm using XYZ routing decreases. For all applications
and for all problem sizes, the heights of the red, blue, and green bars increase
with the increase in the system size. The AF algorithm dynamically adapts to load
imbalances and hence does not follow the same pattern.
3. In general, for the DLT algorithm, the RR dimension order routing scheme yields a
significant performance improvement over the XYZ routing scheme for all applications.
This is due to the decrease in the congestion factor cf with the RR dimension
order routing scheme.
4. In general, the performance improvement with RR dimension order routing for the
AF algorithm is not as much as it is with the DLT algorithm. In most cases, the
performance improvement is in single digits. This is due to the following: (a) unlike
the DLT algorithm, the AF algorithm does not distribute the tasks at once via a
one-to-all communication and hence the amount of data flowing through the network
links at any given instant is not as much as it is in the case of the DLT algorithm, and (b)
the AF algorithm dynamically adapts to load imbalances.

7.6 Conclusions
In this chapter, a study of the effect of topology on the performance of the scheduling
algorithms was presented. Routing algorithms are responsible for the routing of messages
in a network and are dependent on the network topology. The nature of the routing algorithms,
combined with large volumes of data transfer and multiple network flows at the same time
instant, gives rise to congestion in the network links. The impact of congestion on the
performance of DLT and AF algorithms was investigated on three topologies, namely, star, 3D
torus, and fat-tree. A new term, the congestion factor cf, which represents
the number of network flows that compete for the network link bandwidth, was introduced.
In the star topology, the value of cf is 1 since the root node (which acts as the master
node) is connected to the worker nodes by dedicated links. The links are dedicated since
they carry only the traffic intended for the worker nodes. Since congestion is absent in
star networks, the performance of the DLT and the AF algorithms was not affected. The other
two topologies, namely, the 3D torus and the fat-tree topologies, however, do not have
dedicated links and hence the performance of DLT and AF algorithms is affected by
congestion in the network links. The effectiveness of two approaches in alleviating
congestion, namely, the multi-round DLT algorithm and the round robin (RR) dimension order
routing scheme for the 3D torus topology, was analyzed. On 3D torus networks, the multi-round
scheduling scheme and the RR dimension order routing scheme yielded performance improvements
in the range [−32.91% − 129.07%] and [−0.65% − 142.38%], respectively, suggesting
the effectiveness of these approaches for the 3D torus topology. On fat-tree networks, the
multi-round scheduling scheme yielded marginal performance improvement in the range
[0.001% − 8.63%]. On fat-tree networks, the link connecting the master node to the lowest
level switch is the bottleneck link and the entire data travels over this link. As a result, the
master-workers paradigm does not yield a significant performance improvement. A topic for
further study is to explore the effectiveness of the masters-workers paradigm for the fat-tree
topology.

CHAPTER 8
A HYBRID APPROACH TO DIVISIBLE LOAD SCHEDULING

In this chapter, we first discuss the processor equivalence concept offered by divisible
load theory (DLT) and then demonstrate how it can be applied to replace a network
with a large number of elements with a simplified equivalent network with fewer elements.
Utilizing equivalent networks, we present the design, analysis, and evaluation of a hybrid
scheduling methodology that integrates the DLT and the DLS algorithms in a certain manner.

8.1 Processor equivalence principle
The DLT offers the concept of network equivalence similar to other linear models,
such as the Markovian queuing theory. In the equivalent network model, it is possible to
represent a complex network with an equivalent network element. As an example, if α1
and α2 represent the load fractions of two processors p1 and p2, respectively, then p1 and
p2 can be replaced by an equivalent processor p1.2 whose runtime is given by:

\begin{align*}
T_{1.2} &= \sum_{\forall\, j \text{ in } r(p_0, p_{1.2})} \left( L_j + (\alpha_1 + \alpha_2) \cdot T^{cm}_j \right) \tag{8.1} \\
        &\quad + (\alpha_1 + \alpha_2) \cdot T^{cp}_{1.2} \tag{8.2}
\end{align*}

such that $T_{p_{1.2}} = T_{p_1} = T_{p_2}$. It must be noted that combining two processors into one
equivalent processor is possible due to the optimality principle. Using the same concept as
mentioned above, it is possible to replace an entire system with one equivalent element.

8.1.1 Star network

Figure 8.1
A star network and the corresponding equivalent network

Figure 8.1 shows a star network and the corresponding equivalent network obtained by
collapsing the root node and all the children nodes into a single equivalent node. The star
network consists of one root node and n children nodes. The equivalent network consists
of one node. The two networks are equivalent and have the same processing power. The
processing power of the equivalent processor peq is unknown and needs to be evaluated.
Let us consider a generic star network with np processors. In a star network, the root
node is directly connected to all its children nodes. More formally, a system of np
processors $(p_0, p_1, \ldots, p_{n_p-1})$ and np − 1 links $(ln_1, ln_2, \ldots, ln_{n_p-1})$ is
said to be interconnected in a star fashion if and only if a communication link $ln_i$ exists
between the nodes $p_0$ and $p_i$, where $0 < i < n_p$. The ratings of the processors
$(p_0, p_1, \ldots, p_{n_p-1})$ are denoted by $(R_0, R_1, \ldots, R_{n_p-1})$, and the latency and
bandwidth of each network link $ln_j$ are denoted by $L_j$ and $B_j$, respectively, where
$1 \le j < n_p$.
When the workload is scheduled on a star network using the master-workers paradigm,
the root node is the natural choice for the master and the children nodes act as worker
processors. The execution times of the master and the worker processors when the load is
partitioned according to the optimality principle are given by:

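\[
T_i =
\begin{cases}
\alpha_0 \cdot T^{cp}_0 & \text{if } i = 0 \\
(L_i + \alpha_i \cdot T^{cm}_i) + \alpha_i \cdot T^{cp}_i & \text{if } i \neq 0
\end{cases}
\tag{8.3}
\]

and

\[
\sum_{i=0}^{n_p - 1} \alpha_i = 1,
\tag{8.4}
\]

where $0 \le i < n_p$. Based on Equation (8.3), the relation between $\alpha_0$ and $\alpha_k$, where
$1 \le k < n_p$, is given by:

\[
\alpha_k = \frac{\alpha_0 \cdot T^{cp}_0}{T^{cp}_k + T^{cm}_k} - \frac{L_k}{T^{cp}_k + T^{cm}_k}.
\tag{8.5}
\]

Substituting for $\alpha_k$ in Equation (8.4), we get

\[
\alpha_0 = \frac{1 + \displaystyle\sum_{k=1}^{n_p - 1} \frac{L_k}{T^{cp}_k + T^{cm}_k}}
                {1 + \displaystyle\sum_{k=1}^{n_p - 1} \frac{T^{cp}_0}{T^{cp}_k + T^{cm}_k}}.
\tag{8.6}
\]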
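Equations (8.3) to (8.6) can be checked numerically. The following minimal sketch uses illustrative function names and arbitrary input values, none of which come from the study; the lists are indexed by processor, with index 0 denoting the master and its link entries left unused.

```python
def star_load_fractions(tcp, tcm, lat):
    """Load fractions for a star network with a normalized total load of 1.

    tcp[i]: time to compute a unit load on p_i (index 0 is the master);
    tcm[i], lat[i]: per-unit-load transfer time and latency of the link to
    p_i (index 0 is unused).
    """
    workers = range(1, len(tcp))
    # Equation (8.6): load fraction of the master.
    numerator = 1 + sum(lat[k] / (tcp[k] + tcm[k]) for k in workers)
    denominator = 1 + sum(tcp[0] / (tcp[k] + tcm[k]) for k in workers)
    alpha0 = numerator / denominator
    # Equation (8.5): load fraction of each worker in terms of alpha0.
    return [alpha0] + [(alpha0 * tcp[0] - lat[k]) / (tcp[k] + tcm[k]) for k in workers]


def finish_times(alphas, tcp, tcm, lat):
    """Equation (8.3): execution time of the master and of each worker."""
    return [alphas[0] * tcp[0]] + [
        lat[i] + alphas[i] * (tcm[i] + tcp[i]) for i in range(1, len(alphas))
    ]
```

For any inputs that yield positive fractions, the returned fractions sum to 1 and all processors finish at the same time, which is the property that the collapsed equivalent node relies on.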

One way to obtain an equivalent network from a star network is to collapse all the
compute nodes into a single equivalent node. The resulting equivalent network will compute nt
tasks in the same amount of time as any node $p_i$ in the non-collapsed star network would
incur in processing $\alpha_i$ tasks. Using Equation (8.6), the time required to compute a task on
the equivalent processor is given by:

\[
T^{cp}_{eq} = \alpha_0 \cdot T^{cp}
\tag{8.7}
\]

8.1.2 3D torus network

For the 3D torus network, we apply the same technique as for the star network and collapse
the entire network into a single equivalent node. As in the case of the star network, the
resulting equivalent network will compute nt tasks in the same amount of time as any node
$p_i$ in the non-collapsed 3D torus network would incur in processing $\alpha_i$ tasks. Recall from
Section 6.4 that, in the case of a homogeneous 3D torus network in the absence of congestion,
the following relation holds:

\[
\alpha_0 + \sum_{h=1}^{|H|} n_h \cdot \left( \frac{\alpha_0 \cdot T^{cp}}{h \cdot T^{cm} + T^{cp}}
  - \frac{h \cdot L}{h \cdot T^{cm} + T^{cp}} \right) = n_t.
\tag{8.8}
\]

When the load fraction is normalized, $n_t = 1$. After rearranging Equation (8.8), $\alpha_0$ can be
expressed as:

\[
\alpha_0 = \frac{1 + \displaystyle\sum_{h=1}^{|H|} \frac{h \cdot n_h \cdot L}{h \cdot T^{cm} + T^{cp}}}
                {1 + \displaystyle\sum_{h=1}^{|H|} \frac{n_h \cdot T^{cp}}{h \cdot T^{cm} + T^{cp}}}
\tag{8.9}
\]

Equation (8.9) expresses the load fraction of processor $p_0$ in terms of all known quantities.
Using Equation (8.9), the time required to compute a task on the equivalent processor is
given by:

\[
T^{cp}_{eq} = \alpha_0 \cdot T^{cp}
\tag{8.10}
\]
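Equations (8.8) to (8.10) can likewise be evaluated numerically. The sketch below rests on an assumption that this section does not restate: $n_h$ is taken to be the number of nodes located h hops from the master at (0, 0, 0), with hops counted along the shorter way around each ring. The function names and the input values in the usage note are illustrative only.

```python
from collections import Counter
from itertools import product


def hop_histogram(dims):
    """n_h: number of nodes at h hops from the master (0, 0, 0) on a torus."""
    hist = Counter()
    for coords in product(*(range(n) for n in dims)):
        hops = sum(min(c, n - c) for c, n in zip(coords, dims))
        if hops > 0:
            hist[hops] += 1
    return dict(hist)


def torus_alpha0(n_h, tcp, tcm, lat):
    """Equation (8.9): load fraction of the master for a normalized load."""
    numerator = 1 + sum(h * n * lat / (h * tcm + tcp) for h, n in n_h.items())
    denominator = 1 + sum(n * tcp / (h * tcm + tcp) for h, n in n_h.items())
    return numerator / denominator


def alpha_at_hops(alpha0, h, tcp, tcm, lat):
    """Load fraction of a node h hops away: the bracketed term of Equation (8.8)."""
    return (alpha0 * tcp - h * lat) / (h * tcm + tcp)
```

For a 4 × 4 × 4 torus, `hop_histogram((4, 4, 4))` gives 6, 15, 20, 15, 6, and 1 nodes at 1 to 6 hops, and the time per task of the equivalent processor in Equation (8.10) is `torus_alpha0(...) * tcp`.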

8.2 Proof by induction for star networks
In this section, we provide a proof by induction for $T^{cp}_{eq}$ (expressed by Equation (8.7))
for homogeneous star networks, where all processors in a network have the same rating R and
all the network links have the same latency and bandwidth values, L and B, respectively.
Proof: For any number of children nodes $n_c \in \mathbb{Z}^+$, the time to compute a task on the
equivalent processor obtained by collapsing all the children nodes is given by:

\[
T^{cp}_{eq} = \alpha_0 \cdot T^{cp},
\tag{8.11}
\]

where $\alpha_0$ is the load fraction of processor $p_0$ in the non-collapsed network.
Base case: This case occurs when $n_c = 1$, such that the network consists of two processors
$p_0$ and $p_1$. If $\alpha_0$ and $\alpha_1$ represent their load fractions, then the following
relations hold:

\[
T_0 = \alpha_0 \cdot T^{cp}, \text{ and}
\tag{8.12}
\]

\[
T_1 = L + \alpha_1 \cdot (T^{cp} + T^{cm}).
\tag{8.13}
\]

The equivalent processor $p_{eq}$ is obtained by combining the two processors $p_0$ and $p_1$ such
that the overall load is processed in the same amount of time, i.e., $T_{eq} = T_0 = T_1$.
Since $\alpha_0 + \alpha_1 = 1$, the equivalent processor processes the entire normalized load, and
hence $T_{eq} = T^{cp}_{eq}$. Equating $T_{eq}$ to $T_0$ in Equation (8.12), we have

\[
T^{cp}_{eq} = \alpha_0 \cdot T^{cp}.
\tag{8.14}
\]

Inductive hypothesis: For any k ∈ Z+ , let us suppose the time required to compute a
task on the equivalent processor peq as denoted by Equation (8.7) is true.
Induction step: Let us suppose one more child node pk+1 is added to the network
already containing k children nodes, such that nc = k + 1. The star network with k + 1
children nodes can be viewed as a star network with two nodes, namely, the root node peqk
and the child node pk+1 , where peqk is the equivalent node obtained by collapsing the star
network with k children nodes. If αk and αk+1 represent the load fractions of processors
peqk and pk+1 , respectively, then from the base case, T cpeqk+1 is given by:
\[ T_{cp}^{eq_{k+1}} = \alpha_k \cdot T_{cp}^{eq_k}. \tag{8.15} \]
From the inductive hypothesis, T cpeqk is given by:
\[ T_{cp}^{eq_k} = \frac{\alpha_0}{\alpha_k} \cdot T_{cp}. \tag{8.16} \]

Substituting Equation (8.16) in Equation (8.15), we have:
\[ T_{cp}^{eq_{k+1}} = \alpha_0 \cdot T_{cp}. \tag{8.17} \]
Thus, Equation (8.11) holds for nc = k + 1 and the proof is complete.

8.3 Hybrid scheduling
Hybrid scheduling integrates the DLT and the DLS algorithms in a certain manner in
order to provide better performance. One of the reasons to apply DLT to schedule arbitrary
divisible workloads is its linear and deterministic nature which makes performance prediction possible. As long as the model parameters used for performance prediction remain the
same during the execution of the application, the predicted performance will concur with
the actual performance. However, even in the case when the model parameters vary, as we
saw in Chapter 6, it is possible to predict the performance bounds provided the variation
pattern is known. Dynamic loop scheduling (DLS) algorithms, on the other hand, are designed to address load imbalances that arise during the execution of the application from
different algorithmic and systemic sources. In this chapter, we explore the possibility of
integrating the DLT and the DLS algorithms, which we refer to as hybrid scheduling.
Hybrid scheduling, if effective, can enhance the applicability of the DLT algorithms even
in scenarios characterized by variations in the DLT model parameters. The performance
provided by hybrid scheduling is expected to be closer to the predicted lower bound
than that obtained by employing DLT algorithms in isolation. Thus, hybrid scheduling
can leverage the strength of DLS algorithms and can help improve the robustness of DLT
algorithms by providing better performance in combination.

Figure 8.2
A schematic diagram illustrating hybrid scheduling

Figure 8.2 shows the schematic diagram of hybrid scheduling. Hybrid scheduling
follows the master-workers paradigm where the workers are compound resources instead
of single computing resources. A compound resource is defined as a collection of several single computing resources interconnected by network links in a certain topological
fashion. For instance, a compound resource can follow a star, 3D torus, fat-tree or any
other topology. It is possible that the compound resources are geographically distributed
like in the case of Grid computing or they can be within a single administrative domain.
We refer to this type of network that is formed as a collection of compound resources as
hybrid networks. In the hybrid scheduling, the master node is responsible for scheduling
the arbitrarily divisible workload. The master node knows about the collection of compound resources that are available to process the workload. The master node employs DLT
to partition the workload and in order to do so, it needs to be cognizant of the computing
power of the compound resources. Since the worker nodes are compound resources, their
computing power is evaluated using the processor equivalence concept of DLT. Sections
8.1.1 and 8.1.2 show how to apply the processor equivalence concept to evaluate the computing power of a compound resource for star and 3D torus topologies, respectively. Once
its computing power is evaluated, the compound resource notifies the master node of
this information. Using the computing power of the compound resources, the master node
applies DLT to partition the workload and distributes the partitions. The master node is unaware
of the scheduling policies used within a compound resource. A compound resource is free
to use any scheduling policy that it deems fit based on the workload and system
conditions that are local to it.
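The top-level step described above can be sketched as follows. This is a deliberately simplified illustration rather than the partitioning model used in this work: it ignores the communication costs over the backbone links, so that equal finish times reduce to load fractions inversely proportional to T cpeq. The function name and the example inputs are hypothetical.

```python
from typing import List, Sequence


def top_level_fractions(t_cp_eq: Sequence[float]) -> List[float]:
    """Load fraction per compound resource, inversely proportional to T cpeq.

    Simplifying assumption (not the full DLT model): backbone communication
    is ignored, so all compound resources finish together when
    fraction[i] * t_cp_eq[i] is the same for every i.
    """
    speeds = [1.0 / t for t in t_cp_eq]
    total = sum(speeds)
    return [s / total for s in speeds]


# Hypothetical hybrid network of four compound resources of two different
# sizes; the T cpeq values are taken from the CCR = 0.002 rows of Table 8.2.
fractions = top_level_fractions([0.042, 0.042, 0.084, 0.084])
```

Each compound resource then schedules its own share internally with whichever policy it selects, for example DLT or AF; the master only needs the reported T cpeq values.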
When a compound resource receives the workload for processing from the master node,
one of the nodes within the compound resource acts as a master node and the rest of the
nodes are considered worker nodes. A master-workers paradigm is implemented within
a compound resource. The master node inside a compound resource can employ any
scheduling methodology that it deems appropriate for the current workload and system
conditions. In this work, we consider two choices, namely, the DLT and the AF algorithms.
The choice of AF algorithm is motivated by its strong theoretical foundation among all the
DLS algorithms. Scheduling decisions within a compound resource can also be enhanced
by employing a portfolio-based approach that will dynamically select an appropriate DLS
technique based on the given workload and the current system characteristics. Such an
approach is investigated in [73] where a DLS algorithm is dynamically selected based on
the empirical prediction model built using supervised machine learning techniques.

8.4 Simulation results and analysis
The design of the simulation experiments for evaluating the performance of hybrid
scheduling is presented in Table 8.1. The three applications considered for evaluation are the computation-bound, the intermediate, and the communication-bound applications that were described in Section 4.1.3. Three types of hybrid networks are considered,
each with four compound resources, such that the first hybrid network is comprised of
four star networks, the second hybrid network is comprised of four 3D torus networks, and
the third hybrid network is comprised of two star and two 3D torus networks. In each
compound resource, the processor availability is varied based on the variation patterns
described in Figures 6.1 - 6.4. As a result of the perturbations, the model parameters used
in DLT change during the execution of the application.
Table 8.1
Design of experiments to evaluate hybrid scheduling

Parameter                                  Value
Divisible tasks nt (#)                     computation-bound: {0.25, 0.5, 1.0} · 10^6
                                           intermediate: {1.0, 2.0, 4.0} · 10^6
                                           communication-bound: {1.0, 2.0, 4.0} · 10^6
System size np (processors)                256 - 4,096
System topology                            star and 3D torus networks
Processor power rating (GFLOPS)            2.0
Link latency L (µseconds) [6]              8.0
Link bandwidth B (GB/s) [6]                1.5
Backbone link latency L (µseconds) [6]     8.0
Backbone link bandwidth B (GB/s)           15
Variations per perturbation (#)            4 (as shown in Figures 6.1-6.4)
Compound resources per hybrid network      4
Types of hybrid networks                   3 (all star, 3D torus, star/3D torus networks)

Figures 8.3 - 8.11 show γR for the computation-bound, intermediate, and communication-bound applications on the three hybrid networks for three problem sizes. γR is defined as the
ratio of Tpar when perturbations are present in the performance impacting factor (processor
availability, R) to Tpar in the absence of perturbations. In all the figures, the cyan bars
represent the performance of AF, the yellow bars represent the performance of DLT, and
the green and the red bars represent the theoretical lower and upper bounds, respectively.
The theoretical bounds were obtained as described in Table 6.2. The time
required to compute a task on a compound resource for the three applications for different
system sizes is provided in Table 8.2, and it is the same for both star and 3D torus networks.
This is because congestion is not modeled in this study for 3D torus networks. Intuitively,
if the links and the compute nodes in both networks have the same characteristics, then
a star network is expected to have a lower T cpeq than a 3D torus network of the same size, since
congestion is absent in star networks.
Table 8.2
Time (in sec) to process a task on an equivalent processor - T cpeq

topology - CCR       np = 64    128       256              512              1024
star - 0.002         0.672      0.336     0.168            0.084            0.042
3D torus - 0.002     0.672      0.336     0.168            0.084            0.042
star - 0.579         0.0002     0.0001    5.579 · 10^-5    3.188 · 10^-5    1.994 · 10^-5
3D torus - 0.579     0.0002     0.0001    5.56 · 10^-5     3.184 · 10^-5    1.993 · 10^-5
star - 0.16          0.0016     0.0009    0.0005           0.00026          0.00014
3D torus - 16        0.0016     0.0009    0.0005           0.00026          0.00014

8.4.1 Computation-bound application

In Figures 8.3 - 8.5, the performance of the DLT and AF algorithms is within the theoretically predicted bounds. For all problem and system sizes considered, the performance
of AF is better than the performance of DLT. The performance improvement of AF over
DLT is in the range [2.85% − 11.25%]. The lowest improvement was seen on the star/3D torus hybrid network when the problem size is 5.0 · 10^5 and the system size is 4,096. The
highest improvement was seen on the star hybrid network when the problem size is 5.0 · 10^5
and the system size is 1,024.

Figure 8.3
Performance of hybrid scheduling on a hybrid network comprised of star topology for the
computation-bound application

Figure 8.4
Performance of hybrid scheduling on a hybrid network comprised of 3D torus topology
for the computation-bound application

In general, for a given hybrid network and a given system size np , the performance
improvement of AF over DLT is independent of the problem size, nt . This is because,
as the problem size increases, the computational cost associated with computing
the tasks also increases proportionally, such that the ratio of the computation cost to the communication cost remains constant. For instance, in the case of a hybrid network comprising all
star networks, the ratio of computation cost to the communication cost is 51.42% for all
problem sizes when the system size is np = 256. Note that this ratio decreases for
a given problem size as the system size increases, and it is only 3.21% when np =
4096.

In general, for a given problem size, the performance improvement of AF over DLT
increases with the system size up to a point and then drops. For instance,
in the case of a hybrid network comprising 3D torus networks, the performance improvements when np = 256, 512, and 1024 are 9.68%, 10.83% and 10.81%, respectively,
when nt = 1.0 · 10^6 . A further increase in np to 2048 and 4096 causes the performance
improvement to drop to 9.48% and 7.23%, respectively. This is attributed to the decrease
in the DLT load fraction (the DLT chunk size gets smaller), which causes a decrease in the load
imbalance.

Figure 8.5
Performance of hybrid scheduling on a hybrid network comprised of star and 3D torus
topologies for the computation-bound application

8.4.2 Intermediate application

In Figures 8.6 - 8.8, the performance of the DLT and AF algorithms is marginally below
the theoretically predicted lower bound. For theoretical analysis, the load fractions are used
as floating point values, whereas, in simulation, the load fractions are rounded to integral
values since a task is the smallest unit of execution and cannot be further subdivided.
This rounding accounts for the marginal difference between the performance of DLT and the
theoretical lower bound; the percentage difference is less than 1.14% in all cases. The
percentage difference is less than 1.9% in all cases for the AF algorithm.
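The rounding step can be illustrated with a short sketch. The rounding rule used in the simulations is not specified here, so the largest-remainder rule below is only an assumption chosen because it preserves the total task count; the function name is hypothetical.

```python
import math
from typing import List, Sequence


def round_load_fractions(fractions: Sequence[float], n_tasks: int) -> List[int]:
    """Convert real-valued load fractions into whole task counts.

    Assumed rule (largest remainder): give every processor the floor of its
    ideal share, then hand the leftover tasks to the processors whose ideal
    shares had the largest fractional parts.
    """
    ideal = [f * n_tasks for f in fractions]
    counts = [math.floor(x) for x in ideal]
    leftover = n_tasks - sum(counts)
    by_remainder = sorted(range(len(ideal)),
                          key=lambda i: ideal[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Ideal shares of 3.5, 2.1 and 1.4 tasks become 4, 2 and 1 whole tasks.
counts = round_load_fractions([0.5, 0.3, 0.2], 7)
```

Because each processor then holds slightly more or less than its ideal share, the simulated finish times can deviate marginally from the values predicted with real-valued fractions.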

Figure 8.6
Performance of hybrid scheduling on a hybrid network comprised of star topology for the
intermediate application

Figure 8.7
Performance of hybrid scheduling on a hybrid network comprised of 3D torus topology
for the intermediate application

For all the problem and system sizes considered, the performance of AF is better than
the performance of DLT. The performance improvement of AF over DLT is in the range
[0.28% − 8.01%]. As in the case of the computation-bound application, for a given hybrid
network and a given system size np , the performance improvement of AF over DLT is
independent of the problem size, nt , for the same reason. However, in this case, the ratio
of computation cost to the communication cost is at most 36.88% for all problem
sizes and types of hybrid networks when the system size is np = 256.
Unlike in the case of the computation-bound application, the performance improvement of AF over DLT decreases as the system size increases for all the problem
sizes. This holds even though the problem sizes for this application are larger than the
problem sizes for the computation-bound application. This is attributed to the coarse-grained
nature of the tasks of this application, which results in low execution efficiency.

Figure 8.8
Performance of hybrid scheduling on a hybrid network comprised of star and 3D torus
topologies for the intermediate application

8.4.3 Communication-bound application

In Figures 8.9 - 8.11, the performance of the AF algorithm is marginally below the theoretically predicted lower bound for the same reason as in the case of the intermediate
application. The percentage difference is less than 0.85% in all cases. For all the problem and system sizes considered, the performance of AF is better than the performance of
DLT. The performance improvement of AF over DLT is in the range [0.59% − 3.61%]. The
percentage improvement is the lowest among the three applications since the tasks of this
application are coarser than the tasks of the other two applications.

Figure 8.9
Performance of hybrid scheduling on a hybrid network comprised of star topology for the
communication-bound application

As in the case of the other two applications, for a given hybrid network and a given system
size np , the performance improvement of AF over DLT is independent of the problem
size, nt , for the same reason. However, in this case, the ratio of computation cost to the
communication cost is at most 12.31% for all problem sizes and types of hybrid
networks when the system size is np = 256. The value of this ratio is the lowest among the
three applications.
Similar to the case of the intermediate application but unlike in the case of the computation-bound application, the performance improvement of AF over DLT decreases as the system size increases for all the problem sizes. This holds even though the problem
sizes for this application are larger than the problem sizes for the computation-bound application. This is attributed to the coarse-grained nature of the tasks of this application, which
results in low execution efficiency.

Figure 8.10
Performance of hybrid scheduling on a hybrid network comprised of 3D torus topology
for the communication-bound application

8.5 Conclusions
In this chapter, we explored the possibility of integrating the DLT and the DLS algorithms
in a certain manner. Such an integrated scheduling approach, called hybrid scheduling, combines the advantages of DLT and DLS algorithms. Integrating DLS with
DLT yields better performance than is possible by employing DLT in isolation and
improves the robustness of DLT. On the other hand, integrating DLT with the DLS algorithms makes performance prediction possible when a DLS algorithm is employed
for scheduling.

Figure 8.11
Performance of hybrid scheduling on a hybrid network comprised of star and 3D torus
topologies for the communication-bound application

An approach to integrate the DLT and DLS algorithms is to apply them individually
in a two-level hierarchical fashion where DLT algorithm is applied to schedule workload
at the top level and DLS algorithm is employed at the bottom level. Hybrid scheduling
can be employed to schedule workload in a Grid computing environment where DLT can
be employed to partition and assign the workload to individual resources in a computing
grid and DLS can be employed to schedule the workload within a grid resource. Another
environment where hybrid scheduling can be employed is within a computing center that
hosts a number of computing resources. Applying DLT at the top level requires knowledge
of the overall computing power of individual resources at the bottom level. The processor
equivalence principle offered by DLT is leveraged for this purpose. Through the use of this
principle, we showed how the network elements in two topologies, namely, the star and
the 3D torus topology, can be collapsed into a single equivalent element with equivalent
processing power. This information is then used at the top level to partition the workload
based on DLT. When a computing resource at the bottom level receives the workload for
computing, it can employ any DLS scheduling algorithm.
Simulation experiments were conducted on networks interconnected in star and 3D
torus fashion using three applications, namely, the computation-bound, the intermediate,
and the communication-bound applications as described in Section 4.1.3. At the bottom level, perturbations were injected in the compute nodes to simulate the presence of other jobs in the
system that compete for the CPU cycles. In most cases, the performance of DLT and AF
was within the theoretically predicted bounds, and in a few cases their performance was
below the predicted lower bound. This is because in theoretical analysis, the load fractions
are used as floating point values, whereas, in simulation, the load fractions are rounded
to integral values since a task is the smallest unit of execution and cannot be further subdivided. In all cases, employing AF at the bottom level gave better performance than
employing DLT at the bottom level. The performance improvement of AF over DLT is in
the ranges [2.85% − 11.25%] for the computation-bound application, [0.28% − 8.01%] for
the intermediate application, and [0.59% − 3.61%] for the communication-bound application. These results demonstrate that performance prediction and performance improvement
are possible when DLT and DLS algorithms are integrated, and underscore the utility of
hybrid scheduling.

CHAPTER 9
CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

In this last chapter of the dissertation, an overall evaluation of the research work is
presented. The future research directions are also outlined.

9.1 Accomplishments
The primary goal of this dissertation is to analyze and evaluate the performance of DLT
[21] and DLS algorithms [12] in parallel and distributed computing environments with
emphasis on developing an integrated scheduling approach that combines the advantages
of DLT and DLS algorithms.
In section 1.4, a number of objectives to be fulfilled in order to achieve the primary goal
of this dissertation are listed. Objective 1 is addressed in chapter 2, where the related areas of research, namely, the divisible load theory algorithms, the dynamic loop scheduling
algorithms, and the robustness of scheduling algorithms are surveyed in depth. In chapter
3, various performance evaluation techniques and computing environments are discussed,
followed by the design of a simulation framework that is developed as part of this dissertation. In chapter 4, the modeling of the applications and the platforms used in this study
is described. Chapters 3 and 4 form the basis for the performance evaluation of DLT and
DLS algorithms.

Objective 2 is addressed in chapter 5, where a scalability study of DLT and DLS algorithms is presented. Objective 3 is addressed in chapter 6, where the robustness of DLT
algorithms is investigated. Objective 4 is addressed in chapter 7, where the effect of network topology on the performance of DLT and DLS algorithms is studied. Objectives 5
and 6 are addressed in chapter 8, where the design, analysis, and evaluation of an integrated scheduling approach are presented. Objective 7 is addressed in this chapter, where
the conclusions regarding the accomplishments along with the perspectives on the possible
future directions are outlined.
The following conference articles had been published at the time of writing this dissertation.
1. Towards the Scalability of Dynamic Loop Scheduling Techniques via Discrete
Event Simulation - In the Proceedings of the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, IEEE Computer Society Press.
2. Analyzing the Robustness of Scheduling Algorithms Using Divisible Load Theory on Heterogeneous Systems - In the Proceedings of the IEEE International Symposium on Parallel and Distributed Computing, ISPDC 2013, IEEE Computer Society Press.
3. A Comparative Study of Two Common Algorithmic Approaches - In the Proceedings of the International Conference on Parallel Processing, ICPP 2013, IEEE
Computer Society Press.
4. Robustness Prediction and Evaluation of Divisible Load Scheduling on Computing Systems with Unpredictable Variations - In the Proceedings of the IEEE
International Symposium on Parallel and Distributed Computing, ISPDC 2014, IEEE
Computer Society Press.
5. Scalability Analysis and Evaluation of Divisible Load Scheduling - In the Proceedings of the International Conference on Parallel Processing, ICPP 2014, IEEE
Computer Society Press.

9.2 Summary and lessons learned
Existing studies demonstrating the effectiveness of DLS algorithms used smaller problem
and system sizes. In this work, the scalability of DLS algorithms was studied in the
context of larger problem and system sizes. The AF algorithm, which has the strongest
theoretical foundation among all the DLS algorithms, performed the best. An interesting
observation was the performance of the FSC algorithm, which is an ideal trade-off between
static chunking and self scheduling. The performance of FSC was on par with the AF algorithm when an optimal fixed size chunk can be found. However, identifying an optimal
fixed size chunk is problem specific.
The scalability of DLT algorithms was analyzed using two NAS benchmarks, namely,
the EP and the IS benchmarks. The EP benchmark is an ideal application for scheduling
via DLT since it follows the bag of tasks (BoT) model. The IS benchmark has three computation phases interleaved with three communication phases. The nature of the problem
is such that the communication phases involve all-to-all communication, which cannot proceed before the previous computation phase is complete. Identifying the load fractions
for the IS benchmark is more computationally intensive than it is for the EP benchmark
because of the complex nature of the IS benchmark. Though, in principle, DLT can be
applied to solve problems similar to the IS benchmark, problems like the EP benchmark are more
amenable to scheduling via DLT.
The robustness of DLT algorithms when used to schedule a computation-bound, an
intermediate, and a communication-bound application on a homogeneous 3D torus network was investigated using the FePIA procedure [8]. Perturbations were injected in the
performance impacting factors, namely, the processor availability and the network link
bandwidth, both individually as well as in combination, and their impact on the robustness of DLT was studied. The results indicate that the robustness observed via simulation
is always within the analytically predicted range for the given applications, system, and
perturbations. The results emphasize the predictive power of the divisible load theory numerical model even in environments where the actual values of the system parameters are
different from the nominal values.
The effect of topology on the performance of DLT and AF algorithms was studied in
the context of the master-workers paradigm on three topologies, namely, star, 3D torus, and
fat-tree. Routing algorithms, which are responsible for routing messages in a network,
depend on the network topology. The nature of routing algorithms, combined with large
volumes of data transfer and multiple network flows at the same time instant, gives rise to
congestion in the network links. A new term called the congestion factor cf , which
represents the number of network flows that compete for the network link bandwidth, was
introduced. The value of cf is 1 for the star topology, (nx − 1) · ny · nz in the case of
XYZ routing for the 3D torus topology, and (np − 1) for the fat-tree topology. A value
of 1 for cf indicates the absence of congestion, and any value greater than 1 indicates
the presence of congestion. Congestion is absent in star networks, and hence DLT and
AF yielded the same performance. However, on 3D torus and fat-tree networks, congestion
is present and its impact is greater on DLT than on AF due to the presence of one-to-all
communication that distributes the entire workload to the worker processors. Multi-round
scheduling and round robin (RR) dimension order routing, which makes effective use of
all six network links that are connected to a compute node on a 3D torus network,
were evaluated for congestion alleviation in the network links. On 3D torus networks,
the multi-round scheduling scheme and the RR dimension order routing scheme yielded
performance improvements in the range [−32.91% − 129.07%] and [−0.65% − 142.38%],
respectively, suggesting the effectiveness of these approaches for the 3D torus topology.
On fat-tree networks, the multi-round scheduling scheme yielded marginal performance
improvement in the range [0.001% − 8.63%]. On fat-tree networks, the link connecting the
master node to the lowest level switch is the bottleneck link and the entire data travels over
this link. As a result, the master-workers paradigm does not yield significant performance.
A topic for further study is to explore the effectiveness of the masters-workers paradigm for
the fat-tree topology.
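The congestion factor values quoted above can be written down directly. The following is a minimal sketch; the function names, the topology labels, and the argument names are hypothetical, and the 3D torus expression applies to XYZ routing only.

```python
from typing import Optional, Tuple


def congestion_factor(topology: str,
                      dims: Optional[Tuple[int, int, int]] = None,
                      n_p: Optional[int] = None) -> int:
    """Congestion factor cf: number of flows competing for a link's bandwidth.

    topology: "star", "3d_torus" (XYZ routing), or "fat_tree".
    dims:     (nx, ny, nz) torus dimensions, required for "3d_torus".
    n_p:      number of processors, required for "fat_tree".
    """
    if topology == "star":
        return 1
    if topology == "3d_torus":
        if dims is None:
            raise ValueError("dims=(nx, ny, nz) is required for a 3D torus")
        nx, ny, nz = dims
        return (nx - 1) * ny * nz
    if topology == "fat_tree":
        if n_p is None:
            raise ValueError("n_p is required for a fat-tree")
        return n_p - 1
    raise ValueError(f"unknown topology: {topology}")


def is_congested(cf: int) -> bool:
    """A value of 1 means no congestion; anything greater means congestion."""
    return cf > 1
```

For example, an 8 x 8 x 8 torus gives cf = 7 · 8 · 8 = 448, while a fat-tree with 256 processors gives cf = 255.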
The possibility of integrating DLT and DLS algorithms (called hybrid scheduling) in
a certain manner was explored. One way to integrate both DLT and DLS algorithms is to
have a two-level scheduling where DLT is employed to schedule workload at the top level
and DLS is employed to schedule workload at the bottom level. The bottom level resources
can either be geographically distributed as in the case of Grid computing, or they can be
distributed within a single administrative domain. The advantage of hybrid scheduling is
that it can improve the robustness of DLT, and hence it can improve the applicability of
DLT in environments characterized by unpredictable variations in the system load and network links, where traditionally it has not been applied. The processor equivalence principle of
DLT was applied to identify the load fraction of each bottom level resource, where AF was
used to schedule the workload. The performance of hybrid scheduling was evaluated on
three hybrid networks using a computation-bound, an intermediate, and a communication-bound application. In all cases, employing AF at the bottom level gave better performance
than employing DLT at the bottom level. The performance improvement of AF over DLT is
in the range [2.85%−11.25%] for the computation-bound application, [0.28%−8.01%] for
the intermediate application, and [0.59% − 3.61%] for the communication-bound application. These results demonstrate that performance prediction and performance improvement
are possible when DLT and DLS algorithms are integrated, and underscore the utility of
hybrid scheduling.

9.3 Future research directions
The goal of this research work is to analyze and evaluate the performance of divisible
load theory and dynamic loop scheduling algorithms in parallel and distributed environments. Based on the perspectives gained during the course of this dissertation research, the
following areas show promise for potential future work.
In order to employ the hybrid scheduling approach proposed in this work, we relied
on the processor equivalence principle offered by DLT to estimate the computing
power of a compound resource. A proof by induction was provided in the case of
the star topology with homogeneous elements. There is considerable scope to improve the
usefulness of the model by considering heterogeneous networks. For the other topology
considered, namely, the 3D torus, the theoretical predictions in general agree with the
simulation results, but a rigorous proof for the equivalent processing power is still required.
The scheduling model of DLT is dependent on the platform topology since the communication costs associated with processing the workload are explicitly modeled. The routing
algorithm is responsible for moving messages between compute nodes in a network and is
dependent on the topology of the underlying network. Related to routing is congestion in
the network links. In this work, congestion was not modeled in case of 3D torus networks
and there is opportunity to improve the usefulness of the model by accounting for it.
In the hybrid scheduling approach proposed in this work, we employed AF as the chosen DLS algorithm inside a compound resource since AF has the strongest theoretical
foundation among the DLS algorithms, given that its model imposes the fewest
constraints. However, as we saw in the scalability study of DLS algorithms (refer to chapter 5), fixed size chunking can also yield performance on par with AF when the optimal
chunk size can be found. A better approach is to choose the DLS algorithm that is most
suitable for scheduling the given workload by employing a portfolio-based approach [73]
that dynamically selects an appropriate DLS algorithm based on the given workload and
system characteristics.
In this research, simulations were employed as a performance evaluation technique
with a primary goal of conducting realistic simulations. In order to achieve this goal, the SimGrid
simulation framework, which is a validated framework, was used. In addition, the
application and platform characteristics used in the simulations were modeled after real
applications and platforms, respectively. In general, results from the simulations provide
a close fit to the theoretically predicted values. This gives us high confidence in the applicability and the validity of the results obtained. An interesting direction for future work is to evaluate
the concepts that were studied in this research on real platforms using real applications. It
was reported in [58] that the difference between theoretical predictions and experimental
results was in the range [5% − 10%].

REFERENCES

[1] “Cray XK7 Specifications,”.
[2] “Top 500 Supercomputer Sites,”.
[3] M. Abdullah, M. Othman, H. Ibrahim, and S. Subramaniam, “Optimal Workload Allocation Model for Scheduling Divisible Data Grid Applications,” Future Generation
Computer Systems, vol. 26, no. 07, 2010, pp. 971–978.
[4] D. Abrahams and A. Gurtovoy, C++ Template Metaprogramming: Concepts, Tools,
and Techniques from Boost and Beyond, Addison-Wesley Professional, 2004.
[5] N. R. Adiga, M. A. Blumrich, D. Chen, P. Coteus, A. Gara, M. E. Giampapa, P. Heidelberger, S. Singh, B. D. Steinmacher-Burow, T. Takken, et al., “Blue Gene/L Torus
Interconnection Network,” IBM Journal of Research and Development, vol. 49, no.
2.3, 2005, pp. 265–276.
[6] S. Alam, R. Barrett, M. Bast, M. Fahey, J. Kuehn, C. McCurdy, J. Rogers, P. Roth,
R. Sankaran, J. S. Vetter, P. Worley, and W. Yu, “Early Evaluation of IBM BlueGene/P,” Int. Conf. for High Performance Computing, Networking, Storage and
Analysis, 2008, pp. 1–12.
[7] S. Ali, J.-K. Kim, H. J. Siegel, and A. A. Maciejewski, “Static Heuristics for Robust
Resource Allocation of Continuously Executing Applications,” Journal of Parallel
and Distributed Computing, vol. 68, no. 8, 2008, pp. 1070 – 1080.
[8] S. Ali, A. A. Maciejewski, H. J. Siegel, and J.-K. Kim, “Measuring the Robustness
of a Resource Allocation,” IEEE Transactions on Parallel and Distributed Systems,
vol. 15, 2004, pp. 630–641.
[9] S. Ali, A. A. Maciejewski, H. J. Siegel, and J.-K. Kim, “Robust Resource Allocation for Sensor-Actuator Distributed Computing Systems,” Proc. 2004 Int. Conf. on
Parallel Processing. 2004, pp. 178–185, IEEE Computer Society.
[10] M. Balasubramaniam, I. Banicescu, and F. Ciorba, “Scalability Analysis and Evaluation of Divisible Load Scheduling,” 43rd Int. Conf. on Parallel Processing Workshops
(ICPP). IEEE Computer Society, Sep 2014, pp. 37–44.

[11] M. Balasubramaniam, N. Sukhija, F. M. Ciorba, I. Banicescu, and S. Srivastava, “Towards the Scalability of Dynamic Loop Scheduling Techniques via Discrete Event
Simulation,” 26th Int. Parallel and Distributed Processing Symposium Workshops &
PhD Forum (IPDPS). IEEE Computer Society, 2012, pp. 1343–1351.
[12] I. Banicescu and R. L. Cariño, “Addressing the Stochastic Nature of Scientific Computations via Dynamic Loop Scheduling,” Electronic Journal on Transactions on
Numerical Analysis, Special Issue on Combinatorial Scientific Computing, 2005, pp.
66–80.
[13] I. Banicescu, F. Ciorba, and R. Cariño, “Towards the Robustness of Dynamic Loop
Scheduling on Large-Scale Heterogeneous Distributed Systems,” Proceedings of the
2009 Eighth International Symposium on Parallel and Distributed Computing. IEEE
Computer Society, 2009, pp. 129–132.
[14] I. Banicescu and V. Velusamy, “Load Balancing Highly Irregular Computations with
the Adaptive Factoring,” Proc. 16th IEEE Int. Parallel and Distributed Processing
Symp. IEEE Computer Society Press, 2002.
[15] O. Beaumont, N. Bonichon, and L. Eyraud-Dubois, “Scheduling Divisible Workloads on Heterogeneous Platforms under Bounded Multi-Port Model,” Proc. 22nd
Int. Parallel and Distributed Processing Symp. IEEE Computer Society, 2008, pp.
1–7.
[16] O. Beaumont, A. Legrand, L. Marchal, and Y. Robert, “Independent and Divisible
Tasks Scheduling on Heterogeneous Star-shaped Platforms with Limited Memory,”
Proc. 13th Euromicro Conf. on Parallel, Distributed and Network-Based Processing.
IEEE Computer Society, 2005, pp. 179–186.
[17] O. Beaumont, A. Legrand, Y. Robert, L. Carter, and J. Ferrante, “Bandwidth-Centric
Allocation of Independent Tasks on Heterogeneous Platforms,” Proc. 16th Int. Parallel and Distributed Processing Symp. 2002, IEEE Computer Society.
[18] O. Beaumont and A. L. Rosenberg, “Link-Heterogeneity vs. Node-Heterogeneity
in Clusters,” Proc. 2010 Int. Conf. on High Performance Computing (HiPC). IEEE
Computer Society, 2010, pp. 1–8.
[19] J. Berlińska and M. Drozdowski, “Scheduling Divisible MapReduce Computations,”
Journal of Parallel and Distributed Computing, vol. 71, no. 03, 2011, pp. 450–459.
[20] R. Bertin, S. Hunold, A. Legrand, and C. Touati, “Fair Scheduling of Bag-of-Tasks
Applications Using Distributed Lagrangian Optimization,” Journal of Parallel and
Distributed Computing, vol. 74, no. 1, 2014, pp. 1914–1929.
[21] V. Bharadwaj, D. Ghose, V. Mani, and T. G. Robertazzi, Scheduling Divisible Loads
in Parallel and Distributed Systems, Wiley-IEEE Computer Society Press, 1996.

[22] V. Bharadwaj, D. Ghose, and T. G. Robertazzi, “Divisible Load Theory: A New
Paradigm for Load Scheduling in Distributed Systems,” Cluster Computing, vol. 6,
2003, pp. 7–17.
[23] V. Bharadwaj, H. F. Li, and T. Radhakrishnan, “Scheduling Divisible Loads in Bus
Networks with Arbitrary Processor Release Times,” Computers & Mathematics with
Applications, vol. 32, no. 7, 1996, pp. 57–77.
[24] L. Bölöni and D. Marinescu, “Robust Scheduling of Metaprograms,” Journal of
Scheduling, vol. 5, no. 5, 2002, pp. 395–412.
[25] A. Burns, S. Punnekkat, B. Littlewood, D. Wright, et al., “Probabilistic Guarantees
for Fault-Tolerant Real-Time Systems,” Design for Validation (DeVa) TR, no. 44,
1997.
[26] R. Buyya and M. Murshed, “GridSim: A Toolkit for the Modeling and Simulation
of Distributed Resource Management and Scheduling for Grid Computing,” Concurrency and Computation: Practice and Experience, vol. 14, no. 13-15, 2002, pp.
1175–1220.
[27] R. L. Cariño and I. Banicescu, “Dynamic Load Balancing with Adaptive Factoring
Methods in Scientific Applications,” The Journal of Supercomputing, vol. 44, April
2008, pp. 41–63.
[28] T. E. Carroll and D. Grosu, “Divisible Load Scheduling: An Approach Using Coalitional Games,” Proc. 6th Int. Symp. on Parallel and Distributed Computing. IEEE
Computer Society, 2007, pp. 258–265.
[29] H. Casanova, A. Legrand, and M. Quinson, “SimGrid: A Generic Framework for
Large-Scale Distributed Experiments,” 10th IEEE Int. Conf. on Computer Modeling
and Simulation, Mar. 2008.
[30] H. Casanova and L. Marchal, A Network Model for Simulation of Grid Application,
Research Report RR-4596, INRIA, 2002.
[31] T. Casavant and J. Kuhl, “A Taxonomy of Scheduling in General-Purpose Distributed
Computing Systems,” IEEE Transactions on Software Engineering, vol. 14, no. 2,
Feb. 1988, pp. 141–154.
[32] D. M. Chiu, “Some Observations on Fairness of Bandwidth Sharing,” 5th IEEE
Symp. on Computers and Communications, 2000, pp. 125–131.
[33] A. Davenport, J. Beck, et al., “Slack-based techniques for robust schedules,” 2001.
[34] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, vol. 51, no. 01, 2008, pp. 107–113.

[35] M. Drozdowski and W. Gazek, “Scheduling Divisible Loads in a Three-Dimensional
Mesh of Processors,” Parallel Computing, vol. 25, no. 4, 1999, pp. 381–404.
[36] M. Drozdowski and L. Wielebski, “Isoefficiency Maps for Divisible Computations,”
IEEE Transactions on Parallel and Distributed Systems, vol. 21, 2010, pp. 872–880.
[37] M. Drozdowski and P. Wolniewicz, “Experiments with Scheduling Divisible Tasks in
Clusters of Workstations,” Euro-Par 2000 Parallel Processing, 2000, pp. 311–319.
[38] A. Ghatpande, H. Nakazato, O. Beaumont, and H. Watanabe, “SPORT: An Algorithm
for Divisible Load Scheduling With Result Collection on Heterogeneous Systems,”
IEICE Transactions on Communications, vol. 91, no. 08, 2008, pp. 2571–2588.
[39] H. González-Vélez and M. Cole, “Adaptive Statistical Scheduling of Divisible Workloads in Heterogeneous Systems,” Journal of Scheduling, vol. 13, no. 4, 2010, pp.
427–441.
[40] S. D. Gribble, “Robustness in Complex Systems,” Proc. 8th Workshop on Hot Topics
in Operating Systems. 2001, HOTOS ’01, IEEE Computer Society.
[41] D. Grosu and T. E. Carroll, “A Strategyproof Mechanism for Scheduling Divisible
Loads in Distributed Systems,” Proc. 4th Int. Symp. on Parallel and Distributed
Computing. 2005, ISPDC ’05, pp. 83–90, IEEE Computer Society.
[42] A. Guermouche and H. Renard, “A First Step to the Evaluation of SimGrid in the
Context of a Real Application,” 2010 IEEE Int. Symp. on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW). IEEE Computer Society, Apr 2010,
pp. 1–10.
[43] J. Hu and R. Klefstad, “Scheduling Multiple Divisible and Indivisible Tasks on Bus
Networks,” Proc. 2007 IEEE Int. Conf. on Cluster Computing. IEEE Computer Society, 2007, pp. 222–230.
[44] S. Hummel, J. Schmidt, R. Uma, and J. Wein, “Load-Sharing in Heterogeneous Systems via Weighted Factoring,” Proc. 8th Annu. ACM Symp. on Parallel Algorithms
and Architectures, 1997.
[45] S. Hummel, E. Schonberg, and L. Flynn, “Factoring: A Method for Scheduling
Parallel Loops,” Communications of the ACM, vol. 35, no. 8, 1992, pp. 90–101.
[46] R. K. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, John Wiley and Sons,
New York, April 1991.
[47] E. Jen, “Stable or Robust? What’s the Difference?,” Complexity, vol. 8, no. 3, 2003,
pp. 12–18.

[48] D. Klusáček and H. Rudová, “Alea 2 – Job Scheduling Simulator,” Proc. of the 3rd
Int. Conf. on Simulation Tools and Techniques (SIMUTools 2010). 2010, ICST.
[49] D. Kothe, “Science Prospects and Benefits with Exascale Computing,” Oak Ridge
National Laboratory, Tech. Rep. ORNL/TM-2007/232, 2007.
[50] B. Kreaseck, L. Carter, H. Casanova, and J. Ferrante, “Autonomous Protocols for
Bandwidth-Centric Scheduling of Independent-Task Applications,” Proc. 17th Int.
Symp. on Parallel and Distributed Processing. 2003, IEEE Computer Society.
[51] C. Kruskal and A. Weiss, “Allocating Independent Subtasks on Parallel Processors,”
IEEE Transactions on Software Engineering, vol. 11, no. 10, 1985, pp. 1001–1016.
[52] X. Li, V. Bharadwaj, and C. Ko, “Divisible Load Scheduling on Single-Level Tree
Networks with Buffer Constraints,” IEEE Transactions on Aerospace and Electronic
Systems, vol. 36, no. 4, Oct. 2000, pp. 1298–1308.
[53] X. Li and B. Veeravalli, “PPDD: Scheduling Multi-site Divisible Loads in Single-level Tree Networks,” Cluster Computing, vol. 13, no. 1, 2010, pp. 31–46.
[54] L. Massoulié and J. Roberts, “Bandwidth Sharing: Objectives and Algorithms,”
IEEE/ACM Trans. Netw., vol. 10, no. 3, June 2002, pp. 320–328.
[55] M. Oltikar, J. Brateman, J. White, J. Martin, K. Knapp, A. A. Maciejewski, and H. J.
Siegel, “Robust Resource Allocation in Weather Data Processing Systems,” Proc.
Int. Conf. Workshops on Parallel Processing. 2006, pp. 445–454, IEEE Computer
Society.
[56] C. Papadimitriou and M. Yannakakis, “Towards an Architecture-Independent Analysis of Parallel Algorithms,” SIAM Journal on Computing, vol. 19, 1990, pp. 322–328.
[57] C. D. Polychronopoulos and D. J. Kuck, “Guided Self-Scheduling: A Practical
Scheduling Scheme for Parallel Supercomputers,” IEEE Transactions on Computers, vol. C-36, no. 12, 1987, pp. 1425–1439.
[58] T. Robertazzi, “Ten Reasons to Use Divisible Load Theory,” Computer, vol. 36, no.
5, May 2003, pp. 63–68.
[59] T. Robertazzi and D. Yu, “Multi-Source Grid Scheduling for Divisible Loads,” Proc.
40th Annu. Conf. on Information Sciences and Systems. IEEE Computer Society,
2006, pp. 188–191.
[60] T. G. Robertazzi, “A Product form Solution for Tree Networks with Divisible Loads,”
Parallel Processing Letters, vol. 21, no. 01, 2011, pp. 13–20.


[61] V. Shestak, E. K. P. Chong, A. A. Maciejewski, H. J. Siegel, L. Benmohamed, I.-J. Wang, and R. Daley, “Resource Allocation for Periodic Applications in a Shipboard Environment,” Proc. 19th IEEE Int. Parallel and Distributed Processing Symp.
(IPDPS’05) - Workshop 1 - Volume 02. 2005, IEEE Computer Society.
[62] V. Shestak, J. Smith, A. A. Maciejewski, and H. J. Siegel, “Iterative Algorithms for
Stochastically Robust Static Resource Allocation in Periodic Sensor Driven Clusters,”
Proc. 8th IASTED Int. Conf. on Parallel and Distributed Computing and Systems,
2006, pp. 166–174.
[63] V. Shestak, J. Smith, R. Uml, J. Hale, P. Moranville, A. A. Maciejewski, and H. J.
Siegel, “Greedy Approaches to Static Stochastic Robust Resource Allocation for
Periodic Sensor Driven Distributed Systems,” Proc. Int. Conf. on Parallel and Distributed Processing Techniques and Applications, 2006, pp. 3–9.
[64] A. Shokripour, M. Othman, H. Ibrahim, and S. Shamala, “A New Method for
Scheduling Divisible Data on a Heterogeneous Two-Levels Hierarchical System,”
Procedia Computer Science, vol. 4, 2011, pp. 2196–2205.
[65] A. Shokripour, M. Othman, H. Ibrahim, and S. Subramaniam, “A New Method for
Job Scheduling in a Non-dedicated Heterogeneous System,” Procedia Computer
Science, vol. 3, 2011, pp. 271–275.
[66] J. Smith, L. Briceno, A. Maciejewski, H. Siegel, T. Renner, V. Shestak, J. Ladd,
A. Sutton, D. Janovy, S. Govindasamy, A. Alqudah, R. Dewri, and P. Prakash, “Measuring the Robustness of Resource Allocations in a Stochastic Dynamic Environment,” Proc. Int. Parallel and Distributed Processing Symposium. 2007, pp. 1–10,
IEEE Computer Society.
[67] J. Smith, E. K. P. Chong, A. Maciejewski, and H. Siegel, “Decentralized Market-Based Resource Allocation in a Heterogeneous Computing System,” Proc. IEEE Int.
Symp. on Parallel and Distributed Processing, 2008, pp. 1–12.
[68] J. Smith, V. Shestak, H. Siegel, and P. Sugavanum, “Resource Allocation in a Cluster
Based Imaging System,” Proc. Int. Conf. on Parallel and Distributed Processing
Techniques and Applications, 2007.
[69] H. J. Song, X. Liu, D. Jakobsen, R. Bhagwan, X. Zhang, K. Taura, and A. Chien, “The
MicroGrid: A Scientific Tool for Modeling Computational Grids,” Sci. Program., vol.
8, no. 3, Aug. 2000, pp. 127–141.
[70] S. Srivastava, I. Banicescu, F. M. Ciorba, and W. E. Nagel, “Enhancing the Functionality of a GridSim-based Scheduler for Effective Use with Large-Scale Scientific
Applications,” IEEE Int. Symp. on Parallel and Distributed Computing, 2011.


[71] S. Srivastava, N. Sukhija, I. Banicescu, and F. M. Ciorba, “Analyzing the Robustness
of Dynamic Loop Scheduling for Heterogeneous Computing Systems,” Proc. 11th
Int. Symposium on Parallel and Distributed Computing. 2012, ISPDC ’12, pp. 156–
163, IEEE Computer Society.
[72] N. Sukhija, I. Banicescu, S. Srivastava, and F. M. Ciorba, “Evaluating the Flexibility
of Dynamic Loop Scheduling on Heterogeneous Systems in the Presence of Fluctuating Load using SimGrid,” Proc. 14th Int. Workshop on Parallel and Distributed
Scientific and Engineering Computing. 2013, IEEE Computer Society.
[73] N. Sukhija, B. Malone, S. Srivastava, I. Banicescu, and F. M. Ciorba, “Portfolio-Based Selection of Robust Dynamic Loop Scheduling Algorithms Using Machine
Learning,” Proc. of the 2014 IEEE International Parallel & Distributed Processing
Symposium Workshops. 2014, pp. 1638–1647, IEEE Computer Society.
[74] S. Suresh, C. Run, H. J. Kim, T. G. Robertazzi, and Y.-I. Kim, “Scheduling
Second-Order Computational Load in Master-Slave Paradigm,” IEEE Transactions
on Aerospace and Electronic Systems, vol. 48, no. 01, 2012, pp. 780–793.
[75] A. Takefusa, S. Matsuoka, H. Nakada, K. Aida, and U. Nagashima, “Overview of
a Performance Evaluation System for Global Computing Scheduling Algorithms,”
Proc. 8th Int. Symp. on High Performance Distributed Computing, 1999. IEEE Computer Society, 1999, pp. 97–104.
[76] B. Veeravalli, X. Li, and C. C. Ko, “On the Influence of Start-up Costs in Scheduling
Divisible Loads on Bus Networks,” IEEE Transactions on Parallel and Distributed
Systems, vol. 11, no. 12, Dec. 2000, pp. 1288–1305.
[77] P. Velho, L. Schnorr, H. Casanova, and A. Legrand, “On the Validity of Flow-level
TCP Network Models for Grid and Cloud Simulations,” ACM Transactions on Modeling and Computer Simulation, vol. 23, no. 3, Oct. 2013.
[78] S. Viswanathan, B. Veeravalli, and T. Robertazzi, “Resource-Aware Distributed
Scheduling Strategies for Large-Scale Computational Cluster/Grid Systems,” IEEE
Transactions on Parallel and Distributed Systems, vol. 18, no. 10, Oct 2007, pp. 1450–1461.
[79] H. M. Wong, D. Yu, B. Veeravalli, and T. G. Robertazzi, “Data Intensive Grid
Scheduling: Multiple Sources with Capacity Constraints,” Proc. 15th IASTED Int.
Conf. on Parallel and Distributed Computing and Systems, 2003, vol. 1, pp. 7–11.
[80] Y. Yang, K. Van Der Raadt, and H. Casanova, “Multiround Algorithms for Scheduling Divisible Loads,” IEEE Transactions on Parallel and Distributed Systems, vol.
16, no. 11, 2005, pp. 1092–1102.


[81] D. Yu and T. Robertazzi, “Divisible Load Scheduling for Grid Computing,” 15th
IASTED Int. Conf. on Parallel and Distributed Computing and Systems, 2003, vol. 1,
pp. 1–6.
[82] E. Zahavi, “D-Mod-K Routing Providing Non-Blocking Traffic for Shift Permutations on Real Life Fat Trees,” Irwin and Joan Jacobs Center for Communication and
Information Technologies Report, vol. 776, 2010, pp. 1–7.


