Optimizing the Performance of Multi-threaded Linear Algebra Libraries Based on Task Granularity by Shirzad, Shahrzad
Louisiana State University 
LSU Digital Commons 
LSU Doctoral Dissertations Graduate School 
10-22-2020 
Optimizing the Performance of Multi-threaded Linear Algebra 
Libraries Based on Task Granularity 
Shahrzad Shirzad 
Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations 
 Part of the Other Computer Engineering Commons 
Recommended Citation 
Shirzad, Shahrzad, "Optimizing the Performance of Multi-threaded Linear Algebra Libraries Based on Task 
Granularity" (2020). LSU Doctoral Dissertations. 5376. 
https://digitalcommons.lsu.edu/gradschool_dissertations/5376 
This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It 
has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU 
Digital Commons. For more information, please contactgradetd@lsu.edu. 
OPTIMIZING THE PERFORMANCE OF MULTI-THREADED LINEAR
ALGEBRA LIBRARIES BASED ON TASK GRANULARITY
A Dissertation
Submitted to the Graduate Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
in
The Division of Electrical and Computer Engineering
by
Shahrzad Shirzad
B.Sc., Sharif University of Technology, 2006
M.Sc., K. N. Toosi University of Technology, 2009
December 2020
Acknowledgments
First of all I would like to thank my advisors Dr. Kaiser and Dr. Ramanujam for their
help and support. I am and will always be beyond grateful for Dr. Kaiser’s care, support and
guidance throughout these years.
I would like to express my gratitude and appreciation for Dr. Soysal for supporting me and
helping me to grow.
I also want to thank my husband Alireza for his love and all the sacrifices he made to make
this possible for me, and my little one, Damon, whose smile brings joy to my life.
Special thanks to my sister Sharareh, I was so lucky to have her here with me, it certainly
made life easier and more pleasant for me.
I have to thank my parents, my sister Shohreh and her lovely family including my dearests
Kian and Karen for their love and support, I couldn’t go on without it.
Special thanks to Maziar, Mobina, Kasra, Raysan with their little one Zhina, Melika, Elika
and Zahra for their invaluable friendship throughout these years.
Finally I want to express my deep gratitude to my friends at CCT especially Adrian, Rod,
and Bibek for spending hours listening to me, guiding and motivating me, and Katie for all the
time she spent proofreading my dissertation.
ii
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Chapter
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1. Dissertation Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2. Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3. Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1. Asynchronous Many-task Runtime Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2. Task Granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3. Analytical Modeling of the Execution Time of the Parallel Programs . . . . . . . . . . 7
2.4. Loop Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5. Blaze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3. Data Collection and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1. Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3. Observation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4. Polynomial Modeling of the Throughput in Terms of Grain Size . . . . . . . . . . . . . . . . . . . 26
4.1. Generalizing the Fitted Function to Include Number of Cores . . . . . . . . . . . . . . . . 27
4.2. Finding the Range of Grain Sizes for Maximum Performance . . . . . . . . . . . . . . . . . 30
4.3. Estimating the Optimum Range of Chunk Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4. Conclusions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5. An Analytical Model for Predicting the Optimum Range of Grain
Size for Minimum Execution Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.1. Evaluating the Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2. Testing the Proposed Model on Other Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3. Identifying the Optimal Range of Grain Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4. Identifying the Range of Grain Size for Minimum Execution Time . . . . . . . . . . . . 64
5.5. Locating the Optimum Range of Grain Sizes for the for-loop Benchmark . . . . . 65
5.6. Applying the Proposed Model to the Blazemark Data . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.7. Predicting the Range of Chunk Size for Minimum Execution Time. . . . . . . . . . . . 80
5.8. Conclusions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6. Runtime Task Granularity Control through Splittable Tasks . . . . . . . . . . . . . . . . . . . . . . . . 92
6.1. Splittable Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3. Using the Proposed Analytical Model to Estimate the Stop
Splitting Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
iii
7. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Appendix
A. Implementation of the Splittable Task and Executor in HPX . . . . . . . . . . . . . . . . . . . . . . . 123
Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A.1. Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
iv
Abstract
Linear algebra libraries play a very important role in many HPC applications. As larger
datasets are created everyday, it also becomes crucial for the multi-threaded linear algebra
libraries to utilize the compute resources properly.
Moving toward exascale computing, the current programming models would not be able
to fully take advantage of the advances in memory hierarchies, computer architectures, and
networks. Asynchronous Many-Task(AMT) Runtime systems would be the solution to help
the developers to manage the available parallelism.
In this Dissertation we propose an adaptive solution to improve the performance of a lin-
ear algebra library based on a set of compile-time and runtime characteristics including the
machine architecture, the expression being evaluated, the number of cores to run the applica-
tion on, the type of the operation, and also the size of the matrices, to get as close as possible to
the highest performance. Our focus is on machine learning applications, where we are poten-
tially dealing with very large matrices, which could make creating temporaries very expensive.
For this purpose we selected Blaze C++ library, a high performance template-based math
library that gives us this option to access the expression tree at compile time, along with HPX,
a C++ standard library for concurrency and parallelism, as our runtime system.
HPX, as an AMT runtime system, offers scalibility and fine-grained parallelism through
creating lightweight threads for fast context switching between the threads. Finding the opti-
mum task granularity is a challenge in AMTs. Creating too many small tasks would result in
performance degradation due to scheduling overhead of the tasks, and creating too few tasks
would lead to under-utilization of the resources. Our work focuses on finding the optimum
task granularity for each specific problem.
We tried two different approaches to model the relationship between the performance
and the grain size, in order to find a range of grain sizes that could lead us to the maximum
performance.
First, we used polynomial functions to model how throughput changes in terms of the
v
grain size, and number of cores. Although this method was successful in finding the range
of grain sizes for maximum throughput, it was not physical. This motivated us to go deeper
and try to develop an analytical model for execution time in terms of grain size for balanced
parallel for loops. Based on the analytical model we propose a method to predict the range of
grain size for minimum execution time.
Moreover, since the parameters of the proposed model only depend on the system archi-
tecture, we suggest to use a parallel for-loop benchmark to find these parameters on a system,
and use it to find the range of grain size for minimum execution time for arbitrary balanced
parallel for-loop applications ran on the same machine.
Having the mentioned models, we changed the current implementation of the HPX back-
end for Blaze by adding two parameters to represent the unit of work, and the number of units
included in each task, for fine grained control of the parallelism, which is possible through
HPX runtime system. Also, a complexity estimation function has been added to Blaze to esti-
mate the number of floating point operations occurring in each unit of work.
The model parameters estimated through the parallel for-loop benchmark could also be
plugged into Blaze at compile time, in order to find the optimum range of grain size at runtime
based on the matrix sizes and complexity of the operations.
In the next step, we used the identified range of grain size to extend the previous imple-
mentation of splittable tasks, as an algorithm to control task granularity. We modified the
current implementation by scheduling the tasks on idle cores directly instead of waiting for
them to be stolen, and integrating the lower-bound of the analytical model as the threshold to




The path toward exascale computing would include more complex machine architec-
tures, deeper memory hierarchies, heterogeneous nodes, and complicated networks[1]. Cur-
rent programming models would not be sufficient. Advanced runtime systems with support
for new programming languages and models are needed to manage the huge amount of par-
allelism that would be available[2].
Asynchronous many-task(AMT) models and their corresponding run-times are the solu-
tion to keep application developers safe from the upcoming architectures, by mitigating exas-
cale difficulties to run-time level[3].
On the other hand, the core element of many high performance computing applications
is the linear algebra library. The performance of these applications significantly relies on the
performance of their linear algebra library. BLAS(Basic Linear Algebra Subprograms) are the
fundamental routines for basic vector and matrix operations. But in order to tune BLAS for a
specific architecture requires a deep understanding of memory hierarchy and registers from
the programmer[4]. Linear algebra libraries like ATLAS[4], SPIRAL[5] use hardware-specific
optimizations to improve their performance.
In this work, we study performance of a linear algebra library based on the application
parameters such as matrix size, operation, the expression, data layout, and also the machine
architecture in order to optimize the execution time. A number of these parameters can be
extracted at compile time, while extraction of the others must be postponed to runtime.
1.1. Dissertation Statement
The main objective of this dissertation is to propose a hybrid runtime and compile-time
solution for a linear algebra library to fully take advantage of the available parallelism and
resources.
We chose Blaze math library since it is a high performance template-based C++ library
that allows you to access the expression tree for each assignment at compile time, and we
chose HPX as an asynchronous may-task runtime system to manage the parallelism.
1
HPX makes it possible to create thousands to millions of lightweight user threads, to avoid
expensive context switching. Although the overhead of creating one task is negligible, creating
millions of tasks when the execution time of the program itself is small, could become signif-
icant and cause performance degradation. On the other hand if we create too few tasks it
would be very likely for us to not use our resources properly. So in every application in many
task systems, it’s very important to chose the amount of work assigned to each task, called
grain size, properly.
Through analyzing and modeling the relationship between throughput and grain size, we
would be able to identify a range of grain size that leads us to maximum performance. Once
decided how big one unit of work should be, based on the identified range we would be able
to decide on how many units of work should be packed into one task.
On the other hand, there are different models to express the relationship between the
throughput and the number of cores. Here, we are interested in developing a model to be
as realistic as possible to imitate the behavior of the throughput against both grain size and
number of cores. This would help us find how to manage the parallelism in our system to
achieve the highest performance possible.
1.2. Research Contributions
The contributions of this dissertation could be summarized into:
• Developing an analytical model to predict the total execution time of a balanced parallel
for-loop. To our knowledge, this is the first analytical model in terms of both grain size
and number of cores.
• A method has been offered to predict the range of grain sizes to achieve minimum exe-
cution time for a particular number of cores.
• Building a benchmark on top of HPX to evaluate the model. We utilized this bench-
mark to estimate the model parameters on each machine architecture. The obtained
parameters could then be used to estimate the range of grain sizes for minimum exe-
2
cution time, and consequently improve the performance of any other balanced parallel
for-loop application executed on the same system.
• Modifying the implementation of HPX backend for Blaze in order to be able to impose
fined grain parallelism and provide a tool to control task granularity.
• Modifying the current implementation of splittable tasks so that at each split, instead of
allowing the created splittable task to be stolen by free workers, we turn work stealing
off, identify the idle workers and explicitly assign the work to one of them, in order to
avoid the work stealing overhead originated from unsuccessful steal attempts.
• We utilized our proposed method to identify the optimum range of grain sizes, and use
the lower-bound of the range as a cutoff value to stop splitting in order to avoid creating
too fine tasks. This threshold is specific to the machine architecture and the application
being executed.
• Our work on runtime adaptivity through splittable tasks is integrated as splittable ex-
ecutor into HPX, an open source C++ Standard Library for Concurrency and Parallelism.
1.3. Dissertation Outline
In Chapter 2, we will explain briefly the background needed for this dissertation, includ-
ing Blaze and HPX libraries, effects of task granularity, and the Universal Scaling Law (USL)
method for modeling the throughput based on number of cores. In an effort to understand
the association between excution time and grain size we ran a set of experiments. The experi-
ment environment and the obtained results are explained in Chapter 3.
In Chapter 4 and Chapter 5 we investigate the possibility of using polynomial regression
and an analytical model respectively to express how execution time changes in terms of grain
size and number of cores for the purpose of identifying the optimum range of grain size. Chap-
ter 5 shows how the identified range of grain size could help a previously implemented algo-
rithm. Finally Chapter 7 discusses other works that have been done in our area of focus, and
3
the conclusions of this dissertation is presented in Chapter 8.
4
Chapter 2. Background
2.1. Asynchronous Many-task Runtime Systems
A parallel programming model includes a programming model, which refers to the mech-
anism for a program to express the concurrency, and an execution model, indicating how the
program creates and controls concurrency[6, 3]. Some of the common existing execution
models include fork-join, Communicating Sequential Processes(CSP), event-based models
and actor model[3]. Some of the current parallel programming frameworks include, accelerator-
based programming models like OpenACC[7], OpenCL[8], and CUDA[9], shared-memory pro-
gramming models like Intel TBB[10], Cilk[11], OpenMP[12], and distributed-memory pro-
gramming models like MPI[13], Charm++[14], Parallex[15], UPC[6].
A runtime system is in charge of creating and managing the concurrency, through im-
plementing parts of an execution model[3]. An asynchronous many-task(AMT) model on the
other hand, is a category of programming model and execution model. An AMT programming
model breaks the work into small and transferable tasks along with their associated input. In
an AMT execution model, tasks are being executed when their inputs are available rather than
in a well defined order[3].
Some of well known AMT runtimes include: HPX[2], Charm++[14], Uintah[16], Legion[17].
2.1.1. HPX
HPX[2] is a C++ runtime system for parallel and distributed applications based on ParalleX
execution model[15]. HPX creates lightweight user-level threads with fast context switching[6].
Whenever one thread is blocked, the scheduler picks a ready thread based on a scheduling pol-
icy. This allows you to hide the latency, and avoid starvation while keeping a high utilization
of the resources[6].
HPX provides work sharing and work stealing for automatic balancing of the workload
among the resources [18].
5
2.1.1.1. Execution Model
The Parallex [15] execution model identifies four potential sources for performance degra-
dation as "SLOW": Starvation, Latency, Overheads, and Waiting or contention[15].
Starvation refers to a situation where there is an insufficient amount work for the comput-
ing resource. this could be due to insufficient total amount of work available, or unbalanced
distribution of work among resources[6].
Latency is the time distance, usually measured in processor clock, of accessing remote
data or services[19].
Overhead refers to the effort that needs to be taken to manage parallel resources and ac-
tions on the critical path[6].
Finally, waiting is the contention of the shared physical and logical resources causing one
request to be blocked by another access of the same resource[19]. This could happen due to
limited network bandwidth, shared communication channels, memory bank conflicts[6],[19].
2.2. Task Granularity
Defining the grain size as the amount of work assigned to one HPX thread, Grubel[20]
studies the effect of grain size on the execution time for a fixed number of cores. The results
show that, for small grain sizes the overhead of creating the tasks, and for large grain sizes the
starvation, is the dominant factor affecting the execution time[20]. When grain size is small,
to perform same amount of work, a higher number of tasks is created, and there is an over-
head associated with creation of each task. Although this overhead is very small (order of
microseconds), when the amount of work performed by each thread is also small, this over-
head becomes significant. A the grain size increases, these overheads are amortized by the
time it takes to execute the task.
On the other hand, when grain size is increases, the number of tasks being created de-
creases, up until a point where the number of tasks being created is smaller than the number
of cores. At this point another factor would interfere with the performance, which is referred
6
to as starvation. Starvation happens where a large amount of work is assigned to some of the
cores while the other cores are idling. At this point we are not using our resources efficiently.
While overheads of creating tasks degrades the performance for small grain sizes and star-
vation causes the execution time to increase for large grain sizes, there is a region in between
where changing the grain size does not affect the performance.
Figure 2.1. The effect of task size on execution time for Stencil application [20]
2.3. Analytical Modeling of the Execution Time of the Parallel Programs
The execution time of a parallel program is highly dependent of the parallel algorithm to
be evaluated, alongside the architecture it is implemented on[21].
2.3.1. Universal Scalibility Law
Amdahl’s law[22], states that the amount of achievable speed up by adding more pro-
cessors when running a parallel application, is restricted by the amount of code that could
actually be parallelized. Equation 2.1, shows the relationship between speedup and number
7
of processors, where σ is the serial fraction of the execution time, based on Amdahl’s law[23].
S(p) = p
1+σ(p −1) (2.1)
However, Gunther[23] extends Amdahl’s law by incorporating the effect of three factors,
namely concurrency, contention, and coherency, as shown in Equation 2.2.
S(p) = p
1+σ(p −1)+κp(p −1) (2.2)
Concurrency(p) represents the linear speedup that could have been achieved if no in-
teraction existed among the processors, contention(σ) represents the serialization effect of
shared writable data, and finally coherency or data consistency(κ) represents the effort that
needs to be made for keeping shared writable data consistent[23].
Figure 2.2 shows an example of the ideal linear speedup we expect to see when increasing
the number of the processors, against the actual achievable speedup based on Amdahl’s law
and USL.
Equation 2.3 generalizes Equation 2.2 to represent the throughput by adding another
parameter(γ) to represent the serial throughput.
X (p) = γp
1+σ(p −1)+κp(p −1) (2.3)
Universal scalibility law also suggests that for some values of σ and κ there could be a
certain number of processors that yield to maximum performance[23]. Increasing the number
of processor beyond that point would only cause performance degradation.
8
Figure 2.2. An example of the achievable speedup based on Amdahl’s law and USL compared
to the ideal linear speedup where σ= 0.04 and κ= 0.005.
2.3.2. Other Models
There are a few other models that have also been suggested to simulate the scalibility.
Geometric model is a one-parameter model, in which speedup has the following relationship




The parameter φ, where 0 < φ É 1, is called the MP factor, and it represents the remaining
section of the processor capacity after deducting overheads.
The geometric model is nonphysical for large number of processors, due to inconsistency
with Coxian queuing model[24].
Quadratic model[25], with overhead parameter γ, where 0 É γ< 1, is represented in Equa-
tion 2.5.
S(p) = p −γp(p −1) (2.5)
9




The problem with this model is that its not physical. This model represents an inverted parabola
that will intersect the x axis at two points, implying that there with be a certain number of pro-
cessors starting from which the speedup would be negative.
Exponential model, is also a single parameter model α where 0 <αÉ 1. This parameter is
a combination of coherency and contention.
S(p) = p(1−α)(p−1) (2.7)
This model also has a critical point, but this point is very sensitive to α. Although this
model works very well for small number of processors, it imposes a severe capacity degrada-
tion for large number of processors.
2.4. Loop Scheduling
Loop scheduling refers to different ways iterations could be assigned to the processors and
the order of their execution. The main reason for performance degradation in loop scheduling
is load imbalance, which refers to situations where different amount of work is assigned to
different processors [26].
In static loop scheduling, each processor is assigned an equal number of consecutive iter-
ations. The iterations included in one task, are either could also be selected in a round-robin
way, which would not be helpful from cache coherency point of view [27]. This scheduling
method includes almost no runtime overhead due to scheduling, due to assignments occur-
ring at compile time. Although iterations are evenly assigned to the processors, several factors
can still contribute to difference in execution time of the iterations, resulting in load imbal-
ance between the processors. These factors include, cache misses, page faults, and interpro-
cessor communication [28].
10
On the other hand, dynamic scheduling approaches leave the assignments to the runtime.
Although this concludes higher scheduling overhead, these approaches could help with load
balancing. These methods include Pure Self-scheduling, Chunk Self-scheduling, Guided Self-
scheduling[29], Trapezoid Self-scheduling[30],[27], and Factoring[31].
In Pure Self-scheduling every time a processors becomes idle, it fetches one loop itera-
tion. This approach, while achieving a high load balance, imposes a considerable amount
of scheduling overhead when we are dealing with a fine-grain workload, and a large number
of iterations. Also frequent access to shared variables like loop index could lead to memory
contention[27].
In order to decrease the high scheduling overhead of Pure Self-scheduling methods, Chunk
Self-scheduling method assigns a certain number of iterations(called chunk size) to each idle
processor. This method trades lower scheduling overhead with higher load imbalance. Selec-
tion of the chunk size plays a very important role in the performance, as so a large chunk size
increases the scheduling overhead decreases and causes load imbalance, while a small chunk
size increases memory contention and scheduling overhead[27].
As an adaptive loop scheduling technique, Guided Self-scheduling[29] divides the remain-
ing number of iterations at each request evenly among the processors, and assigns it to the
processor that made the request, while updating the number of remaining iterations. This
causes larger number of iterations to be assigned to the processors at the beginning of the
loop execution, which results in lower scheduling overhead. The number of iterations as-
signed to each processor decreases as it approaches to the end of the execution, generating
tasks containing only one or two iterations, causing an increase in the scheduling overhead.
In order to tackle this issue, a minimum number of chunks could be set to avoid creation of
very small chunks[32].
Very similar to Guided Self-scheduling, Factoring[31] also decreases the chunk size as the
loop execution proceeds, with this difference that it does it in batches of equal sized chunks.
If the first iterations of the loop are more time consuming then the rest of the iterations, Fac-
11
toring performs better than Guided scheduling[33].
Along with the mentioned loop scheduling approaches, two other scheduling techniques
can be utilized for load balancing. Work stealing[34] lets the processors to steal work from
other processors queue, resulting in a more balanced load distribution. In work sharing on
the other hand, each time a processor creates new threads, the scheduler would try to migrate
some of them to other processors for a more balanced load distribution[34].
Each of the mentioned methods performs well for a specific problem. In this dissertation
we are interested in finding a general solution which can automatically decide on the grain
size or chunk size parameter for each application in order to achieve the best performance.
2.5. Blaze
Blaze Math Library[35] is a C++ library for linear algebra. Blaze, based upon Expression
Templates(ETs)[36], introduces "smart" expression templates(SETs)[35] to optimize the per-
formance for array-based operations. Expression Templates[36] is an abstraction technique
that uses overloaded operators in C++ to prevent creation of unnecessary temporaries, while
evaluating arithmetic expressions, in order to improve the performance[35]. The ET-based
approaches create a parse tree of the expression at compile time and postpone the actual
evaluation to when the expression is assigned to a target.
Although able to achieve promising performances for element-wise operations, these meth-
ods are not suitable for high performance computing for the following reasons. Due to their
abstraction from both the data type and also the operation itself, they do not allow optimiza-
tions specific to the type of the arrays, alongside the operation[35]. As a solution, Blaze pro-
poses smart ETs with these three main additions: integration with architecture-specific highly
optimized compute kernels, creation of intermediate temporaries when needed, and select-
ing optimal evaluation method automatically for compound expressions[35].
Some of the ET-based linear algebra libraries are: Blitz++[37], Boost uBLAS[38], MTL[39],
and Eigen[40]. Among these libraries, Eigen, MTL, alongside Blaze, impose different concep-
tual changes to ETs in order to make them suitable for HPC. Blaze also makes it possible
12
for programmers to utilize SIMD(Single Instruction Multiple Data) vectorization simply by
adding a compile time flag.
2.5.1. Parallelization in Blaze
Depending on the operation and the size of operands, this assignment could be paral-
lelized through four different backends, namely, HPX, OpenMP[12], C++ threads, and Boost[41].
Table 2.1 shows the default value for some of the threshold for parallelization applied to
operations performed in Blaze. It should be noted that these thresholds should be tuned based
on the parallelization backend and also the system architecture.
Table 2.1. List of some of the thresholds applied to the operations performed by Blaze, starting
from which the operation is executed in parallel
Benchmark Array size
DV ecDV ec ADD , DV ecDV ecMU LT 38000
DM atDM at ADD 36100 elements equivalent to a 175×175 matrix
DM atDM at MU LT 3025 elements equivalent to a 55×55 matrix
2.5.2. Implementation of HPX Backend
As stated earlier, as an ET-based library, Blaze performs the calculations when an expres-
sion is assigned to a target, which is implemented through the blaze::Assign function.
The four mentioned backends parallelize this assignment process through a parallel for-
loop, in which at each iteration a specific section of each of the vectors or matrices(called a
block) is selected and assigned to a core. Each core then performs the operation on the block
they have been assigned to.
Each backend uses their own method for parallelizing this for loop. For HPX backend,
current implementation uses a HPX parallel::for_loop with static chunking policy and chunk
size of 1. This way, knowing the number of cores to run the application on, we can divide the
original matrix equally among the cores, while the order of assignment of blocks to the cores
13
is known at compile time.
Listing 2.1 shows the current implementation of the HPX backend in Blaze.
What we suggest here is that, some prior knowledge for example, architecture of the sys-
tem we are running the application on, the expression that has to be executed, number of
cores of the system, size and type of the arrays we are dealing with, and etc. should be able to
help us to achieve a higher performance.
For this purpose we introduced two parameters block_si ze and chunk_si ze. Instead
of evaluating the expression on the whole matrix, we divide the matrices into blocks of size
bl ock_si ze and perform the operations on each of these blocks at each iteration. Since these
blocks are independent from each other, a number of these blocks could be assigned to each
core in order to exploit parallelism. chunk_si ze denotes the number of iteration, or the num-
ber of blocks of size block_si ze that are included in one task. Listing 2.2 shows the new
implementation of the HPX backend for Blaze based on these new definition.
2.5.3. Blazemark
Blazemark is a benchmark suite provided by Blaze to compare the performance of Blaze
with other linear algebra libraries including Blitz++[37], Boost uBLAS[41], GMM++[42], Armadillo[43],
MTL4[39], and Eigen3[44], alongside plain BLAS libraries like Atlas[45], Goto[46], and Intel
MKL.[47]
14
Listing 2.1: Previous implementation of Assign function for HPX backend in Blaze.
1 template< typename MT1 // Type of the l e f t −hand side dense matrix
2 , bool SO1 // Storage order of the l e f t −hand side dense matrix
3 , typename MT2 // Type of the right−hand side dense matrix
4 , bool SO2 // Storage order of the right−hand side dense matrix
5 , typename OP > // Type of the assignment operation
6 void hpxAssign ( DenseMatrix<MT1, SO1>& lhs , const DenseMatrix<MT2, SO2>& rhs , OP op )
7 {
8 using hpx : : p a r a l l e l : : for_loop ;




13 using ET1 = ElementType_t<MT1>;
14 using ET2 = ElementType_t<MT2>;
15
16 constexpr bool simdEnabled ( MT1 : : simdEnabled && MT2 : : simdEnabled && IsSIMDCombinable_v<ET1 , ET2> ) ;
17 constexpr s i z e _ t SIMDSIZE( SIMDTrait< ElementType_t<MT1> > : : s i z e ) ;
18
19 const bool lhsAligned ( (~ lhs ) . isAligned ( ) ) ;
20 const bool rhsAligned ( (~ rhs ) . isAligned ( ) ) ;
21
22 const s i z e _ t threads ( getNumThreads ( ) ) ;
23 const ThreadMapping threadmap ( createThreadMapping ( threads , ~rhs ) ) ;
24
25 const s i z e _ t addon1 ( ( ( (~ rhs ) . rows ( ) % threadmap . f i r s t ) ! = 0UL ) ? 1UL : 0UL ) ;
26 const s i z e _ t equalShare1 ( (~ rhs ) . rows ( ) / threadmap . f i r s t + addon1 ) ;
27 const s i z e _ t rest1 ( equalShare1 & ( SIMDSIZE − 1UL ) ) ;
28 const s i z e _ t rowsPerThread ( ( simdEnabled && rest1 ) ? ( equalShare1 − rest1 + SIMDSIZE ) : ( equalShare1 ) ) ;
29
30 const s i z e _ t addon2 ( ( ( (~ rhs ) . columns ( ) % threadmap . second ) != 0UL ) ? 1UL : 0UL ) ;
31 const s i z e _ t equalShare2 ( (~ rhs ) . columns ( ) / threadmap . second + addon2 ) ;
32 const s i z e _ t rest2 ( equalShare2 & ( SIMDSIZE − 1UL ) ) ;
33 const s i z e _ t colsPerThread ( ( simdEnabled && rest2 ) ? ( equalShare2 − rest2 + SIMDSIZE ) : ( equalShare2 ) ) ;
34
35 for_loop ( par , s i z e _ t ( 0 ) , threads , [ & ] ( i n t i )
36 {
37 const s i z e _ t row ( ( i / threadmap . second ) * rowsPerThread ) ;
38 const s i z e _ t column( ( i % threadmap . second ) * colsPerThread ) ;
39
40 i f ( row >= (~ rhs ) . rows ( ) | | column >= (~ rhs ) . columns ( ) )
41 return ;
42
43 const s i z e _ t m( min( rowsPerThread , (~ rhs ) . rows ( ) − row ) ) ;
44 const s i z e _ t n( min( colsPerThread , (~ rhs ) . columns ( ) − column ) ) ;
45
46 i f ( simdEnabled && lhsAligned && rhsAligned ) {
47 auto t a r g e t ( submatrix<aligned >( ~lhs , row , column , m, n ) ) ;
48 const auto source ( submatrix<aligned >( ~rhs , row , column , m, n ) ) ;
49 op( target , source ) ;
50 }
51 else i f ( simdEnabled && lhsAligned ) {
52 auto t a r g e t ( submatrix<aligned >( ~lhs , row , column , m, n ) ) ;
53 const auto source ( submatrix<unaligned >( ~rhs , row , column , m, n ) ) ;
54 op( target , source ) ;
55 }
56 else i f ( simdEnabled && rhsAligned ) {
57 auto t a r g e t ( submatrix<unaligned >( ~lhs , row , column , m, n ) ) ;
58 const auto source ( submatrix<aligned >( ~rhs , row , column , m, n ) ) ;
59 op( target , source ) ;
60 }
61 else {
62 auto t a r g e t ( submatrix<unaligned >( ~lhs , row , column , m, n ) ) ;
63 const auto source ( submatrix<unaligned >( ~rhs , row , column , m, n ) ) ;
64 op( target , source ) ;
65 }
66 } ) ;
67 }
15
Listing 2.2: New implementation of Assign function for HPX backend in Blaze.
1 template< typename MT1 // Type of the l e f t −hand side dense matrix
2 , bool SO1 // Storage order of the l e f t −hand side dense matrix
3 , typename MT2 // Type of the right−hand side dense matrix
4 , bool SO2 // Storage order of the right−hand side dense matrix
5 , typename OP > // Type of the assignment operation
6 void hpxAssign ( DenseMatrix<MT1, SO1>& lhs , const DenseMatrix<MT2, SO2>& rhs , OP op )
7 {
8 using hpx : : p a r a l l e l : : for_loop ;




13 using ET1 = ElementType_t<MT1>;
14 using ET2 = ElementType_t<MT2>;
15
16 constexpr bool simdEnabled ( MT1 : : simdEnabled && MT2 : : simdEnabled && IsSIMDCombinable_v<ET1 , ET2> ) ;
17 constexpr s i z e _ t SIMDSIZE( SIMDTrait< ElementType_t<MT1> > : : s i z e ) ;
18
19 const bool lhsAligned ( (~ lhs ) . isAligned ( ) ) ;
20 const bool rhsAligned ( (~ rhs ) . isAligned ( ) ) ;
21
22 const s i z e _ t threads ( getNumThreads ( ) ) ;
23 const s i z e _ t numRows ( min( s t a t i c _ c a s t <std : : s ize_t >( BLAZE_HPX_MATRIX_BLOCK_SIZE_ROW ) , (~ rhs ) . rows ( ) ) ) ;
24 const s i z e _ t numCols ( min( s t a t i c _ c a s t <std : : s ize_t >( BLAZE_HPX_MATRIX_BLOCK_SIZE_COLUMN ) , (~ rhs ) . columns ( )
) ) ;
25
26 const s i z e _ t rest1 ( numRows & ( SIMDSIZE − 1UL ) ) ;
27 const s i z e _ t rowsPerIter ( ( simdEnabled && rest1 ) ? ( numRows − rest1 + SIMDSIZE ) : ( numRows ) ) ;
28 const s i z e _ t addon1 ( ( ( (~ rhs ) . rows ( ) % rowsPerIter ) ! = 0UL ) ? 1UL : 0UL ) ;
29 const s i z e _ t equalShare1 ( (~ rhs ) . rows ( ) / rowsPerIter + addon1 ) ;
30
31 const s i z e _ t rest2 ( numCols & ( SIMDSIZE − 1UL ) ) ;
32 const s i z e _ t c o l s P e r I t e r ( ( simdEnabled && rest2 ) ? ( numCols − rest2 + SIMDSIZE ) : ( numCols ) ) ;
33 const s i z e _ t addon2 ( ( ( (~ rhs ) . columns ( ) % c o l s P e r I t e r ) ! = 0UL ) ? 1UL : 0UL ) ;
34 const s i z e _ t equalShare2 ( (~ rhs ) . columns ( ) / c o l s P e r I t e r + addon2 ) ;
35
36 hpx : : p a r a l l e l : : execution : : dynamic_chunk_size chunkSize ( BLAZE_HPX_MATRIX_CHUNK_SIZE ) ;
37
38 for_loop ( par . with ( chunkSize ) , s i z e _ t ( 0 ) , equalShare1 * equalShare2 , [ & ] ( i n t i )
39 {
40 const s i z e _ t row ( ( i / equalShare2 ) * rowsPerIter ) ;
41 const s i z e _ t column( ( i % equalShare2 ) * c o l s P e r I t e r ) ;
42
43 i f ( row >= (~ rhs ) . rows ( ) | | column >= (~ rhs ) . columns ( ) )
44 return ;
45
46 const s i z e _ t m( min( rowsPerIter , (~ rhs ) . rows ( ) − row ) ) ;
47 const s i z e _ t n( min( colsPerIter , (~ rhs ) . columns ( ) − column ) ) ;
48
49 i f ( simdEnabled && lhsAligned && rhsAligned ) {
50 auto t a r g e t ( submatrix<aligned >( ~lhs , row , column , m, n ) ) ;
51 const auto source ( submatrix<aligned >( ~rhs , row , column , m, n ) ) ;
52 op( target , source ) ;
53 }
54 else i f ( simdEnabled && lhsAligned ) {
55 auto t a r g e t ( submatrix<aligned >( ~lhs , row , column , m, n ) ) ;
56 const auto source ( submatrix<unaligned >( ~rhs , row , column , m, n ) ) ;
57 op( target , source ) ;
58 }
59 else i f ( simdEnabled && rhsAligned ) {
60 auto t a r g e t ( submatrix<unaligned >( ~lhs , row , column , m, n ) ) ;
61 const auto source ( submatrix<aligned >( ~rhs , row , column , m, n ) ) ;
62 op( target , source ) ;
63 }
64 else {
65 auto t a r g e t ( submatrix<unaligned >( ~lhs , row , column , m, n ) ) ;
66 const auto source ( submatrix<unaligned >( ~rhs , row , column , m, n ) ) ;
67 op( target , source ) ;
68 }
69 } ) ;
70 }
16
Figure 2.3. An example of the results obtained from running DV ecDV ec ADD benchmark
through Blazemark
17
Chapter 3. Data Collection and Analysis
In order to understand the relationship between number of cores, chunk_si ze, bl ock_si ze,
and the performance, we ran a series of experiments with different of these parameters and
measured the number of floating point operations per second performed.
3.1. Configurations
Our experiments were run on Marvin and Medusa nodes of Rostam cluster at Center for
Computation and Technology(CCT) at Louisiana State University. Table 3.1- 3.3 show some of
the specifications of these nodes.
Table 3.1. Specifications of the Marvin and Medusa node from Rostam cluster at CCT.
Node CPU RAM Number of Cores Hyperthreading
Marvin 2 x Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz 48 GB 16 Off
Medusa 2 x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz 96 GB 40 Off
Table 3.2. Cache specifications of the Marvin node from Rostam cluster at CCT.
Cache Level Coherency Line Size Number of Sets Ways of Associativity Size
1 64 512 8 32KB
2 64 512 8 256KB
3 64 512 20 20480KB
Table 3.3. Cache specifications of the Medusa node from Rostam cluster at CCT.
Cache Level Coherency Line Size Number of Sets Ways of Associativity Size
1 64 512 8 32KB
2 64 1024 11 1024KB
3 64 40960 16 28160KB
18







For these experiments at the first step we selected the DM atDM at ADD benchmark which
was implemented in Blazemark. DM atDM at ADD benchmark is a level 3 BLAS function to
perform matrix-matrix addition in the form of A = B +C , where A, B , C are square matrices of
the same size. Throughout this dissertation matrix size of m represents a square matrix of size
m ×m.
To avoid adding the scheduling overhead for small matrix sizes, Blaze uses a threshold to
start parallelization, which is specific to the type of operation. For matrix-matrix addition, if
the number of elements in the matrix is greater than 36100 elements(which is equivalent to
a square matrix of size 190) Blaze uses the configured backend to parallelize the assignment
operation. For this reason, we start our experiments with matrix size of 200 and gradually
increase the size to 1587.
Table 3.5 show the matrix sizes and the number of cores chosen for our experiments with
DM atDM at ADD benchmark.
Table 3.5. List of different values used for each variable for running the DM atDM at ADD
benchmark
Category Configuration
Matrix sizes 200, 230, 264, 300, 396, 455, 523, 600, 690, 793, 912, 1048, 1200, 1380, 1587
Number of cores 1, 2, 3, 4, 5, 6, 7, 8
Number of rows in the block 4, 8, 12, 16, 20, 32
Number of columns in the block 64, 128, 256, 512, 1024
Chunk size Between 1 and total number of blocks (logarithmic increase)
19
Figure 3.2 shows the results of running DM atDM at ADD benchmark for matrix sizes and
number of cores listed in Table 3.5 based on grain size.
On the other hand, Figure 3.3 integrates the results obtained from running the same bench-
mark with different matrix sizes. Each color in this graph represents a specific matrix size.
Figure 3.1. The results obtained from running DM atDM at ADD benchmark through Blaze-
mark for matrix size 690 on different number of cores.
20
Figure 3.2. The results obtained from running DM atDM at ADD benchmark through Blaze-
mark for matrix of size 690 from two different angles
21
Figure 3.3. The results obtained from running DM atDM at ADD benchmark through Blaze-
mark for matrix sizes from 200×200 to 1587×1587
3.3. Observation
The final purpose of our experiments is to find a chunk size that gives us the best perfor-
mance for a given matrix size on a given machine. This chunk size should also be tailored to
the expression being executed, and this all is based on assuming that we have already fixed
the block size. So the first step appeared to be selecting the block size. For this purpose, we
ran the experiments with a selection of block sizes as shown in Table 3.5.
It should be mentioned that there were three constraints on selecting the block sizes. First,
Blaze forces the number of columns in a raw-major matrix to be divisible to SIMD register size
in order to be able to take advantage of vectorization. Second, we have selected the number
of columns in our blocks to be either divisible by cache line or to contain all the columns of
the matrix.
22
Figure 3.4. The results obtained from running DM atDM at ADD benchmark through Blaze-
mark for matrix size 690 with different combinations of block size and chunk size on 4 cores
The collected data, as seen in Figure 3.4, suggests two main points:
• For each selected block size, there is a range of chunk sizes that gives us the best perfor-
mance.
• Except for some uncommon cases, no matter which block size we choose, we are able
to achieve the maximum performance if we select the right chunk size.
This motivated us to move our search parameter from chunk size to grain size. As stated
earlier, grain size is the amount of work assigned to one HPX thread. Here we represent grain
size by number of floating point operations performed by a HPX thread. For example, per-
forming addition among two matrices, if we choose the block size as 4×64 and chunk size as
3, the grain size would be 3×4×64 = 768.
Note that in our experiments whenever the number of columns of the original matrix is
not divisible to the selected number of columns for block size, there would be a set of blocks
with less number of elements than the selected block size, this has been considered when
calculating the grain size.
By changing our focus to the grain size instead of the block size and the chunk size, Fig-
ure 3.5 shows how the throughput changes with regards to the grain size for the DM atDM at ADD
23
benchmark, for each specific block size. Each combination of block size and chunk size gen-
erates a point in the graph. On the other hand, Figure 3.1 looks at these graphs from another
aspect, keeping the problem size constant but changing the number of the cores to run the
benchmark on, instead.
Figure 3.5. The results obtained from running DM atDM at ADD benchmark through Blaze-
mark for matrix size 690 on 4 cores.
24
(a) matrix sizes smaller than 793
(b) matrix sizes larger than 793
Figure 3.6. Throughput vs. grain size graph obtained from running DM atDM at ADD bench-
mark on 4 cores.
25
Chapter 4. Polynomial Modeling of the Throughput in Terms of Grain Size
Looking at the throughput vs. grain size graphs and the consistent pattern observable
motivated us to try to model the relationship between throughput and grain size. In order
to simplify the process and eliminate the effect of different possible factors, we started with
limiting the problem to a fixed matrix size.
In our first attempt we used a 2nd degree polynomial to model throughput against grain
size. For each matrix size, we fitted the corresponding graphs shown in Figure 3.6 to a second
degree polynomial.
Figure 4.3 shows the results of using a quadratic function to fit the data for one matrix size
with different number of threads. We used the polyfit package from numpy library in python
for this purpose, which tries to minimize the least-square error over all the samples.
For our experiment, we divided the data into two sections, training and test. 60% of the
data was randomly chosen for the training part and the rest was considered as the test set.
The training set was used to find the best 2nd degree polynomial for the data, and once the
parameters were identified, the generated 2nd degree polynomial was applied to the test set
to measure how good our fit was performing.
For the matrix size 690 our dataset contained 117 data points, 72 of which was randomly
selected to build the model. The mean relative error for each number of cores, calculated
using equation 4.1, is represented in Figure 4.1 for training and test set. In this equation, ti
and pi denote the true value and the predicted value of the i th sample respectively, where n
is the number of samples with the particular number of cores.








Figure 4.1. The training and test error for fitting data obtained from the DM atDM at ADD
benchmark for matrix size 690 against different number of cores.
4.1. Generalizing the Fitted Function to Include Number of Cores
In this step, we try to generalize the fitted 2nd degree polynomial obtained from the previ-
ous step, represented by P = ag 2+bg +c, where P is the throughput and g is the grain size,by
looking at how the three parameters a, b, and c change when number of cores changes. A 3rd
degree polynomial seems to a reasonable fit for each of these parameters, in regard to number
of cores. In order to avoid overfitting, we excluded two of the data points(2 and 5) from the
data points used for fitting the polynomial and tested the fitted function on those two points
to see how well the function is working on unseen data points.
(a) parameter a (b) parameter b (c) parameter c
Figure 4.2. Fitting the parameters of the quadratic function with a 3rd degree polynomial from
the DM atDM at ADD benchmark for matrix size 690 against different number of cores.
27
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 4.3. The results of fitting the throughput vs grain size data into a 2d polynomial for
DM atDM at ADD benchmark for matrix size 690 with different number of cores on the test
data set.
28
Figure 4.4. The error in fitting the parameters a, b, and c for matrix size 690.
Using this 3rd degree polynomial to fit the parameters, we can generalize the relationship
between throughput and grain size in the following equation:
P = a11g 2N 3 +a10g 2N 2 + ...+a1N +a0 (4.2)
where P is the throughput, g is the grain size, and N is the number of cores and coefficients
a11, ..., a0 are the real values.
Knowing that a polynomial of degree 2 in terms of grain size and of degree 3 in terms of
number of cores, we can try to fit our original data directly to the above mentioned formula
(equation 4.2). The results of the original data obtained from running DM atDM at ADD
benchmark, the fitted polynomial based on equation 4.2, is represented in Figure 4.5, for 2,4,
and 8 cores for a matrix of size 690.
(a) 2 cores (b) 4 cores (c) 8 cores
Figure 4.5. Results of fitting the data from DM atDM at ADD benchmark with a polynomial of
degree 2 in terms of grain size and of degree 3 in terms of number of cores for matrix size 690.
29
(a) all data points included (b) leftmost sample excluded
Figure 4.6. The training and test error obtained from fitting the data to a polynomial of degree
2 in terms of grain size and of degree 3 in terms of number of cores for matrix size 690, for each
number of cores.
Figure 4.6a shows the obtained relative error on both training and test sets. The graph
suggests a higher test error compared to the training error, mostly caused by the left hand side
of the graph. The effect of removing the leftmost sample from error calculations is depicted in
Figure 4.6b.
Although we are interested in finding a model that results in a low training and test error,
our purpose is mainly finding the region that generates the highest performance. So, even
though our model might not match the original data in all data points, due to having a differ-
ent nature than a quadratic function, our focus would be on how this fit can help us to find
which range of grain sizes, or how big the task sizes should be, to achieve the highest perfor-
mance.
4.2. Finding the Range of Grain Sizes for Maximum Performance
The major advantage of using a quadratic function to fit the data in terms of grain size,
when number of cores is fixed, is the simplicity of the formula, which makes it possible for us
to find the peak of the graph very easily. n order to add some uncertainty to our prediction,
instead of finding the maximum of the quadratic function, we identified the range of grain size
that results in a performance within 10% of the maximum performance. For a second degree
polynomial in terms of g , P = ag 2 +bg + c, the minimum or maximum of the polynomial is
30
located at p∗ = −b2a , and a, b, c are 3rd degree polynomials of number of cores.
(a) 2 cores (b) 4 cores (c) 8 cores
Figure 4.7. The range of grain size (shown as the red line) that leads to a performance within
10% of the maximum performance.
(a) matrix size 690 (b) matrix size 523 to 912
Figure 4.8. The range of grain size within 10% of the maximum performance of the fitted
polynomial function for DM atDM at ADD benchmark for different number of cores.
Figure 4.8a shows the calculated range for matrix size 690 for each specific number of
threads, while Figure 4.8b compares the range for different matrix sizes.
4.3. Estimating the Optimum Range of Chunk Sizes
Once we identified a range of grain sizes that is expected to leads us to highest achievable
performances for a specific matrix size and a specific number of cores, the next step is finding
the possible combinations of block size and chunk size to achieve that range of grain sizes.
As stated earlier in this chapter, results obtained from Figure 3.5 suggests that with a fixed
grain size, our choice of block size does not affect the performance directly, as long as there
exist a chunk size that when combined by the block size could result in the specified grain size.
31
In our experiment, we selected our block size to be 4×256. With this assumption, in order
for the grain size to be within the specified range for each matrix size, chunk size has to be
within a specific range size too.
For example, for a 690 matrix we calculated the range of maximum performance for 4
cores to be [3.88,4.92] in logarithmic scale which is equivalent to [7586,83176]. Setting the
block size to 4×256, this range forces the chunk size to be within the range [9,90]. The range of
chunk sizes to match the range of grain sizes identified, and their corresponding throughput is
shown in Figure 4.9, for matrix size 690 and block size 4×256. The green line is the throughput
achieved by the current implementation of HPX backend. Since the graph from the original
data is skewed to right, we selected the point after the median of all the chunk sizes in the
range, as our candidate chunk size for this specific configuration.
(a) block size 4×256, 2 cores (b) block size 4×256, 4 cores (c) block size 4×256, 8 cores
(d) block size 4×512, 2 cores (e) block size 4×512, 4 cores (f) block size 4×512, 8 cores
Figure 4.9. The range of chunk sizes to produce a grain size within 10% of the maximum per-
formance of the fitted quadratic function for DM atDM at ADD benchmark for matrix size 690
with block size of 4×256 and 4×512 on 2, 4, and 8 cores. Silver points denotes the detected
range of chunk size, and the red star shows the median point.
32
4.4. Conclusions
In this chapter we offered a simple method to estimate the range of grain size, and conse-
quently the range of chunk size to lie in the region of the throughput against grain size graph,
with the highest throughput.
This quadratic model is very simple and finding the range of maximum performance is
very easy due to the nature of the function. But this model has some limitations.
First of all, it is not physical. The parameters of the model does not have a physical im-
plication. The other issue is that in this method one function was fitted for each matrix size
individually for simplification. Even though the characteristics of the function is the same for
all matrix sizes, the fitted parameters vary from one matrix size to another.
The matrix size should somehow be integrated into the model itself. For this purpose,
first we need to understand how changing the matrix size affects the relationship between
grain size and performance. One immediate effect of increasing the matrix size is an increase
in maximum possible grain size (right hand side of Figure 4.10 ) , while the minimum possible
grain size is the same.
Figure 4.10. Throughput vs grain size graph obtained from running DM atDM at ADD bench-
mark on 4 cores.
Moreover, Figure 4.11 shows how the predicted grain size range changes for different ma-
33
trix sizes for each number of cores.
Figure 4.11. The range of grain size within 10% of the maximum performance of the fitted
polynomial function for DM atDM at ADD benchmark for different number of cores for ma-
trix size 523 to 912.
Also, larger matrix sizes should be added to the experiments to validate the current results.
In the next chapter we look into a more robust and physical representation of how perfor-
mance changes with grain size, and consequently predicting the range of grain size and chunk
size to achieve the maximum performance.
34
Chapter 5. An Analytical Model for Predicting the Optimum Range of Grain
Size for Minimum Execution Time
In the previous section we studied the possibility of using a polynomial regression model
to capture the relationship between grain size, number of cores, and throughput for a fixed
matrix size, with the purpose of finding a range of grain size that leads us to maximum perfor-
mance. Although the polynomial function was helpful in directing us toward our objective of
finding the region of grain size with maximum performance, it lacked a physical implication,
and was not quite able to capture the overall behavior of the system.
This motivated us to change our view, and instead of looking only at the data and trying to
find a function to fit, we studied the behavior of the data, and then find a function that would
be likely to fit and explain the data. That function would be a good fit mostly because that’s
how we expect the throughput to change with grain size, and not solely how the data behaves.
In this chapter we attempt to understand the effect of grain size on the achievable speedup
in an asynchronous many-task runtime system, in order to develop an analytical model for
predicting the execution time in an asynchronous many task runtime system.
We have to note here that even if we were able to identify all the factors affecting the exe-
cution time, it is still very hard to find an analytical model describing the relationship between
these factors and the execution time. In this chapter we explain our effort in this direction by
starting from a simple benchmark. We create a formula based on our knowledge of the be-
havior of the system, and the collected data. Afterwards, we try to generalize the proposed
formula to arbitrary parallel for-loops with balanced work-load for each iteration.
Table 5.1 shows the notations used throughout this section. It should be noted that the
amount of work is measured in terms of execution time or floating point operations depend-
ing on the application. Chunk size is defined as the number of iterations included in one task,
while grain size is the amount of work contained in a task, to be executed by one user-level
thread.
In an attempt to find this analytical model, we started by looking into two factors that
35
Table 5.1. Table of Notations
Parameter Definition
N Number of cores
nt Total number of tasks created
ps Total amount of work
wmax Maximum amount of work assigned to a single core
tmax Execution time of wmax
M Number of utilized cores
tseq Sequential execution time
ti Execute time of one iteration
n Number of loop iterations
cs Chunk size
g Grain size
we believe play the most important role in determining the execution time: the overhead of
scheduling the tasks, and the maximum amount of work assigned to one core. In order to
study the effects of these factors on execution time of a balanced parallel for-loop, we define
iteration length, ti , as the time it takes to execute one iteration, n as total number of loop iter-
ations, cs as the chunk size which is the number of iterations to be executed by one user-level
thread. The grain size, g , then refers to the amount of work assigned to one task. Depending
on the parallel program being executed, this parameter has to be quantified somehow.
The total execution time could be assumed to be roughly the amount of time it takes for
the core with the maximum amount of work to execute the tasks it has been assigned to. Here
we denote the core with maximum expected amount of work as cor e0, and the amount of
work that has been assigned to it as wmax .
With this assumption, the main factors contributing to the execution time would be the
overhead of scheduling tasks on cor e0, the time it takes to run wmax amount of work on cor e0,
denoted with tmax , and the number of cores that will be executing the work(M). Depending
on the amount of work available, either all N cores or less than N cores will be performing the
work.
36
Equation 5.1 shows the expected formula in it’s simplest form.
executi on_t i me = tover head + tmax (5.1)
In equation 5.1, tover head represents the penalty that we have to pay for running the program
in parallel. We hold three major factors accountable for this overhead.
The first factor is the overhead of scheduling the tasks. Although this overhead is negligi-
ble for a small number of tasks, it becomes significant as the number of created tasks becomes
larger.




tasks would be created on cor e_0. If





portion of t_over head associated with n_t tasks.
The second factor is the overhead caused by work stealing. Although work stealing is a
very helpful method for load balancing, it induces an overhead due to constant efforts of idle
cores to steal work from the other cores. Each of these efforts would result in trashing the local
cache of the busy cores. Work stealing helps for load balancing when the number of tasks
created is greater than or equal to the number of cores, but as the number of tasks get smaller
than the number of cores, the effect of unsuccessful steal efforts becomes more noticeable.
HPX, by default, uses the priority local scheduling policy which creates one queue for each
core. When there is no more work left in one core’s queue, it will attempt to steal work from
other cores, starting from it’s neighbors. If the attempt is not successful, it will move on to the
next level neighbors. This continues until the core that initiates the steals(called the thief) is
able to find a core with tasks in their queue which can be stolen(called the victim).
HPX provides an optional command line parameter −−hpx : numa − sensi t i ve to make
sure that the thief would try the queues of the cores in the same NUMA domain first. This
option is provided based on the fact that it is much faster to access the local memory of a




Figure 5.1. The results of running the parallel for-loop benchmark with ps = 3000, on 4 and 8
cores, with and without work stealing. The vertical gray line shows the grain size that would
generate the same number of tasks as the number of cores, with the same amount of work for
all the cores.
38
Figure 5.1 shows the results of running a parallel for-loop benchmark with and without
work stealing. The benchmark consists of an HPX parallel for-loop with 3000 iterations and
is executed with different number of chunk sizes from 1 to 3000. Each iteration will keep the
core busy for 1µsec. This benchmark is explained more in the next section.
As it can be observed, in the region where the number of tasks created is same as the
number of cores, this effect is more significant.
Although we are not sure how work stealing is causing this effect, the effect is most notice-
able at certain points. Since it only adds to the execution time at those points, and our interest
is finding the flat region of the graph with the minimum execution time, at this point and for
this purpose we decided to ignore this term in our upcoming calculations.
The third factor is the overhead due to contention and coherency based on USL. Equation
5.2 shows how USL models the overheads due to contention(σ) and coherency(κ) in overall
execution time t , with sequential execution time of tseq , on N cores, when
tseq
N is the expected




1+σ(N −1)+κN (N −1) ⇒
t = tseq
N
+ σ(N −1) tseq
N
+ κN (N −1) tseq
N
(5.2)
Based on equation 5.2, we added the term σ(N − 1)tmax to equation 5.6 to represent the
overhead observed due to the contention effect, assuming tmax is the ideal execution time in
this problem. We ignored the overhead due to coherency since we do not expect to experience





+ σtmax(N −1) (5.3)
Moreover, we adjusted equation 5.3 by changing the total number of cores (N ) to the number
of cores that are actually executing the work (M). Since there are cases where there is not
enough work for all the cores to execute, and we expect overheads due to contention to be a
39
factor of the cores that are actually performing the work and not just all the cores.
Assuming that we are running our application on N cores, with a grain size equal to g , n_t
tasks are being created, and M cores are actually doing the work. If nt < N , M would be equal
to n_t , otherwise M = N .
M =

nt if nt < N
N otherwise
(5.4)
With these assumptions 5.3 then converts into:




+ tmax + σtmax(M −1) (5.5)
With this assumption equation 5.1 then becomes:





We had previously defined tmax as the time it takes to perform wmax amount of work, where
wmax is the maximum amount of work assigned to the cores.
For a balanced parallel for-loop, wmax can be calculated as:
wmax =

ps if N = 1












When N = 1, all the work will be assigned to the only available core, causing wmax to be equal





assigned to a core. Therefore, with a grain size of g each task would contain g amount of




assigned to a core.
40
It should be noted that the last chunk of work created could contain a smaller amount of
work than the cs , and this is the case when n%cs 6= 0. This, although causing a smaller amount
of work to be assigned to one of the cores, would not affect the maximum amount of work
assigned to the cores, unless when nt %N = 1. For these special cases where nt %N = 1 and




Figure 5.2 illustrates how wmax is calculated for this case. Each square represents a chunk,
or one task, containing g amount of work generating n_t chunks in total, and each column




⌉−1)+ g ′ = ps .
Figure 5.2. Distribution of nt tasks of size g among N cores.
As for tmax , the time to execute w_c amount of work, where total amount of work available
is ps , since we are looking into balanced parallel for-loops with almost same amount of work
41
at each iteration, we can estimate tmax as:
tmax = tseq wmax
ps
(5.8)
Where tseq is the to run the whole amount of work, ps , sequentially. Equation 5.3 is then
simplified into:






(1 + σ(M −1)) (5.9)
We refer to equation 5.9 as our proposed analytical model in the next sections.
In summary, we deducted the following list as determining factors of the execution time
in a balanced parallel for-loop:
• number of tasks created
• number of cores
• the maximum amount of work one core has to perform
• The maximum number of tasks assigned to one core, and the number of cores that are
actually performing the work are two other important factors that can be deducted from
the aforementioned factors.
In addition, the two factors α and σ represent the system-architecture that is constant
between runs. Invariability of the architecture assists us to better predict the behavior of the
same program with different input data as well as other new programs.
5.1. Evaluating the Proposed Model
In order to evaluate the model, we developed a benchmark of HPX parallel for-loops as
shown in Listing 5.1 and ran it on a wide range of iteration lengths, chunk sizes, and number
of threads. We refer to this benchmark as the for-loop benchmark. In each iteration, a while
loop keeps the core busy for a certain amount of time denoted as ti .
42
Listing 5.1: For-loop benchmark, a simple hpx for_loop used to study the effect of grain size
on the achieved parallelism.
1
2 // / // // // // /// // // // // /// // // // // /// // // // // /// // // // // /// // // // // // /// // // // //
3 void measure_function_futures_for_loop ( std : : uint64_t count , bool csv , std : : uint64_t chunk_size , std : : uint64_t
i t e r _ l e n g t h )
4 {
5 // s t a r t the clock
6 high_resolution_timer walltime ;
7 hpx : : p a r a l l e l : : for_loop ( hpx : : p a r a l l e l : : execution : : par . with (
8 hpx : : p a r a l l e l : : execution : : dynamic_chunk_size ( chunk_size ) ) ,
9 0 , count , [ & ] ( std : : uint64_t ) { worker_timed ( i t e r _ l e n g t h *1000) ; } ) ;
10 // stop the clock
11 const double duration = walltime . elapsed ( ) ;
12 p r i n t _ s t a t s ( " for_loop " , "par" , " paral le l_executor " , count , duration , csv ) ;
13 }
14
15 // / // // // // /// // // // // /// // // // // /// // // // // /// // // // // // /// // // // // /// // // // //
16 i n t hpx_main ( variables_map& vm)
17 {
18 const i n t r e p e t i t i o n s = vm[ " r e p e t i t i o n s " ] . as<int >( ) ;
19 num_threads = hpx : : get_num_worker_threads ( ) ;
20 const std : : uint64_t chunk_size = vm[ " chunk_size " ] . as<std : : uint64_t > ( ) ;
21 const std : : uint64_t i t e r _ l e n g t h = vm[ " i t e r _ l e n g t h " ] . as<std : : uint64_t > ( ) ;
22 const std : : uint64_t count = vm[ " num_iterations " ] . as<std : : uint64_t > ( ) ;
23 bool csv = vm. count ( " csv " ) != 0 ;
24 i f (HPX_UNLIKELY(0 == count ) )
25 throw std : : l o g i c _ e r r o r ( " error : count of 0 futures speci f ied \n" ) ;
26 for ( i n t i = 0 ; i < r e p e t i t i o n s ; i ++)
27 {
28 measure_function_futures_for_loop ( count , csv , chunk_size , i t e r _ l e n g t h ) ;
29 }
30 return hpx : : f i n a l i z e ( ) ;
31 }
32 // / // // // // /// // // // // /// // // // // /// // // // // /// // // // // // /// // // // // /// // // // //
33 i n l i n e void worker_timed ( std : : uint64_t delay_ns )
34 {
35 i f ( delay_ns == 0)
36 return ;
37 std : : uint64_t s t a r t = hpx : : u t i l : : high_resolution_clock : : now( ) ;
38 while ( true )
39 {
40 // Check i f we ’ ve reached the speci f ied delay .




45 // / // // // // /// // // // // /// // // // // /// // // // // // /// // // // // /// // // // // /// // // // //
46 i n t main( i n t argc , char * argv [ ] )
47 {
48 // Configure application−s p e c i f i c options .
49 options_description cmdline ( "usage : " HPX_APPLICATION_STRING " [ options ] " ) ;
50 cmdline . add_options ( ) ( " num_iterations " ,
51 value <std : : uint64_t > ( )−>default_value (500000) , "number of i t e r a t i o n s to invoke " )
52 ( " r e p e t i t i o n s " , value <int > ( )−>default_value ( 1 ) ,
53 "number of r e p e t i t i o n s of the f u l l benchmark" )
54 ( " i t e r _ l e n g t h " , value <std : : uint64_t > ( )−>default_value ( 1 ) , " length of each i t e r a t i o n " )
55 ( " chunk_size " , value <std : : uint64_t > ( )−>default_value ( 1 ) , "chunk s i z e " ) ;
56 // I n i t i a l i z e and run HPX.
57 return i n i t ( cmdline , argc , argv ) ;
58 }
43
By setting ti = 1µsec, and changing number of iterations n, and chunk sizes cs , we can
see how the execution time changes when the benchmark is executed on different number of
cores.
Having defined ps as the total amount of work that has to be performed, for this bench-
mark ps would be the time it takes to execute all the iterations, which is ti × n. Since we
have set ti = 1µsec in this benchmark, ps = n. On the other hand, for this specific problem
wmax = tmax , and tseq = ps .
In order to test how well the suggested model captures the relationship between execu-
tion time and grain size, we ran the benchmark with different configurations of number of
cores(N ), number of iterations(n), and chunk sizes (cs). For each n, we change cs from 1 to n
in logarithmic scale. Each of these runs was executed on 1,2,3, ...,8 cores.
Figure 5.3 shows the measured execution time in terms of grain size for ps = 100000, on
different number of cores.
Figure 5.3. The results of running the parallel for-loop benchmark with ps = 100000, on differ-
ent number of cores.
Using the collected data points for each problem size (ps), we utilized the optimize.curve_-
fit package from sci py library in Python to fit our model to the collected data. Figure 5.4 shows
44
the fitted model and the original data for ps = 100000, on 8 cores.
Figure 5.4. The results of predicted values of execution time through curve fitting vs the real
data for ps = 100000, for 8 cores.
To evaluate our model, we measured the relative error and R2 score of the prediction for
each problem size based on equation 5.10 and 5.11. In these equations pk is the predicted
value he sample k, tk is true measured value, and K is the total number of samples.











k=1 (tk −pk )2
V ar (t )
(5.11)
Figure 5.5 illustrates the calculated relative error and R2 score for problem sizes (ps) of 10000,
100000, 1000000, 10000000, 100000000 iterations. For each of the pss, 440, 512, 440, 512, and
728 data points were collected respectively, by running the benchmark with a range of dif-
ferent chunk sizes to cover different regions of grain size. The relative error and R2 score are
calculated for each specific number of cores individually.
As discussed earlier, this model depends on the number of cores, tasks created on them,
and the ratio of the maximum workload assigned to a single core and the problem size. Model
parameters, α and σ, depend on the system architecture and type of the parallel application,
45
and we wouldn’t expect them to change for different problem sizes.
Therefore, it would be sufficient to rely on collected data from one problem size rather
than all measured data to find parameters’ α and σ values.
In the next step, we used the parameters identified through curve-fitting based on the
data collected from one specific problem size, to estimate the predicted execution time of
other problem sizes, and measured the relative error and R2 score. The obtained results are
illustrated in Figure 5.6.
As it can be seen, using the data collected for ps = 1000 to estimate the model parameters
would generate higher prediction error on other problem sizes, but for other problem sizes we
don’t see a considerable change in prediction error when data from different problem sizes
were used to estimate the model parameters. Since larger problem sizes require more data





Figure 5.5. The relative error, and R2 score of fitting the measured data for different problem




Figure 5.6. The relative error and R2 score of using the model parameters estimated based
on the data collected from each problem size, to predict the execution time of other problem
sizes, calculated over each number of cores.
Fitting our model to all the 512 data points collected resulted in model parameters α =
2.416 and σ = 0.0251, where α represents the overhead of scheduling the tasks in µsecs, and
σ represents the contention defined in the Universal Scalibility Law.
We run the for-loop benchmark on the system we want to run our parallel application on
for ps = 100000 to estimateα andσ. Plugging the estimated parameters into ( 5.9 would create
the analytical model that could be used for other balanced parallel for-loop applications on
48
the same system.
Figure 5.7 shows the fitted curves along with the original data for different number of
cores.
5.2. Testing the Proposed Model on Other Architectures
On the next step we repeated the same process for the data collected from another archi-
tecture, Medusa node with a Skylake CPU. We used ps = 100000 as the base p_s on both ar-
chitectures. On Medusa, with ps = 100000, the model parameters were found to be α= 1.673
and σ= 0.023. The calculated relative error and R2 score for each individual number of cores
for ps = 100000 is demonstrated in Figure 5.9, and Figure 5.10, while the relative error and R2
score calculated for all other problem sizes is shown in Figure 5.9, and Figure 5.10.
49
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 5.7. The results of predicted values of execution time through curve fitting vs the real
data for ps = 100000, for different number of cores.
50
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 5.8. The results of predicted values of execution time through curve fitting vs the real




Figure 5.9. The relative error, and R2 score of fitting the collected data to the proposed an-





Figure 5.10. The relative error and R2 score of using the model parameters found from each
base ps fitting the proposed analytical model to the collected data for different ps , calculated
over each number of cores on Medusa node.
53
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 5.11. The results of predicted values of execution time through curve fitting vs the real
data for ps = 100000, for different number of cores on Medusa node.
54
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 5.12. The results of predicted values of execution time through curve fitting vs the real
data for ps = 100000000, for different number of cores on Medusa node.
55
5.3. Identifying the Optimal Range of Grain Sizes
The proposed model suggests a range of grain sizes resulting in the optimum execution
time for a given application running on a specific architecture. As illustrated earlier in Figure
2.1, the execution time versus grain size graph in logarithmic scale, denoted as the bathtub
curve, can be divided into three regions. We refer to these regions as the left side, the right
side, and the flat regions of the graph. Figure 5.13 shows an example of the flat region of
the execution time versus grain size graph in both linear and logarithmic scales. As it can be
observed, the flat region contains a small section of range of different grain sizes.
At the right side of the graph with large grain sizes, the number of tasks created is smaller
than the number of cores which results in idling at least one of the cores, while the other cores
are assigned a rather big chunk of work. The performance degradation we observe at those
points is associated with starvation. At these points, the number of cores actually performing
the work is equal to the number of tasks, since each core gets to execute at most one task. At
this region of the graph, the time to execute the maximum assigned work to a core (tmax) is
the dominant factor in determining the execution time.
On the other hand, on the left side of the graph with small grain sizes, we end up with
creating a large number of tasks. Since there is an overhead associated with each created
task, we observe a performance degradation in that region. As the grain size increases, the
number of created tasks, and the overhead associated to that decreases consequently. At this
region of the graph, the overhead of creating and managing the tasks is the dominant factor
in determining the execution time.
The third region of the graph, which we referred to as the flat region of the graph, is the
area between the two mentioned areas where we observe very little change in the execution
time. At this region, the overhead associated with creating and managing the tasks is amor-
tized by the execution time and since all the cores are utilized we do not observe a perfor-
mance degradation due to starvation.
Our intention in this dissertation is to find this flat region of the bathtub curve, as shown
56
in Figure 5.13a, in order to ensure that the effect of the two major sources of performance
degradation (overheads of scheduling tasks, and starvation) in execution time is minimized.
(a) logarithmic scale
(b) linear scale
Figure 5.13. The results of running the for-loop benchmark with ps = 100000, on 8 cores in
logarithmic and linear scale.
5.3.1. Left side of the Graph




) is the dominant
factor while the second term(tseq
wmax
ps
(1 + σ(M −1))) roughly stays constant. Likewise, for
large grain sizes the second factor is the dominant factor.
57
In order to find the lower-bound of the range for which the execution time stays constant,
we can assume that the second factor is constant in that region. Also we can change N to M ,
knowing that our concern is on the left side of the graph, where n_t is definitely greater than
the number of cores. Taking the derivative of the function based on the grain size then leads
to:
















From Formula 5.12, it can be observed that for the left side of the graph the rate of changes
is negative and decreases as the grain size increases. Here we are looking for the value of the
grain size for which the rate of change becomes very small (we introduce a thresholdλb , where
















Formula 5.13 can also be represented as shown in Formula 5.14. This representation
shows that when the ratio of the time it takes to execute one task to the total overhead of
creating and managing nt tasks on N core, is greater than a threshold, we will end up in the























5.3.2. Right side of the graph
Now looking at the right side of the graph, the overhead of creating the tasks becomes
negligible on that side, since only few tasks are being created and the overhead of creating
these many tasks is not significant compared to the execution time. On this side, the max-
imum amount of work executed by one core (wmax) and consequently the time to perform
this amount of work (tmax) is the dominant factor.
Formula 5.7 shows how wmax is calculated for different cases, but in general we can esti-
















if nt ≥ N
(5.15)





same but since the grain size is different, a different wmax would be resulted. For all the values




, as g increases the difference between wmax and
ps
N increases.
For example, considering a case where ps = 100000, and N = 8, for the grain sizes in range
of [4167, 6249] would result in creating between
⌈100000
6249
⌉ = 17 and ⌈100,0004,167 ⌉ = 24 tasks. This
amount of tasks created itself would result in
⌈17
8
⌉ = 3 and ⌈248 ⌉ = 3 tasks. On the other hand,
wmax = g ×
⌈nt
N
⌉ = 3× g , would have a value in range of [3×4167, 3×6249] = [12501, 18747],
where the average amount of work per core is psN = 12500. This means that for grain sizes closer
to the end of the range, we are observing that a much bigger amount of work is assigned to the
core with maximum amount of work, which would result in a higher execution time.
59





k −1 < nt
N
≤ k

















≤ g ≤ ps .
(5.17)
Otherwise, when k > 1,
ps
k ×N ≤ g <
ps




⌉= k, and wmax = g ×⌈ntN ⌉= k × g if nt %N 6= 1, we can conclude for k > 1:
k × ps




















= k & nt %N = 1 ⇒
nt = (k −1)×N +1 ⇒
(k −1)×N < ps
g
≤ (k −1)×N +1 ⇒
ps





From Formula 5.7 we know,
wmax = ps − (k −1)× (N −1)× g . (5.21)
Therefore,
(k −1)(N −1) ps









< w_max ≤ k × ps
(k −1)N +1
(5.22)
And for k = 1, where nt ≤ N ,
wmax = g , ps
N
≤ g ≤ ps ⇒
0 ≤ wmax − ps
N
= g − ps
N
≤ (N −1)× ps
N
(5.23)






0 ≤ i mbal ance_r ati o ≤ N −1 for k = 1
0 ≤ i mbal ance_r ati o < 1
k −1 for k > 1 and nt %N 6= 1
0 ≤ i mbal ance_r ati o < N −1





Formula 5.24 shows that as number of created tasks increases, as long as number of tasks
per core is the same, the imbalance factor decreases.
Figure 5.14 shows the imbalance ratio calculated for different grain sizes for ps = 10000,
on 8 cores. Each of the regions between two dashed green lines correspond to a specific value
for k = ⌈ntN ⌉.





for regions where nt %N 6= 1) at the end of the region. When k = 1,
i mbal ance_r ati o increases linearly starting from 0 and reaching the maximum of N − 1




decreases, therefore the upper-bound
for i mbal ance_r ati o increases.
Figure 5.14. The imbalance ratio calculated for different grain sizes for ps = 100000, on 8 cores.
At the area between each two green lines k = ⌈ntN ⌉ is constant.
Figure 5.15 represents the imbalance ratio, along with the ratio of the sequential execution
time over execution time(speed-up) against grain size for ps = 100000, ran on 8 cores. As it can
be observed, as the i mbal ance_r ati o increases, the speed-up decreases.
62
Figure 5.15. The imbalance ratio alongside speedup for different grain sizes for ps = 100000,
on 8 cores, where k = ⌈ntN ⌉.
To summarize, as the grain size increases, the maximum imbalance in the loads assigned
to the cores also increases, and some point on, this imbalance has a significant affect in the
execution time. We define a threshold, λs (0 < λs < 1), where for i mbal ance_r ati os smaller
than this threshold the imbalance effect is not significant. As we get close to this threshold, we
are likely to reach the right side of the flat region of the bathtub curve of the execution time
against grain size.
We are interested in finding the maximum grain size that would generate a reasonable
imbalance (i mbal ance_r ati o ≤ λs), to make sure we should stay in the flat region of the
bathtub curve of execution time against grain size, from load imbalance point of view.
Formula 5.23 states that for grain sizes greater than ps , i mbl ance_r ati o increases lin-
early with grain size from 0 to N − 1. While for grain sizes smaller than ps , the maximum
i mbal ance_r ati o depends on k = ⌈ntN ⌉. So, in order to ensure imbalance_ratio is smaller
than or equal to a threshold (λs), first we search the grain sizes smaller than
ps
N . Since 0 <λs <
1, and k ≥ 2 in this region, there exists a k such that 1k−1 ≤ λs . If there exists a kmi n (creat-
ing an imbalance ratio between 0 and 1kmi n−1 ), where
1
kmi n−1 ≤ λs , ∀k < kmi n maximum value
63
of i mbal ance_r ati o would be greater than λs . So in order to find the grain size that would
create maximum i mbal ance_r ati o of λs :
i mbal ance_r ati o ≤λs ⇒
1
k −1 ≤λs




















If g < gmax , we can ensure that i mbal ance_r ati o never exceeds λs . Since we already
found a match at grain sizes smaller than psN , checking the rest of grain sizes would not be
necessary.
5.4. Identifying the Range of Grain Size for Minimum Execution Time
In the previous section, we proposed a method to identify the lower-bound and the upper-
bound of the grain sizes for which we expect to observe the minimum execution time. Inte-












Where 0 ≤λs ≤ 1, and λb ,λs ¿ 1.














Which shows that the identified range of minimum execution time is located at the left
side of the grain size that equally divides the work between the cores, psN .
64
5.5. Locating the Optimum Range of Grain Sizes for the for-loop Benchmark
In this section we used Formula 5.26 to identify the flat region of the execution time vs
grain size graph for the parallel for-loop benchmark in Listing 5.1. For this purpose, we set
λb to 0.01 and λs to 0.1. The minimum and maximum acceptable value for grain size to lay
in the flat region of the graph are then calculated based on Formula 5.27. λb indicates the
slope of the graph at the left side of the graph where overhead of tasks is the dominant factor.













could generate an i mbal ance_r ati o of greater than λs . Therefore, for grain
sizes within the range in Formula 5.27 we are in a safe region.
Figure 5.16 and Figure 5.17 show the identified regions for minimum execution time, for
ps = 10000, 100000, 1000000, 10000000, 100000000, with 8 cores on two different architectures.
The gray dashed line represents the grain size where work is equally divided among the cores,
ps
N .
Selecting a greater value for λb would move the left border of the region to left, for a larger
acceptable slope of change of execution time in terms of grain size. On the other hand, select-
ing a smaller value forλs would result in shifting the right border of the region to left, imposing
a higher restriction on i mbal ance_r ati o, as shown in Figure 5.18(ran on M ar vi n node) and
Figure 5.27(ran on Medusa node).
65
(a) ps = 10000 (b) ps = 100000
(c) ps = 1000000 (d) ps = 10000000
(e) ps = 10000000
Figure 5.16. The identified range of grain size for minimum execution time for ps = 10000,
100000, 1000000, 10000000, 100000000, on 8 cores, with λb = 0.01 and λs = 0.1 on M ar vi n
node. The gray dashed line represents the grain size where work is equally divided among the
cores, psN .
66
(a) ps = 10000 (b) ps = 100000
(c) ps = 1000000 (d) ps = 1000000
(e) ps = 10000
Figure 5.17. The identified range of grain size for minimum execution time for ps = 10000,
100000, 1000000, 10000000, 100000000, on 8 cores, with λb = 0.01 and λs = 0.1 on Medusa
node. The gray dashed line represents the grain size where work is equally divided among the
cores, psN .
67
Figure 5.18. An example of the effect of λb and λs on the borders of the identified region for
minimum execution time due to change for ps = 100000000 and N = 8 on Marvin node.
Figure 5.19. An example of the effect of λb and λs on the borders of the identified region for
minimum execution time due to change for ps = 100000000 and N = 8 on Medusa node.
68
5.6. Applying the Proposed Model to the Blazemark Data
In the previous section we proposed an analytical model to predict how execution time
changes with grain size, and suggested a method to estimate the range of grain size for a spe-
cific ps ran on a certain number of cores to stay in the region of minimum execution time.
We then tested the proposed model on a simple parallel for-loop benchmark, and the results
seemed promising.
In this section we revisit our original problem of finding the range of chunk size for min-
imum execution time to calculate an expression using the Blaze linear algebra library. We
start by looking at the data collected from DM atDM at ADD benchmark from the Blazemark
suite. The benchmark calculates C = A +B where A, B , and C are row major square dense
matrices for 60 different matrix sizes from 1 to 7000. Since the parallelization threshold for
DM atDM at ADD is set to 36100 elements which is equivalent to a matrix of size 193 by de-
fault, we used matrix sizes from 230 to 7000 in our experiments. For each matrix size the
expression is evaluated 6 times and the average execution time of the last 5 runs is returned as
the output.
As we previously stated in Chapter 5, for the collected data, the DM atDM at ADD bench-
mark was executed with different configurations of bl ock_si ze, chunk_si ze, and number of
cores, for different matrix sizes. As discussed earlier the collected data suggested that instead
of looking at block_si ze and then chunk_si ze, we can study the effect of grain size on ex-
ecution time. By identifying the range of grain size that ensures staying in the flat region of
the bathtub, we can then select a reasonable value for bl ock_si ze, we can find the range of
chunk_si zes to generate the mentioned range of grain size. We define a reasonable value for
block size for an operation on row major matrices, as the size of a matrix with larger number
of columns than number of rows, where number of columns is divisible by cache line, and
the number of elements in the block multiplied by number of matrices involved in an opera-
tion, is smaller than L2 cache size. With these assumptions we selected the value of 4×256 for
bl ock_si ze for this benchmark.
69
In the previous section we provided Formula 5.5 as an analytical model to explain the
relationship between execution time and grain size in a balanced parallel for-loop.




+ tseq × wmax
ps
× (1 + σ× (M −1)) (5.28)
In Formula 5.5, ps refers to the total amount of work available. If we denote matrix size
with m, performing DM atDM at ADD benchmark would result in ps = 2×m2, which is the
total number of floating point operations needed to be performed.
Previously, using the benchmark we were able to find the value for the parameters of the
model on M ar vi n node as α = 2.416 and σ = 0.0251. Grain size, as the amount of work ex-
ecuted by one thread, for this problem is represented by the number of floating point opera-
tions performed by one thread. As an example, if bl ock_si ze = 4×256, and chunk_si ze = 3
performing DM atDM at ADD benchmark, would result in a grain size of 2×4×256×3 = 6144.
Figure 5.20 and Figure 5.21 show the results of applying the proposed analytical model
with the parameters identified through the benchmark, to DM atDM at ADD benchmark,
alongside the true values for the execution time, for m = 690 and m = 4222 respectively, for
different number of cores.
The DM atDM at ADD benchmark was run on M ar vi n node with Sand y Br i d g e archi-
tecture, with 256K B of level 2 cache, and 20MB of level 3 cache. In order for the 3 same sized
square matrices of type double involved in DM atDM at ADD benchmark (C = A +B) to fit
into level3 cache, they must be smaller than 935×935 in size.
For matrix sizes greater than 935, we observe an increase in execution time from our pre-
diction using the proposed analytical model, as represented in Figure 5.21. We have to pay a
penalty for loading data from memory instead of cache.
Figure 5.22 shows the relative error, while Figure 5.23 represents the R2 score both cal-
culated for each specific number of cores individually, and among matrix sizes smaller than
953.
70
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 5.20. The results predicting the execution time in terms of grain size based on the
proposed model compared to the original values for DM atDM at ADD benchmark with m =
690 on M ar vi n node, ran on different number of cores.
71
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 5.21. The results predicting the execution time in terms of grain size based on the
proposed model compared to the original values for DM atDM at ADD benchmark with m =
4222 on M ar vi n node, ran on different number of cores.
72
Figure 5.22. The relative error of predicting the execution time in terms of grain size based on
the proposed model for DM atDM at ADD benchmark on M ar vi n node. The relative error
of matrix sizes smaller than 953 was averaged for each specific number of cores.
Figure 5.23. The R2 score of predicting the execution time in terms of grain size based on the
proposed model for DM atDM at ADD benchmark on M ar vi n node. The relative error of
matrix sizes smaller than 953 was averaged for each specific number of cores.
For Medusa node with Sk yl ake architecture, with 1024K B of level 2 cache, and 28MB
of level 3 cache, in order for the matrices involved in the DM atDM at ADD benchmark to fit
into level3 cache, they must be smaller than 1097×1097 in size.
For matrix sizes greater than 1097, we observe an increase in execution time from our
prediction using the proposed analytical model, as represented in Figure 5.21. We have to pay
a penalty for loading data from memory instead of cache.
Figure 5.22 shows the relative error, while Figure 5.23 represents the R2 score both cal-
73
culated for each specific number of cores individually, among for matrix sizes smaller than
1097.
74
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 5.24. The results predicting the execution time in terms of grain size based on the
proposed model compared to the original values for DM atDM at ADD benchmark with m =
690 on Medusa node, ran on different number of cores.
75
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 5.25. The results of predicting the execution time in terms of grain size based on the
proposed model compared to the original values for DM atDM at ADD benchmark with m =
4222 on Medusa node, ran on different number of cores.
76
Figure 5.26. The relative error of predicting the execution time in terms of grain size based on
the proposed model for DM atDM at ADD benchmark on Medusa node. The relative error
of matrix sizes smaller than 1097 was averaged for each specific number of cores.
Figure 5.27. The R2 score of predicting the execution time in terms of grain size based on the
proposed model for DM atDM at ADD benchmark on Medusa node. The relative error of
matrix sizes smaller than 1097 was averaged for each specific number of cores.
Although we had completely ignored cache effects up until here for simplification, Fig-
ure 5.21 suggests that these effects do not influence the location of the flat region in the graph,
which is the area with very small changes in execution time over which we observe the lowest
execution time.
Figure 5.28 and Figure 5.29 show the identified range for minimum execution time based
on Formula 5.10 for different matrix sizes ran on 8 cores on M ar vi n and Medusa nodes
respectively.
77
(a) m = 690 (b) m = 912
(c) m = 1825 (d) m = 3193
(e) m = 4222 (f) m = 4855
(g) m = 6420
Figure 5.28. The identified range for minimum execution time based on the analytical model
vs the real data for DM atDM at ADD benchmark for m = 690, 912, 1825, 3193, 4222, 4855,
6420 on 8 cores on M ar vi n node, with λb = 0.01 and λs = 0.1.
78
(a) m = 690 (b) m = 912
(c) m = 1825 (d) m = 3193
(e) m = 4222 (f) m = 4855
(g) m = 6420
Figure 5.29. The identified range for minimum execution time based on the analytical model
vs the real data for DM atDM at ADD benchmark for m =, for m = 690, 912, 1825, 3193, 4222,
4855, 6420, on 8 cores on Medusa node, with λb = 0.01 and λs = 0.1.79
5.7. Predicting the Range of Chunk Size for Minimum Execution Time
Having identified the range of minimum execution time for each matrix size, we can find
the range of chunk size that would result in that range of grain size, for a specific block size.
As stated before, block size of 4×256 was selected as a safe choice for DM atDM at ADD
benchmark. For each grain size in the identified range, the equivalent chunk size is calculated.
Figure 5.30 shows the identified range of chunk size for minimum execution time based on
Formula 5.10, for different matrix sizes ran on 8 cores, when bl ock_si ze = 4×256. Each point
on the graph is the real data collected from running the DM atDM at ADD benchmarks. The
red area denotes the proposed range for chunk size, and the red points are the data points that
lie in this region. As it can be seen the predicted range for chunk size are reasonable.
80
(a) m = 690 (b) m = 912
(c) m = 1825 (d) m = 3193
(e) m = 4222 (f) m = 4855
(g) m = 6420
Figure 5.30. The suggested range of chunk size for minimum execution time with
bl ock_si ze = 4× 256, for DM atDM at ADD benchmark for m = 690, 912, 1825, 3193, 4222,
4855, 6420, on 8 cores on M ar vi n node, when λb = 0.1 and λs = 0.01.
81
(a) m = 690 (b) m = 912
(c) m = 1825 (d) m = 3193
(e) m = 4222 (f) m = 4855
(g) m = 6420
Figure 5.31. The suggested range of chunk size for minimum execution time with
bl ock_si ze = 4× 256, for DM atDM at ADD benchmark for m = 690, 912, 1825, 3193, 4222,
4855, 6420, on 8 cores on Medusa node, when λb = 0.1 and λs = 0.01.
82
In the next step we tested the proposed approach on DM atDM atDM at ADD bench-
mark. The benchmark calculates D = A +B +C , where A, B , C , and D are dense row-major
square matrices. The main difference between this benchmark and the previously studied
DM atDM at ADD benchmark is in the number of floating point operations executed. For
this problem ps = 2m2 instead of m2 observed for DM atDM at ADD benchmark. The other
issue that should be noted is that since 4 matrices are involved in this benchmark the require-
ment fr matrix sizes to fit into cache would be different. Running DM atDM atDM at ADD
benchmark, matrix sizes greater than 810 for M ar vi n node and 950 for Medusa node, would
experience performance degradation due to cache effects.
83
(a) m = 690 (b) m = 912
(c) m = 1825 (d) m = 3193
(e) m = 4222 (f) m = 4855
(g) m = 6420
Figure 5.32. The identified range for minimum execution time based on the analytical model
vs the real data for DM atDM atDM at ADD benchmark for m = 690, 912, 1825, 3193, 4222,
4855, 6420, on 8 cores on M ar vi n node, with λb = 0.01 and λs = 0.1.
84
(a) m = 690 (b) m = 912
(c) m = 1825 (d) m = 3193
(e) m = 4222 (f) m = 4855
(g) m = 6420
Figure 5.33. The identified range for minimum execution time based on the analytical model
vs the real data for DM atDM atDM at ADD benchmark for m = 690, 912, 1825, 3193, 4222,
4855, 6420, on 8 cores on Medusa node, with λb = 0.01 and λs = 0.1.
85
(a) m = 690 (b) m = 912
(c) m = 1825 (d) m = 3193
(e) m = 4222 (f) m = 4855
(g) m = 6420
Figure 5.34. The suggested range of chunk size for minimum execution time with
bl ock_si ze = 4×256 for DM atDM atDM at ADD benchmark, for m = 690, 912, 1825, 3193,
4222, 4855, 6420, on 8 cores on M ar vi n node, when λb = 0.1 and λs = 0.01.
86
(a) m = 690 (b) m = 912
(c) m = 1825 (d) m = 3193
(e) m = 4222 (f) m = 690
(g) m = 6420
Figure 5.35. The suggested range of chunk size for minimum execution time with
bl ock_si ze = 4× 256for DM atDM atDM at ADD benchmark, for m = 690, 912, 1825, 3193,
4222, 4855, 6420, on 8 cores on Medusa node, when λb = 0.1 and λs = 0.01.
87
5.8. Conclusions
In this section we offered an analytical model to predict the behavior of execution time in
terms grain size for a balanced parallel loop, and, based on that model, suggested a method
to estimate the range of grain size to ensure staying in the flat region of the bathtub curve of
execution time. The proposed model was then validated using a simple implemented bench-
mark. The parameters found through the benchmark was then utilized to predict the execu-
tion time of the DM atDM at ADD and DM atDM atDM at ADD benchmark from Bl azemar k
suite. We were also able to find the range of grain sizes to achieve minimum execution time.
At this point by selecting a reasonable value for block size, the range of chunk size to ensure
staying in that range of grain size could be identified.
We propose to run the parallel for-loop benchmark in Listing 5.1 for a reasonable ps =
100000, with different values for chunk size on different number of cores, and use the collected
data to estimate the α and σ parameters through curve fitting. This process could be added
as an extra step to building process of Blaze. The complexity of the expression to be evaluated
is then calculated at compile time as the number of floating point operations.
A block size should also be selected based on the available cache size, the cache line, and
the number of matrices involved in the operation. This part could also be done at compile
time. At runtime, based on the matrix size and number of cores chosen to run the program
on, the range of chunk size to achieve the lowest execution time for each specific expression
is identified.
It should be mentioned that although the proposed formula for execution time vs grain
size depends on the sequential execution time, the proposed range for minimum execution
time does not rely on sequential execution time, but depends on total amount of work, num-
ber of cores, overhead of creating and managing tasks, and the values we choose for the
threshold on both sides(λb and λs).
In order to evaluate our proposed approach we chose the middle point of the collected
chunk sizes within the range as our suggested chunk size, and compared the execution time
88
obtained with this grain size with the minimum achievable execution time among with all
different chunk sizes, and the chunk size that would create as many tasks as the number of
cores(represented with equal). Speedup is calculated as the ratio of equal or min execution
time to the excution time of the proposed chunk size.
Table 5.2 shows the percentage of the individual runs(a specific matrix size and number of
cores) for which an improvement from the execution time when as many tasks as the number
of cores as created, along with the percentage of runs for which the observed execution time is
within the 10% range of the minimum execution time. The collected data shows that using the
analytical model to find the optimum grain size helps us improve the achievable performance
through creating as many tasks as the number of cores, in most of the cases. Results of running
the benchmark on 1 core was excluded from the calculations.
Table 5.2. Comparing the obtained results from using the analytical model to find the opti-
mum grain size with minimum execution time on M ar vi n node
Node Benchmark Improvement from equal(%) Within 10% of the minimum execution time
M ar vi n DM atDM at ADD 57% 68%
M ar vi n DM atDM atDM at ADD 58% 70%
Medusa DM atDM at ADD 63% 62%
Medusa DM atDM atDM at ADD 64% 60%
89
(a) M ar vi n node
(b) Medusa node
Figure 5.36. The average speedup compared to creating as many tasks as number of
cores(equal) and the minimum execution time(min) for DM atDM at ADD benchmark on
M ar vi n and Medusa node. The speedup was averaged over all matrix sizes for each spe-
cific number of cores.
90
(a) M ar vi n node
(b) Medusa node
Figure 5.37. The average speedup compared to creating as many tasks as number of
cores(equal) and the minimum execution time(min) for DM atDM atDM at ADD benchmark
on M ar vi n and Medusa node. The speedup was averaged over all matrix sizes for each spe-
cific number of cores.
91
Chapter 6. Runtime Task Granularity Control through Splittable Tasks
6.1. Splittable Tasks
In the previous chapter we proposed an analytical model to estimate the execution time
of a balanced parallel for loop in terms of the grain size. Based on this model, we offered an ap-
proach to find the range of grain sizes to achieve minimum execution time. The parameters of
the proposed model are identified through a benchmark and are exposed to the Blaze library
to predict the range of grain sizes for minimum execution time of a problem at run-time. So
the proposed method is a combination of compile-time and run-time solution to improve the
performance.
In this chapter we utilize our knowledge from identifying the optimum range of grain size
based on the analytical model to extend a previously implemented algorithm for controling
task granularity.
Utilizing splittable tasks is a runtime adaptive method for managing task granularity, to
avoid the large overhead of creating and managing too many tasks due to the fine grain par-
allelism on one hand, and the starvation resulted from creating less tasks than the available
parallelism on the other hand.
Splittable tasks are tasks that could be partitioned into smaller tasks, when sufficient par-
allelism is available [48].
Prell [48] intensively studies using splittable tasks for runtime adaptivity. They start by
offering steal-half strategy for work stealing in order to steal half of the tasks from a worker’s
queue at each steal attempt instead of just one task. This could help to avoid creating too
much work stealing overhead. But depending on the application, steal-one or steal-half might
be the preferable method for work stealing. They propose an adaptive method to decide on
whether to use steal-one or steal-half strategy at runtime, based on the current selected strat-
egy and the ratio of the number of executed tasks(M) within the last N steals, where N is
considered the evaluation interval [48].
Next, based on lazy task scheduling [49], they suggest instead of creating at the tasks and
92
deciding on how to the workers should steal them, we can create one task and let it split if
needed. This way you wouldn’t have to deal with the overhead of scheduling tasks along with
the overhead of stealing the tasks.
At each split, the original task is split into two parts. One part would be executed by the
current worker, while the other part would be added to the current worker’s deque as a split-
table task to be stolen by idle workers. The split would only occur if the number of iterations
assigned to the task is greater than a parameter K , which is a factor of the total number of
iterations divided by the number of the workers [48].
There are different strategies for splitting, namely, Binary, Guided, and Adaptive splitting.
In binary splitting [49], the task is split into two same sized tasks, while in guided splitting
1
chunks of the task is executed by the current worker and the rest is added to the worker’s deque.
The chunks parameter is initialized to the number of workers and is decremented at each
split. This type of split would be beneficial if many of the workers are idle. Adaptive splitting
on the other hand, initializes chunks parameter to number of workers like guided splitting,
but adaptively changes it to the number of idle workers at each split [48].
Our work here is based on Prell [48]’s definition of splittable tasks and their split strategies.
We have implemented an executor within HPX which would create splittable tasks to execute
the work, and we improved upon their work in following ways:
• At each split, instead of allowing the created splittable task to be stolen by free workers,
we turn work stealing off, identify the idle workers and explicitly assign the work to one
of them. This way we avoid the work stealing overhead originated from unsuccessful
steal attempts.
• We suggest Utilizing our proposed method for identifying the optimum range of grain
size for minimum execution time, and use the lower-bound of the range as a cutoff value
specific to system architecture and the application being executed, to stop splitting.
93
6.2. Implementation
For a parallel for-loop with the range of [a,b), one splittable task containing all the itera-
tions from a to (b −1) would be created. Depending on the splitting strategy when a certain
condition is met this task would be split into two tasks, T1 containing iterations in the range
of [a,c) and T2 would contain iterations from [c,b), where c is the split point. T1 would be
executed by the current worker while T2, which is also a splittable task, would be scheduled to
run on another core. This allows the runtime to adaptively decide whether to split the current
task into smaller tasks or just run it.
The main difference between guided and adaptive splitting is the size of the task to be
executed by the current thread and how the generated task is scheduled.
In guided splitting the split iteration index c is calculated as c = b−aN , where N is the total
number of cores. The generated task, which is also splittable, would then be assigned to the
next available worker. On the other hand, in adaptive splitting c = b−aM+1 where M is the number
of idle cores at the time split occurred. The created splittable task is then scheduled to be exe-
cuted by the first identified idle core. The idle cores can be identified through an API function
in HPX and are queried at the time of split. The API function returns a bitmask with the size of
the number of cores, where a 1 at i th bit shows that the i th core is idle while a 0 represents a
busy core.
In Adaptive splitting, after querying the idle cores, starting from the least significant bit,
the first bit with value of 1 would be the core the task would be scheduled on. In order to make
implicit scheduling of the work on the cores possible, and consequently to avoid the unnec-
essary overhead associated with work stealing, the splittable executor turns work stealing off
automatically. Table 6.1 summarizes these two splitting strategies.
94
Table 6.1. Task split and scheduling in Guided and Adaptive splitting. a is the index of the first
iteration, b is the last iteration, and c is the split location.
Guided Splitting Adaptive Splitting
c = b−aN where N is the number of cores c = b−aM+1 where M is the number of idle cores
Task T1 contains range [a, c) Task T1 contains range [a, c)
executed by the current worker executed by the current worker
Task T2 contains range [c, b) Task T2 contains range [c, b)
executed by the next available worker scheduled to be executed by
the first identified idle worker
Figure 6.1 and Figure 6.2 illustrate the results of running the for-loop benchmark in List-
ing 5.1 using the splittable executor for ps = 100000 and ps = 100000000, in two modes guided
and adaptive as shown in Table 6.1 on M ar vi n node. The results show that the splittable
executor was successful in controlling the task granularity, in the sense that the obtained exe-
cution time was in the range of the optimum achievable execution time by changing the grain
size. As it can be seen, the results for guided and adaptive splitting are very close to each other,
since the benchmark forces the execution time of each iteration to be fixed.
Figure 6.3 and Figure 6.4 shows the results of running the for-loop benchmark using the
splittable executor for ps = 100000 and ps = 100000000, on Medusa node.
95
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 6.1. The results of running the for-loop benchmark using splittable tasks with guided
and adaptive splitting for ps = 100000 on M ar vi n node, for different number of cores.
96
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 6.2. The results of running the for-loop benchmark using splittable tasks with guided
and adaptive splitting for ps = 100000000 on M ar vi n node, for different number of cores.
97
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 6.3. The results of running the for-loop benchmark using splittable tasks with guided
and adaptive splitting for ps = 100000 on Medusa node, for different number of cores.
98
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 6.4. The results of running the for-loop benchmark using splittable tasks with guided
and adaptive splitting for ps = 100000000 on Medusa node, for different number of cores.
99
6.3. Using the Proposed Analytical Model to Estimate the Stop Splitting Threshold
Intel’s TBB(threading Building Blocks) [50] utilizes two strategies for work stealing, Sin-
gle Partitioner(SP) and Adaptive Partitioner(AP). SP introduces a stop splitting threshold(sst)
which uses a parameter identified by the programmer to stop splitting. A task would not split
if the number of iterations in the task is smaller than this threshold [49].
Auto partitioner, on the other hand, adds runtime adaptivity by creating K ×P chunks,
where P is the number of cores [51]. They assign a variable n to each range, initializing it to
K ×P , every time a split occurs n is halved. They also add another constant parameter V , each
time a task is stolen the value of n for that task is required to be at least V . Currently both K
and V are set to 4 in their implementation [51].
Instead of setting one static threshold for splitting for every application, [48] suggest to
stop splitting when the number of the remaining iterations gets smaller than a threshold K
which is a fraction of the number of iterations divided by the number of cores, for runtime
adaptivity.
Here we propose to utilize the identified optimum range of grain sizes through our ana-
lytical model to adjust the split threshold based on the system architecture and the specific
application being executed. For this purpose, a member variable with default value of 0, was
added to the splittable task implementation denoting the minimum number of iterations each
created task should include. On the other side, the implementation of the HPX backend for
Blaze was changed to create splittable tasks with a minimum size(number of iterations) cal-
culated at runtime for each expression being evaluated specific to the system architecture.
Listing 6.1 shows the modified implementation of the HPX backend for Blaze.
Figures 6.5 and 6.6 show the results of applying the threshold calculated through the
analytical model for two matrix sizes, 264, 1825, for DM atDM at Add benchmark from Blaze-
mark. As it can be seen, the effect of applying the threshold is more noticeable for smaller
problem sizes, where the penalty of creating too many tasks is considerable compared to the
total execution time.
100
Listing 6.1: New implementation of Assign function for HPX backend in Blaze.
1 template< typename MT1 // Type of the l e f t −hand side dense matrix
2 , bool SO1 // Storage order of the l e f t −hand side dense matrix
3 , typename MT2 // Type of the right−hand side dense matrix
4 , bool SO2 // Storage order of the right−hand side dense matrix
5 , typename OP > // Type of the assignment operation
6 void hpxAssign ( DenseMatrix<MT1, SO1>& lhs , const DenseMatrix<MT2, SO2>& rhs , OP op )
7 {
8 using hpx : : p a r a l l e l : : for_loop ;
9 using hpx : : p a r a l l e l : : execution : : par ;
10 BLAZE_FUNCTION_TRACE;
11 using ET1 = ElementType_t<MT1>;
12 using ET2 = ElementType_t<MT2>;
13 constexpr bool simdEnabled ( MT1 : : simdEnabled && MT2 : : simdEnabled && IsSIMDCombinable_v<ET1 , ET2> ) ;
14 constexpr s i z e _ t SIMDSIZE( SIMDTrait< ElementType_t<MT1> > : : s i z e ) ;
15 const bool lhsAligned ( (~ lhs ) . isAligned ( ) ) ;
16 const bool rhsAligned ( (~ rhs ) . isAligned ( ) ) ;
17 const s i z e _ t threads ( getNumThreads ( ) ) ;
18 const s i z e _ t numRows ( min( s t a t i c _ c a s t <std : : s ize_t >( BLAZE_HPX_MATRIX_BLOCK_SIZE_ROW ) ,(~ rhs ) . rows ( ) ) ) ;
19 const s i z e _ t numCols ( min( s t a t i c _ c a s t <std : : s ize_t >( BLAZE_HPX_MATRIX_BLOCK_SIZE_COLUMN ) , (~ rhs ) . columns
( ) ) ) ;
20 const s i z e _ t rest1 ( numRows & ( SIMDSIZE − 1UL ) ) ;
21 // const s i z e _ t rowsPerIter ( ( simdEnabled && rest1 ) ? ( numRows − rest1 + SIMDSIZE ) : ( numRows ) ) ;
22 const s i z e _ t rowsPerIter ( numRows ) ;
23 const s i z e _ t addon1 ( ( ( (~ rhs ) . rows ( ) % rowsPerIter ) != 0UL ) ? 1UL : 0UL ) ;
24 const s i z e _ t equalShare1 ( (~ rhs ) . rows ( ) / rowsPerIter + addon1 ) ;
25 const s i z e _ t rest2 ( numCols & ( SIMDSIZE − 1UL ) ) ;
26 const s i z e _ t c o l s P e r I t e r ( ( simdEnabled && rest2 ) ? ( numCols − rest2 + SIMDSIZE ) : ( numCols ) ) ;
27 const s i z e _ t addon2 ( ( ( (~ rhs ) . columns ( ) % c o l s P e r I t e r ) ! = 0UL ) ? 1UL : 0UL ) ;
28 const s i z e _ t equalShare2 ( (~ rhs ) . columns ( ) / c o l s P e r I t e r + addon2 ) ;
29
30 hpx : : p a r a l l e l : : execution : : splittable_mode sptMode = hpx : : p a r a l l e l : : execution : : splittable_mode : : a l l ;
31 s i z e _ t minChunkSize = 0 ;
32 i f ( BLAZE_SPLIT_ADAPTIVE )
33 {
34 double alpha = s t a t i c _ c a s t <double >( BLAZE_HPX_PARAM_ALPHA ) ;
35 double lambdaB = s t a t i c _ c a s t <double >( BLAZE_HPX_PARAM_LAMBDA_B ) ;
36 minChunkSize = c e i l ( c e i l ( sqrt ( alpha * (~ rhs ) . rows ( ) * (~ rhs ) . columns ( ) / ( threads * lambdaB ) ) ) / (
numRows * numCols ) ) ;
37 }
38 i f (BLAZE_HPX_SPLIT_TYPE_IDLE == 1 )
39 {
40 sptMode = hpx : : p a r a l l e l : : execution : : splittable_mode : : i d l e ;
41 }
42 hpx : : p a r a l l e l : : execution : : s pl i t t a b l e _ e x e c u to r spt ( sptMode , minChunkSize ) ;
43 for_loop ( par . on( spt ) , s i z e _ t ( 0 ) , equalShare1 * equalShare2 , [ & ] ( i n t i )
44 {
45 const s i z e _ t row ( ( i / equalShare2 ) * rowsPerIter ) ;
46 const s i z e _ t column( ( i % equalShare2 ) * c o l s P e r I t e r ) ;
47 i f ( row >= (~ rhs ) . rows ( ) | | column >= (~ rhs ) . columns ( ) )
48 return ;
49 const s i z e _ t m( min( rowsPerIter , (~ rhs ) . rows ( ) − row ) ) ;
50 const s i z e _ t n( min( colsPerIter , (~ rhs ) . columns ( ) − column ) ) ;
51 i f ( simdEnabled && lhsAligned && rhsAligned ) {
52 auto t a r g e t ( submatrix<aligned >( ~lhs , row , column , m, n ) ) ;
53 const auto source ( submatrix<aligned >( ~rhs , row , column , m, n ) ) ;
54 op( target , source ) ;
55 }
56 else i f ( simdEnabled && lhsAligned ) {
57 auto t a r g e t ( submatrix<aligned >( ~lhs , row , column , m, n ) ) ;
58 const auto source ( submatrix<unaligned >( ~rhs , row , column , m, n ) ) ;
59 op( target , source ) ;
60 }
61 else i f ( simdEnabled && rhsAligned ) {
62 auto t a r g e t ( submatrix<unaligned >( ~lhs , row , column , m, n ) ) ;
63 const auto source ( submatrix<aligned >( ~rhs , row , column , m, n ) ) ;
64 op( target , source ) ;
65 }
66 else {
67 auto t a r g e t ( submatrix<unaligned >( ~lhs , row , column , m, n ) ) ;
68 const auto source ( submatrix<unaligned >( ~rhs , row , column , m, n ) ) ;
69 op( target , source ) ;
70 }
71 } ) ;
72 }
101
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 6.5. The results of running the DM atDM at Add benchmark using splittable tasks with
adaptive and thresholded adaptive splitting for matrix size 264 on M ar vi n node, for different
number of cores.
102
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 6.6. The results of running the DM atDM at Add benchmark using splittable tasks with
adaptive and thresholded adaptive splitting for matrix size 1825 on M ar vi n node, for differ-
ent number of cores.
103
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 6.7. The results of running the DM atDM at Add benchmark using splittable tasks with
adaptive and thresholded adaptive splitting for matrix size 264 on Medusa node, for different
number of cores.
104
(a) 1 core (b) 2 cores
(c) 3 cores (d) 4 cores
(e) 5 cores (f) 6 cores
(g) 7 cores (h) 8 cores
Figure 6.8. The results of running the DM atDM at Add benchmark using splittable tasks with
adaptive and thresholded adaptive splitting for matrix size 1825 on Medusa node, for differ-
ent number of cores.
Figure 6.9 compares the obtained results using the splittable executor in guided, adaptive,
105
guided with threshold, adaptive with threshold, and creating as many tasks as the number of
cores by dividing the total amount of work equally among the cores(denoted with "equal"),
while Figure 6.10 compares he results with the minimum execution time observed from the
collected data with different grain sizes. In Figure 6.9 the speedup denotes the ratio of the exe-
cution time in the corresponding mode and the observed execution time when same number
of tasks as the number of cores are created, while in Figure 6.10 speedup is the ratio of the ex-
ecution time to the minimum execution time collected. In the plots, the speedup is averaged
over all the matrix sizes for each specific number of cores.
106
(a) M ar vi n node
(b) Medusa node
Figure 6.9. The average speedup compared to creating as many tasks as number of cores(equal
mode) for different modes for DM atDM at ADD benchmark on M ar vi n and Medusa node.
The speedup was averaged over all matrix sizes for each specific number of cores.
107
(a) M ar vi n node
(b) Medusa node
Figure 6.10. The average speedup compared to the minimum execution time for different
modes for DM atDM at ADD benchmark on M ar vi n and Medusa node. The speedup was
averaged over all matrix sizes for each specific number of cores.
108
(a) M ar vi n node
(b) Medusa node
Figure 6.11. The average speedup compared to creating as many tasks as number of
cores(equal mode) for different modes for DM atDM atDM at ADD benchmark on M ar vi n
and Medusa node. The speedup was averaged over all matrix sizes for each specific number
of cores.
109
(a) M ar vi n node
(b) Medusa node
Figure 6.12. The average speedup compared to the minimum execution time for different
modes for DM atDM atDM at ADD benchmark on M ar vi n and Medusa node. The speedup
was averaged over all matrix sizes for each specific number of cores.
Based on the collected data, using splittable tasks helps us improve the performance
achieved through creating as many tasks as the number of cores, in most of the cases, for
all the guided, guided with threshold, adaptive, and adaptive with threshold modes, as shown
in Table 6.2 and Table 6.3 for DM atDM at ADD benchmark and Table 6.4 and Table 6.5 for
DM atDM atDM at ADD benchmark. Results of running the benchmark on 1 core was ex-
cluded from the calculations.
110
Table 6.2. Comparing the obtained results from splittable executor with minimum execution
time for DM atDM at ADD benchmark on M ar vi n node
mode Improvement from equal(%) Within 10% of the minimum execution time
adaptive 86% 90%
adaptive with threshold 73% 84%
guided 94% 90%
guided with threshold 83% 87%
Table 6.3. Comparing the obtained results from splittable executor with minimum execution
time for DM atDM at ADD benchmark on Medusa node
mode Improvement from equal(%) Within 10% of the minimum execution time
adaptive 83% 69%
adaptive with threshold 83% 66%
guided 93% 66%
guided with threshold 93% 68%
Table 6.4. Comparing the obtained results from splittable executor with minimum execution
time for DM atDM atDM at ADD benchmark on M ar vi n node
mode Improvement from equal(%) Within 10% of the minimum execution time
adaptive 91% 87%
adaptive with threshold 92% 87%
guided 97% 91%
guided with threshold 97% 91%
Table 6.5. Comparing the obtained results from splittable executor with minimum execution
time for DM atDM atDM at ADD benchmark on Medusa node
mode Improvement from equal(%) Within 10% of the minimum execution time
adaptive 84% 71%
adaptive with threshold 89% 70%
guided 93% 72%
guided with threshold 93% 72%
111
Chapter 7. Related Work
In this chapter we briefly study some of the related research efforts in this dissertation’s
area of focus.
Loop scheduling techniques have been extensively studied by different researchers. In
[52] the authors propose a hybrid static/dynamic method for loop scheduling that improves
the performance of dense matrix factorization, compared to both fully static and fully dy-
namic scheduling. The authors of [52], divide the dependency graph into two subgraphs, one
of which is scheduled dynamically and the other one is scheduled statically. The tasks on the
critical path are scheduled statically and each thread is forced to prioritize the static tasks[52].
They were able to improve data locality and scheduling overhead, while creating a more bal-
anced workload.
Recently there have been efforts to use machine learning to predict the execution time
or throughput in parallel applications. Khatami et al. [53] defined a set of features that could
be extracted from the compiler including deepest loop level, total number of operations per
iteration, number of floating and overall operations per iteration, and a set of runtime features
including number of threads, and number of iterations. They developed a logistic regression
model to predict the optimum chunk size of a parallel for-loop, from the options 0.1%, 1%,
10%, and 50% of the total number of iterations.
In an attempt similar to our work, Laberge et al. [54] use machine learning to estimate
the optimum chunk size for evaluation of different array operations and array sizes in Blaze
library. Defining the features as array size, number of cores, operation type, and machine ar-
chitecture, the authors developed two random forest models. One, as a classification model,
would predict the optimum chunk size; while the other one, as a regression model, predicts
the achievable performance. The optimum chunk size, would then be identified through find-
ing the predicted performance for each chunk size and selecting the chunk size that would re-
sult in the maximum performance. In the classification model, the random forest model was
modified to use the difference in real and predicted values for throughput as the loss func-
112
tion, which was helpful in improving the achieved accuracy. The authors also integrated their
model into a fork of Blaze library in order to use the model predictions to set the chunk size
at runtime. The drawbacks of this method were that the predicted chunk size was limited to
the set 1,2,3, ...,10 which is not always a good choice, and also their method required collect-
ing a considerable amount of data from different architectures in order to be generalized their
model [54].
As another field to use machine learning, [55] collects seven runtime events and uses ma-
chine learning not to predict the performance, but to schedule the tasks. These events in-
clude, task creation, suspension, execution, completion, implicit/explicit barrier, parallel re-
gion, and finally loop/master/single region runtime events, collected through the OMPT using
ORA API. Experimenting with four different machine learning techniques, including support
vector machine, random forest, neural networks, and naive bayes, they would select one spe-
cific task pool configuration out of the three pre-defined options as the final classification
result. Testing this framework on a real life molecular dynamics application, they observed an
up to 31% improvement in performance.
The authors of [56] propose using machine learning to predict the optimal number of
threads, and also the optimal scheduling policy for running an OpenMP application. Through
that, they were able to develop an automatic compiler-based method to map a parallel appli-
cation to a multicore processor. They collect three types of features namely, code, data, and
runtime features. Code features are extracted from the code directly, and they include cycles
per instruction, number of branches, load and store instructions, and computations per in-
struction. While the code features could be collected statically at compile time, the data and
run-time features are collected through low-cost profiling runs. This group of features include
loop iteration count, branch miss rate, and L1 data cache miss rate. The authors then use an
artificial neural network to predict the speedup achieved for a program with certain number
of threads, and at the same time they use a support vector machine model to predict the best
scheduling policy, out of block, cyclic, dynamic, and guided scheduling policies, for an unseen
113
program[56].
In [57], the authors offer a combination of compile-time and run-time solution for adap-
tive control of task granularity. They create multiple transformed versions of the code with
different levels of task unrolling at compile time and then use a heuristic based on task de-
mand (the number of unsuccessful steal attempts by other workers) and each worker’s queue
length, to select one of the versions each time a new task is spawned[57]. Their solution relies
completely on the compiler and the run-time, and eliminates the need for manual support.
In [58], the authors try to find the optimum task granularity to keep the scheduling over-
heads and resource utilization in balance in an asynchronous many-task runtime system.
They utilize a system emulator to study the effect of task granularity in system performance,
and conclude that the optimum task size with representative schedulers is between 1.2x104
and 10x104 cycles on a system with up to 1024 cores. They also provide an automatic algo-
rithm to aggregate tasks into larger tasks based on the concluded task granularity in order to
improve the performance.
Grubel et al. [20] also studies the effect of the task size on performance of the HPX applica-
tions. They introduce some performance metrics in order to help them identify the optimum
grain size which could be very helpful for runtime adaptivity.
In [59], the authors use thresholds to decide on whether to inline a task at runtime. The
imposed threshold for task inlining on a specific architecture then converts into the problem
to what portion of the execution time of the task should be spent for scheduling the task,
so that it would be worth to be executed as a separate task. This is in compliance with our
findings in this paper for λb , as shown in (5.14), where we suggest in order to land in the
flat region of the execution time versus grain size graph, the ratio of the grain size over the
scheduling over head of one task one one core should be greater than a threshold.
Tzannes et al.[49] proposed Lazy Binary Splitting(LBS) that adjusts the available paral-
lelism based on the inferred load at run-time, to avoid the unnecessary parallelism overhead.
They work improve upon Intel’s TBB(threading Building Blocks) [50] implementation of Ea-
114
ger Binary Splitting(EBS), by first postponing splitting to when parallelism is available, and
second applying a runtime adaptive threshold to stop splitting.
115
Chapter 8. Conclusions
In this dissertation, our focus is on using task granularity for runtime adaptivity in asyn-
chronous many-task runtime systems, in order to improve the performance of multi-threaded
linear algebra libraries. We selected HPX, the C++ Standard library for parallelism and concur-
rency, and Blaze, a high performance C++ math library, for this purpose.
At the first step, using polynomial regression, we were able to provide an estimate for the
range of grain size for maximum throughput with a specific number cores. But this model
was not physical, so in the next step we provided an analytical model for execution time of
balanced parallel for-loops in terms of grain size and number of cores for the purpose of lo-
cating the range of grain sizes for minimum execution time. Some of the contributing factors
were ignored to simplify the model since it wouldn’t contribute to the located optimum range,
which was our main goal.
This model depends on two mostly architecture-specific parameters, the overhead of schedul-
ing tasks(α) and contention(σ). A simple parallel for-loop benchmark was developed to val-
idate the proposed model. We suggest that we can utilize this benchmark to find the model
parameters on a specific architecture by collecting a small set of data. Once the parameters
are identified, the analytical model could be used to estimate the optimum range of grain sizes
for any other balanced parallel for-loop application ran on the same machine.
The proposed approach was then applied to Blaze, a high performance C++ math library.
By defining grain size as the number of floating point operations we implemented a function
API to estimate the number of floating point operations for an expression at compile time.
This allows us to offer a runtime and compile-time solution to find the optimum range of
grain size on a specific architecture while evaluating a specific expression. The obtained re-
sults showed that the identified chunk size had improved the performance from the equal
case(where the same number of tasks as the number of cores are created) in more than half of
the cases.
Having tested and validated our proposed model, we extended a previously implemented
116
algorithm to apply the information extracted through the model in order to avoid the unnec-
essary scheduling overhead. We implemented a splittable executor with two modes, guided
and adaptive, in HPX. This executor would launch splittable tasks instead of a regular task,
which would generate more splittable tasks if enough parallelism is available. This helps to
avoid both unnecessary scheduling overhead and under-utilization of our resources. The re-
sults shows an improvement from the equal case, and also reaching an execution time within
10% of the minimum execution time in majority of the cases. In order to adjust the provided
solution to the running application, we used the optimum range identified using the analyti-
cal model as a threshold to stop splitting. The task splitting would stop if the grain size of the
created task gets smaller than the lower-bound of the identified range. This helped decreasing
the scheduling overhead of tasks for smaller problem sizes.
117
References
[1] Patricia Grubel, Hartmut Kaiser, Kevin Huck, and Jeanine Cook. Using intrinsic perfor-
mance counters to assess efficiency in task-based parallel applications. In 2016 IEEE In-
ternational Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages
1692–1701. IEEE, 2016.
[2] Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, and Dietmar Fey.
Hpx: A task based programming model in a global address space. In Proceedings of the
8th International Conference on Partitioned Global Address Space Programming Models,
page 6. ACM, 2014.
[3] J Bennett, R Clay, G Baker, M Gamell, D Hollman, S Knight, H Kolla, G Sjaardema, N Slat-
tengren, K Teranishi, et al. Asynchronous many-task runtime system analysis and assess-
ment for next generation platforms. US Department of Energy, Sandia National Labora-
tories Report, Rep. no. SAND2015-8312, 2015.
[4] R Clinton Whaley and Jack J Dongarra. Automatically tuned linear algebra software. In
SC’98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing, pages 38–38.
IEEE, 1998.
[5] Markus Puschel, José MF Moura, Jeremy R Johnson, David Padua, Manuela M Veloso,
Bryan W Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, et al. Spi-
ral: Code generation for dsp transforms. Proceedings of the IEEE, 93(2):232–275, 2005.
[6] Abhishek Kulkarni and Andrew Lumsdaine. A comparative study of asynchronous many-
tasking runtimes: Cilk, charm++, parallex and am++. arXiv preprint arXiv:1904.00518,
2019.
[7] Sandra Wienke, Paul Springer, Christian Terboven, and Dieter an Mey. Openacc—first
experiences with real-world applications. In European Conference on Parallel Processing,
pages 859–870. Springer, 2012.
[8] Aaftab Munshi. The opencl specification. In 2009 IEEE Hot Chips 21 Symposium (HCS),
pages 1–314. IEEE, 2009.
[9] CUDA Nvidia. Nvidia cuda c programming guide. Nvidia Corporation, 120(18):8, 2011.
[10] James Reinders. Intel threading building blocks: outfitting C++ for multi-core processor
parallelism. " O’Reilly Media, Inc.", 2007.
[11] Robert D Blumofe, Christopher F Joerg, Bradley C Kuszmaul, Charles E Leiserson, Keith H
Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of
parallel and distributed computing, 37(1):55–69, 1996.
[12] Leonardo Dagum and Ramesh Menon. Openmp: An industry-standard api for shared-
memory programming. Computing in Science & Engineering, (1):46–55, 1998.
118
[13] MPI Mpi. A message passing interface standard. International Journal of Supercomputer
Applications, 8(3/4):165–414, 1994.
[14] Laxmikant V Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented
system based on c++. In OOPSLA, volume 93, pages 91–108. Citeseer, 1993.
[15] Hartmut Kaiser, Maciek Brodowicz, and Thomas Sterling. Parallex an advanced parallel
execution model for scaling-impaired applications. In 2009 International Conference on
Parallel Processing Workshops, pages 394–401. IEEE, 2009.
[16] J Davison de St Germain, John McCorquodale, Steven G Parker, and Christopher R John-
son. Uintah: A massively parallel problem solving environment. In Proceedings the
Ninth International Symposium on High-Performance Distributed Computing, pages 33–
41. IEEE, 2000.
[17] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: Expressing lo-
cality and independence with logical regions. In SC’12: Proceedings of the International
Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–
11. IEEE, 2012.
[18] Hartmut Kaiser, Patrick Diehl, Adrian S Lemoine, Bryce Adelstein Lelbach, Parsa Amini,
Agustín Berge, John Biddiscombe, Steven R Brandt, Nikunj Gupta, Thomas Heller, et al.
Hpx-the c++ standard library for parallelism and concurrency. Journal of Open Source
Software, 5(53):2352, 2020.
[19] Guang R Gao, Thomas Sterling, Rick Stevens, Mark Hereld, and Weirong Zhu. Parallex:
A study of a new parallel computation model. In 2007 IEEE International Parallel and
Distributed Processing Symposium, pages 1–6. IEEE, 2007.
[20] Patricia Grubel, Hartmut Kaiser, Jeanine Cook, and Adrian Serio. The performance impli-
cation of task size for applications on the hpx runtime system. In 2015 IEEE International
Conference on Cluster Computing, pages 682–689. IEEE, 2015.
[21] Ananth Grama, Vipin Kumar, Anshul Gupta, and George Karypis. Introduction to parallel
computing. Pearson Education, 2003.
[22] Gene M Amdahl. Validity of the single processor approach to achieving large scale com-
puting capabilities. In Proceedings of the April 18-20, 1967, spring joint computer confer-
ence, pages 483–485. ACM, 1967.
[23] Neil J Gunther. What is guerrilla capacity planning? Springer, 2007.
[24] Neil J Gunther. A new interpretation of amdahl’s law and geometric scalability. arXiv
preprint cs/0210017, 2002.
[25] Neil J Gunther. The practical performance analyst. iuniverse. com inc. Lincoln, Nebraska,
2000.
119
[26] Florina M Ciorba, Christian Iwainsky, and Patrick Buder. Openmp loop scheduling revis-
ited: making a case for more schedules. In International Workshop on OpenMP, pages
21–36. Springer, 2018.
[27] Jie Liu, Vikram A Saletore, and Ted G Lewis. Safe self-scheduling: a parallel loop schedul-
ing scheme for shared-memory multiprocessors. International Journal of Parallel Pro-
gramming, 22(6):589–616, 1994.
[28] Teebu Philip. Increasing chunk size loop scheduling algorithms for data independent
loops. PhD thesis, Citeseer, 1995.
[29] Constantine D Polychronopoulos and David J Kuck. Guided self-scheduling: A prac-
tical scheduling scheme for parallel supercomputers. Ieee transactions on computers,
100(12):1425–1439, 1987.
[30] Ten H Tzen and Lionel M Ni. Trapezoid self-scheduling: A practical scheduling scheme
for parallel compilers. IEEE Transactions on parallel and distributed systems, 4(1):87–98,
1993.
[31] Susan Flynn Hummel, Edith Schonberg, and Lawrence E Flynn. Factoring: A method for
scheduling parallel loops. Communications of the ACM, 35(8):90–102, 1992.
[32] David J Lilja. Exploiting the parallelism available in loops. Computer, 27(2):13–26, 1994.
[33] Ali Mohammed, Ahmed Eleliemy, Florina M Ciorba, Franziska Kasielke, and Ioana Ban-
icescu. Experimental verification and analysis of dynamic loop scheduling in scientific
applications. In 2018 17th International Symposium on Parallel and Distributed Com-
puting (ISPDC), pages 141–148. IEEE, 2018.
[34] Robert D Blumofe and Charles E Leiserson. Scheduling multithreaded computations by
work stealing. Journal of the ACM (JACM), 46(5):720–748, 1999.
[35] Klaus Iglberger, Georg Hager, Jan Treibig, and Ulrich Rüde. Expression templates revis-
ited: a performance analysis of current methodologies. SIAM Journal on Scientific Com-
puting, 34(2):C42–C69, 2012.
[36] Todd Veldhuizen. Expression templates. C++ Report, 7(5):26–31, 1995.
[37] Blitz++ Library. http://www.oonumerics.org/blitz/.
[38] Jörg Walter and Mathias Koch. The boost ublas library, 2002.
[39] MTL4 Library. http://www.simunova.com/de/mtl4.
[40] Gaël Guennebaud, Benoit Jacob, et al. Eigen. URl: http://eigen. tuxfamily. org, 2010.
[41] Boost C++ Framework. https://www.boost.org.
[42] Gmm++ Library. http://getfem.org/gmm/.
120
[43] Conrad Sanderson and Ryan Curtin. Armadillo: a template-based c++ library for linear
algebra. Journal of Open Source Software, 1(2):26, 2016.
[44] Eigen Library. http://eigen.tuxfamily.org/.




[47] Intel Math Kernel Library. https://software.intel.com/en-us/mkl.
[48] Andreas Prell. Embracing explicit communication in work-stealing runtime systems. PhD
thesis, 2016.
[49] Alexandros Tzannes, George C Caragea, Uzi Vishkin, and Rajeev Barua. Lazy scheduling:
A runtime adaptive scheduler for declarative parallelism. ACM Transactions on Program-
ming Languages and Systems (TOPLAS), 36(3):1–51, 2014.
[50] David Padua, editor. TBB (Intel Threading Building Blocks), pages 2029–2029. Springer
US, Boston, MA, 2011.
[51] Arch Robison, Michael Voss, and Alexey Kukanov. Optimization via reflection on work
stealing in tbb. In 2008 IEEE International Symposium on Parallel and Distributed Pro-
cessing, pages 1–8. IEEE, 2008.
[52] Simplice Donfack, Laura Grigori, William D Gropp, and Vivek Kale. Hybrid static/dy-
namic scheduling for already optimized dense matrix factorization. In 2012 IEEE 26th
International Parallel and Distributed Processing Symposium, pages 496–507. IEEE, 2012.
[53] Zahra Khatami, Lukas Troska, Hartmut Kaiser, J Ramanujam, and Adrian Serio. Hpx
smart executors. In Proceedings of the Third International Workshop on Extreme Scale
Programming Models and Middleware, page 3. ACM, 2017.
[54] Gabriel Laberge, Shahrzad Shirzad, Patrick Diehl, Hartmut Kaiser, Serge Prudhomme,
and A Lemoine. Scheduling optimization of parallel linear algebra algorithms using su-
pervised learning. arXiv preprint arXiv:1909.03947, 2019.
[55] Ahmad Qawasmeh, Abid M Malik, and Barbara M Chapman. Adaptive openmp task
scheduling using runtime apis and machine learning. In 2015 IEEE 14th International
Conference on Machine Learning and Applications (ICMLA), pages 889–895. IEEE, 2015.
[56] Zheng Wang and Michael FP O’Boyle. Mapping parallelism to multi-cores: a machine
learning based approach. In ACM Sigplan notices, volume 44, pages 75–84. ACM, 2009.
[57] Peter Thoman, Herbert Jordan, and Thomas Fahringer. Adaptive granularity control in
task parallel programs using multiversioning. In European Conference on Parallel Pro-
cessing, pages 164–177. Springer, 2013.
121
[58] Dana Akhmetova, Gokcen Kestor, Roberto Gioiosa, Stefano Markidis, and Erwin Laure.
On the application task granularity and the interplay with the scheduling overhead in
many-core shared memory systems. In 2015 IEEE International Conference on Cluster
Computing, pages 428–437. IEEE, 2015.
[59] Bibek Wagle, Mohammad Alaul Haque Monil, Kevin Huck, Allen D Malony, Adrian Serio,
and Hartmut Kaiser. Runtime adaptive task inlining on asynchronous multitasking run-
time systems. In Proceedings of the 48th International Conference on Parallel Processing,
pages 1–10, 2019.
122
Appendix A. Implementation of the Splittable Task and Executor in HPX
Listing A.1: Implementation of splittable task in HPX
1 // Copyright ( c ) 2007−2020 Hartmut Kaiser
2 // Copyright ( c ) 2020 Shahrzad Shirzad
3 //
4 // SPDX−License−I d e n t i f i e r : BSL−1.0
5 // Distributed under the Boost Software License , Version 1 . 0 . ( See accompanying
6 // f i l e LICENSE_1_0 . t x t or copy at http : / /www. boost . org /LICENSE_1_0 . t x t )
7
8 # i fndef HPX_SPLITTABLE_TASK_HPP
9 #define HPX_SPLITTABLE_TASK_HPP
10 #include <hpx/ execution / d e t a i l / post_policy_dispatch . hpp>
11 #include <hpx/modules/ datastructures . hpp>
12 #include <hpx/modules/ execution . hpp>
13 #include <hpx/modules/ runtime_local . hpp>
14 #include <hpx/modules/ synchronization . hpp>
15 #include <hpx/ threading_base / thread_description . hpp>
16 #include <cstddef >
17 #include <cstdint >
18 #include < t y p e _ t r a i t s >
19 #include < u t i l i t y >
20 namespace hpx { namespace p a r a l l e l { namespace execution {
21 enum c l a s s splittable_mode
22 {
23 a l l = 0 ,
24 i d l e = 1 ,
25 } ;
26 i n l i n e char const * const get_splittable_mode_name ( splittable_mode mode)
27 {
28 s t a t i c constexpr char const * const splittable_mode_names [ ] = {
29 " a l l " , " i d l e " } ;
30 return splittable_mode_names [ s t a t i c _ c a s t <std : : s ize_t >(mode) ] ;
31 }
32 // / /// // /// // /// /// // /// /// // /// /// // /// /// // /// /// // /// /// // /// /// // /// ///
33 template <typename Executor , typename F>
34 s t r u c t s p l i t t a b l e _ t a s k
35 {
36 template <typename F_ , typename Shape>
37 s p l i t t a b l e _ t a s k ( Executor& exec , F_&& f , Shape const& elem ,
38 std : : s i z e _ t num_free , splittable_mode spl i t_type ,
39 std : : s i z e _ t min_task_size )
40 : s t a r t _ ( hpx : : u t i l : : get <0>(elem ) )
41 , stop_ ( hpx : : u t i l : : get <1>(elem ) )
42 , index_ ( hpx : : u t i l : : get <2>(elem ) )
43 , num_free_ ( num_free )
44 , f_ ( std : : forward<F_>( f ) )
45 , exec_ ( exec )
46 , s p l i t _ t y p e _ ( s p l i t _ t y p e )




51 void operator ( ) ( hpx : : latch * outer_latch = nul lptr )
52 {
53 i f ( s p l i t _ t y p e _ == splittable_mode : : i d l e )
54 {




59 c a l l ( ) ;
60 }
61
62 // no t i fy outer waiting task
63 i f ( outer_latch ! = nul lptr )
64 {





70 void c a l l _ i d l e ( )
71 {
72 auto mask = hpx : : threads : : get_idle_core_mask ( ) ;
73 num_free_ = hpx : : threads : : count (mask) ;
74
123
75 hpx : : latch l ( 2 ) ;
76
77 std : : s i z e _ t task_size =
78 std : : c e i l ( f l o a t ( stop_ − s t a r t _ ) / ( num_free_ + 1) ) ;
79 std : : s i z e _ t remainder = stop_ − s t a r t _ − task_siz e ;
80
81 hpx : : u t i l : : thread_description desc ( f_ ) ;
82
83 i f ( ( num_free_ ! = 0 && task_size > min_task_size_ ) && remainder > 0)
84 {
85 // s p l i t the current task , create one for the f i r s t i d e n t i f i e d i d l e core
86 for ( std : : s i z e _ t i = 0 , j = 0 ; i != 1 ; ++ j )
87 {
88 i f ( hpx : : threads : : t e s t (mask , j ) )
89 {
90 // pass schedule hint to place new task on empty core
91 using policy = hpx : : launch : : async_policy ;
92 d e t a i l : : post_policy_dispatch <policy > : : c a l l ( policy { } ,
93 desc , exec_ . g e t _ p r i o r i t y ( ) , exec_ . get_stacksize ( ) ,
94 hpx : : threads : : thread_schedule_hint ( std : : int16_t ( j ) ) ,
95 s p l i t t a b l e _ t a s k ( exec_ , f_ ,
96 hpx : : u t i l : : make_tuple (
97 stop_ − remainder , stop_ , index_ + i + 1) ,
98 num_free_ , spl i t_type_ , min_task_size_ ) ,
99 &l ) ;
100
101 stop_ = stop_ − remainder ;






108 l . count_down ( 1 ) ;
109 }
110
111 f_ ( hpx : : u t i l : : make_tuple ( s t a r t _ , stop_ − s t a r t _ , index_ ) ) ;
112
113 // wait for task scheduled above
114 l . arrive_and_wait ( 1 ) ;
115 }
116
117 void c a l l ( )
118 {
119 hpx : : latch l ( 2 ) ;
120
121 std : : s i z e _ t task_size =
122 std : : c e i l ( f l o a t ( stop_ − s t a r t _ ) / ( num_free_ + 1) ) ;
123 std : : s i z e _ t remainder = stop_ − s t a r t _ − task_siz e ;
124
125 i f ( ( num_free_ ! = 0 && task_size > min_task_size_ ) && remainder > 0)
126 {
127 // s p l i t the current task
128 exec_ . post ( s p l i t t a b l e _ t a s k ( exec_ , f_ ,
129 hpx : : u t i l : : make_tuple (
130 stop_ − remainder , stop_ , index_ + 1) ,
131 num_free_ − 1 , spl i t_type_ , min_task_size_ ) ,
132 &l ) ;




137 l . count_down ( 1 ) ;
138 }
139
140 f_ ( hpx : : u t i l : : make_tuple ( s t a r t _ , stop_ − s t a r t _ , index_ ) ) ;
141
142 // wait for task scheduled above




147 std : : s i z e _ t s t a r t _ ;
148 std : : s i z e _ t stop_ ;
149 std : : s i z e _ t index_ ;
150 std : : s i z e _ t num_free_ ;
151 F f_ ;
152 Executor& exec_ ;
153 splittable_mode s p l i t _ t y p e _ ;
124
154 std : : s i z e _ t min_task_size_ ;
155 char const * desc_ ;
156 } ;
157
158 template <typename Executor , typename F , typename Shape>
159 s p l i t t a b l e _ t a s k <Executor , typename std : : decay<F > : : type>
160 make_splittable_task ( Executor& exec , F&& f , Shape const& s ,
161 splittable_mode spl i t_type , std : : s i z e _ t min_task_size )
162 {
163 std : : s i z e _ t num_free = hpx : : get_os_thread_count ( ) − 1 ;
164 return s p l i t t a b l e _ t a s k <Executor , typename std : : decay<F > : : type >(
165 exec , std : : forward<F>( f ) , s , num_free , spl i t_type , min_task_size ) ;
166 }
167
168 } } } // namespace hpx : : p a r a l l e l : : execution
169 #endif // HPX_SPLITTABLE_TASK_HPP
125
Listing A.2: Implementation of splittable executor in HPX
1 // Copyright ( c ) 2007−2020 Hartmut Kaiser
2 // Copyright ( c ) 2020 Shahrzad Shirzad
3 //
4 // SPDX−License−I d e n t i f i e r : BSL−1.0
5 // Distributed under the Boost Software License , Version 1 . 0 . ( See accompanying
6 // f i l e LICENSE_1_0 . t x t or copy at http : / /www. boost . org /LICENSE_1_0 . t x t )
7
8 #include <hpx/ config . hpp>
9 #include <hpx/ execution / t r a i t s / is_executor . hpp>
10 #include <hpx/ executors / d e t a i l / s p l i t t a b l e _ t a s k . hpp>
11 #include <hpx/ include /async . hpp>
12 #include <hpx/modules/ executors . hpp>
13 #include <hpx/modules/ iterator_support . hpp>
14 #include <hpx/modules/ s e r i a l i z a t i o n . hpp>
15 #include <hpx/modules/ timing . hpp>
16 #include <algorithm >
17 #include <cstddef >
18 #include <cstdint >
19 #include < t y p e _ t r a i t s >
20
21 namespace hpx { namespace p a r a l l e l { namespace execution {
22 // / /// // /// // /// /// // /// /// // /// /// // /// /// // /// /// // /// /// // /// /// // /// ///
23 // / Loop i t e r a t i o n s are divided into pieces and then assigned to threads .
24 // / The number of loop i t e r a t i o n s combined i s determined based on
25 // / measurements of how long the execution of 1% of the o v e r a l l number of
26 // / i t e r a t i o n s takes .
27 // / This executor parameters type makes sure that as many loop i t e r a t i o n s
28 // / are combined as necessary to run for the amount of time speci f ied .
29 // /
30 s t r u c t s p l i t t a b l e _ e x e cu t o r
31 : paral lel_policy_executor <hpx : : launch : : async_policy >
32 {
33 using base_type = paral lel_policy_executor <hpx : : launch : : async_policy >;
34
35 public :
36 // / Construct an \a s p l i t ta b l e _ e x e c u t o r executor parameters object
37 // /
38 // / \note Default constructed \a s p l i t ta b l e _ e x e c u t o r executor parameter
39 // / types w i l l use 80 microseconds as the minimal time for which
40 // / any of the scheduled chunks should run .
41 // /
42 s p l i t t a b l e _ e x e cu t o r ( )
43 : s p l i t _ t y p e _ ( splittable_mode : : a l l )




48 // / Construct an \a s p l i t ta b l e _ e x e c u t o r executor parameters object
49 // /
50 // / \param rel_time [ in ] The time duration to use as the minimum
51 // / to decide how many loop i t e r a t i o n s should be
52 // / combined .
53
54 s p l i t t a b l e _ e x e cu t o r ( splittable_mode s p l i t _ t y p e )
55 : s p l i t _ t y p e _ ( s p l i t _ t y p e )
56 , min_task_size_ ( 0 )
57 {
58 i f ( s p l i t _ t y p e ! = splittable_mode : : a l l &&
59 s p l i t _ t y p e != splittable_mode : : i d l e &&
60 s p l i t _ t y p e != splittable_mode : : idle_mask &&
61 s p l i t _ t y p e != splittable_mode : : a l l _ mu l t i pl e_ ta s ks )
62 {
63 HPX_THROW_EXCEPTION( hpx : : bad_parameter ,
64 " s p l i t t a b l e _ e x e c u t o r : : s p l i t ta b l e _ e x e c u t o r " ,
65 "unknown type , type should be e ithe r a l l , idle , "




70 s p l i t t a b l e _ e x e cu t o r (
71 splittable_mode spl i t_type , std : : s i z e _ t min_task_size )
72 : s p l i t _ t y p e _ ( s p l i t _ t y p e )
73 , min_task_size_ ( min_task_size )
74 {
75 i f ( s p l i t _ t y p e ! = splittable_mode : : a l l &&
76 s p l i t _ t y p e != splittable_mode : : i d l e &&
77 s p l i t _ t y p e != splittable_mode : : idle_mask &&
78 s p l i t _ t y p e != splittable_mode : : a l l _ mu l t i pl e _ ta s ks )
126
79 {
80 HPX_THROW_EXCEPTION( hpx : : bad_parameter ,
81 " s p l i t t a b l e _ e x e c u t o r : : s p l i t ta b l e _ e x e c u t o r " ,
82 "unknown type , type should be e ithe r a l l , idle , "




87 // / \cond NOINTERNAL
88 // Estimate a chunk s i z e based on number of cores used .
89 template <typename Parameters , typename F>
90 s t a t i c std : : s i z e _ t get_chunk_size (
91 Parameters&&, F&&, std : : s ize_t , std : : s i z e _ t count )
92 {
93 return count ;
94 }
95
96 HPX_FORCEINLINE s t a t i c std : : s i z e _ t processing_units_count ( )
97 {
98 return hpx : : get_os_thread_count ( ) ;
99 }
100 // / \endcond
101
102 template <typename F , typename S , typename . . . Ts>
103 std : : vector <hpx : : future <
104 typename d e t a i l : : bulk_function_result <F , S , Ts . . . > : : type>>
105 bulk_async_execute (F&& f , S const& shape , Ts &&. . . t s )
106 {
107 std : : vector <hpx : : future <
108 typename d e t a i l : : bulk_function_result <F , S , Ts . . . > : : type>>
109 r e s u l t s ;
110 r e s u l t s . reserve ( hpx : : u t i l : : s i z e ( shape ) ) ;
111
112 for ( auto const& elem : shape )
113 {
114 r e s u l t s . push_back ( hpx : : async ( make_splittable_task (
115 s t a t i c _ c a s t <base_type&>(* t h i s ) , std : : forward<F>( f ) , elem ,
116 spl i t_type_ , min_task_size_ ) ) ) ;
117 }
118




123 fr iend c l a s s hpx : : s e r i a l i z a t i o n : : access ;
124 splittable_mode s p l i t _ t y p e _ ;
125 std : : s i z e _ t min_task_size_ ;
126 } ;
127
128 // / \cond NOINTERNAL
129 template <>




134 # i f HPX_VERSION_FULL < 0x010500
135 // workaround for older HPX versions
136 template <typename Param , typename Executor , typename F>
137 std : : s i z e _ t get_chunk_size (Param&& param , Executor&& exec , F&& f ,
138 std : : s i z e _ t core , std : : s i z e _ t count )
139 {
140 return count ;
141 }
142
143 template <typename Param , typename Executor >
144 HPX_FORCEINLINE s t a t i c std : : s i z e _ t processing_units_count (
145 Param&& params , Executor&& exec )
146 {
147 return hpx : : get_os_thread_count ( ) ;
148 }
149 #endif
150 // / \endcond
151 } } } // namespace hpx : : p a r a l l e l : : execution
127
Vita
Shahrzad Shirzad got her Bachelor of Science degree in Electrical Engineering from Sharif
University of Technology in Iran in 2006, and her Master’s degree in Biomedical Engineering
from K.N. Toosi University of Technology in Iran in 2009. After moving to Baton Rouge in 2013
she started her PhD studies in Computer Engineering. After getting her Master’s degree in En-
gineering Science, IT concentration, she joined the Ste||ar group at Center for Computation
and Technology at LSU, where she mostly worked on Phylanx, a distributed array processing
toolkit, under Dr. Kaiser’s supervision.
A.1. Publications
- Hasheminezhad, B., Shirzad, S., Wu, N., Diehl, P., Schulz, H., & Kaiser, H. (2020). Towards
a Scalable and Distributed Infrastructure for Deep Learning Applications. arXiv preprint
arXiv:2010.03012.
- Kaiser, H., Diehl, P., Lemoine, A. S., Lelbach, B. A., Amini, P., Berge, A., ... & Huck, K.
(2020). HPX-The C++ Standard Library for Parallelism and Concurrency. Journal of
Open Source Software, 5(53), 2352.
- Brandt, S. R., Bigelow, A., Sakin, S. A., Williams, K., Isaacs, K. E., Huck, K., ... & Kaiser, H.
(2020). JetLag: An Interactive, Asynchronous Array Computing Environment. In Prac-
tice and Experience in Advanced Research Computing (pp. 8-12).
- Zhang, T., Shirzad, S., Wagle, B., Lemoine, A. S., Diehl, P., Kaiser, H. (2020). Supporting
OpenMP 5.0 Tasks in hpxMP–A study of an OpenMP implementation within Task Based
Runtime Systems. arXiv preprint arXiv:2002.07970.
- Laberge, G., Shirzad, S., Diehl, P., Kaiser, H., Prudhomme, S., & Lemoine, A. S. (2019,
November). Scheduling Optimization of Parallel Linear Algebra Algorithms Using Su-
pervised Learning. In 2019 IEEE/ACM Workshop on Machine Learning in High Perfor-
mance Computing Environments (MLHPC) (pp. 31-43). IEEE.
- Zhang, T., Shirzad, S., Diehl, P., Tohid, R., Wei, W., Kaiser. H.. 2019. An Introduction to
hpxMP: A Modern OpenMP Implementation Leveraging HPX, An Asynchronous Many-
Task System. In Proceedings of the International Workshop on OpenCL (IWOCL’19).
128
ACM, New York, NY, USA, Article 13, 10 pages.
DOI: https://doi.org/10.1145/3318170.3318191.
- Tohid, R., Wagle, B., Shirzad, S., Diehl, P., Serio, A., Kheirkhahan, A., ... & Brandt, S. (2018,
November). Asynchronous execution of python code on task-based runtime systems.
In 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models
and Middleware (ESPM2) (pp. 37-45). IEEE.
129
