University of Alabama in Huntsville

LOUIS
Theses

UAH Electronic Theses and Dissertations

2022

A case for simple, low-cost autotuning heuristics for efficient
performance of irregular algorithms
Phillip Allen Lane

Follow this and additional works at: https://louis.uah.edu/uah-theses

Recommended Citation
Lane, Phillip Allen, "A case for simple, low-cost autotuning heuristics for efficient performance of irregular
algorithms" (2022). Theses. 384.
https://louis.uah.edu/uah-theses/384

This Thesis is brought to you for free and open access by the UAH Electronic Theses and Dissertations at LOUIS. It
has been accepted for inclusion in Theses by an authorized administrator of LOUIS.

A CASE FOR SIMPLE, LOW-COST AUTOTUNING
HEURISTICS FOR EFFICIENT PERFORMANCE OF
IRREGULAR ALGORITHMS

by

PHILLIP ALLEN LANE

A THESIS

Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
in
The Department of Computer Science
to
The School of Graduate Studies
of
The University of Alabama in Huntsville

HUNTSVILLE, ALABAMA
2022

In presenting this thesis in partial fulﬁllment of the requirements for a master’s degree
from The University of Alabama in Huntsville, I agree that the Library of this University
shall make it freely available for inspection. I further agree that permission for extensive
copying for scholarly purposes may be granted by my advisor or, in his/her absence, by
the Chair of the Department or the Dean of the School of Graduate Studies. It is also
understood that due recognition shall be given to me and to The University of Alabama
in Huntsville in any scholarly use which may be made of any material in this thesis.

October 10, 2022

Phillip Allen Lane

(date)

ii

THESIS APPROVAL FORM
Submitted by Phillip Allen Lane in partial fulﬁllment of the requirements for the degree
of Master of Science in Computer Science in Computer Science and accepted on behalf
of the Faculty of the School of Graduate Studies by the thesis committee.
We, the undersigned members of the Graduate Faculty of The University of Alabama
in Huntsville, certify that we have advised and/or supervised the candidate of the work
described in this thesis. We further certify that we have reviewed the thesis manuscript
and approve it in partial fulﬁllment of the requirements for the degree of Master of Science
in Computer Science in Computer Science.

10/13/22

Dr. Joshua Booth

(Date)

Committee Chair

10/18/2022
Dr. Huaming Zhang

(Date)

Dr. Mark Pekker

(Date)

Digitally signed by Mikel D. Petty
DN: cn=Mikel D. Petty, o=University of
Alabama in Huntsville, ou=Computer
Science, email=pettym@uah.edu, c=US
Date: 2022.10.24 09:57:15 -05'00'

2022-10-24

___________
Dr. Letha Etzkorn
(Date)
Dr. Mikel Petty, Acting Department Chair
Digitally signed by Rainer Steinwandt
Date: 2022.10.27 10:09:08 -05'00'

Dr. Rainer Steinwandt

(Date)
Digitally signed by Jon Hakkila
DN: cn=Jon Hakkila, o=UAH, ou=Associate Provost/
Graduate Dean, email=jon.hakkila@uah.edu, c=US
Date: 2022.11.29 11:06:55 -06'00'
Adobe Acrobat version: 2020.005.30418

Dr. Jon Hakkila

(Date)

iii

Department Chair

College Dean

Graduate Dean

School of Graduate Studies
The University of Alabama in Huntsville
Degree

Master of Science

College/Dept. Science/Computer Science

in Computer Science
Name of Candidate
Title

Phillip Allen Lane

A Case for Simple, Low-Cost Autotuning Heuristics
for Eﬃcient Performance of Irregular Algorithms
Irregular algorithms (i.e., algorithms that make irregular memory accesses) are

notorious for being diﬃcult to optimize on parallel architectures due to load imbalance
in the number of computational operations and memory access overheads. However, they
are critically important as they include algorithms for sparse linear algebra and graph algorithms. Common optimizations include schedulers, orderings, and data structures that
try to mitigate workload imbalance across computational units. Many current mitigations of this problem necessitate expert tuning by high-performance engineers. Methods
for autotuning, which do not require an expert, are developing a trend towards large
complex heuristics that are expensive in both needed memory and computation time.
This thesis presents a case for cheaper and simpler autotuning heuristics via case studies of sparse matrix-vector multiplication and loop scheduling, and demonstrates that
scaled-down heuristics can lower startup costs and still provide excellent performance.
Abstract Approval: Committee Chair

Dr. Joshua Booth

Digitally signed by Mikel D. Petty
DN: cn=Mikel D. Petty, o=University of
Alabama in Huntsville, ou=Computer
Science, email=pettym@uah.edu,
c=US
Date: 2022.10.24 10:02:11 -05'00'

Department Chair

___________
Dr. Letha Etzkorn
Dr. Mikel Petty, Acting Department Chair
Digitally signed by Jon Hakkila
DN: cn=Jon Hakkila, o=UAH, ou=Associate Provost/
Graduate Dean, email=jon.hakkila@uah.edu, c=US
Date: 2022.11.29 11:07:16 -06'00'
Adobe Acrobat version: 2020.005.30418

Graduate Dean

Dr. Jon Hakkila

iv

ACKNOWLEDGMENTS

This work was made possible in part by support from the Alabama Supercomputer Authority, Sandia National Laboratories, and NSF2044633. This work used the
Extreme Science and Engineering Discovery Environment (XSEDE), which is supported
by National Science Foundation grant number ACI-1548562. Most importantly, this work
would not have been made possible without my advisor and mentor, Dr. Joshua Booth.

v

To all of my professors and mentors who saw
something in me that I didn’t see in myself.

vi

TABLE OF CONTENTS

List of Figures

xi

List of Tables

xiii

Chapter
1 Introduction

1

1.1

Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

1.2

The Loop Scheduling (LS) Problem . . . . . . . . . . . . . . . . . . . . .

4

1.3

Sparse Matrix-Vector Multiplication (SpMV) . . . . . . . . . . . . . . . .

5

2 iCh for Irregular Loop Scheduling

7

2.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

2.2

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2.2.1

Loop Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2.2.2

Irregular Applications . . . . . . . . . . . . . . . . . . . . . . . .

12

2.2.2.1

Breadth-First Search . . . . . . . . . . . . . . . . . . . .

12

2.2.2.2

Sparse Matrix-Vector Multiplication . . . . . . . . . . .

13

2.2.2.3

Irregular Application Insight

. . . . . . . . . . . . . . .

15

2.3

Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

2.4

Adaptive Runtime Chunk for Irregular Applications . . . . . . . . . . . .

19

2.4.1

21

Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

vii

2.5

2.6

2.4.2

Local Adaption . . . . . . . . . . . . . . . . . . . . . . . . . . . .

21

2.4.3

Remote Work Stealing . . . . . . . . . . . . . . . . . . . . . . . .

26

Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

2.5.1

Test Applications . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

2.5.1.1

Synth (synth) . . . . . . . . . . . . . . . . . . . . . . .

28

2.5.1.2

Breadth-first search (BF) . . . . . . . . . . . . . . . . . .

29

2.5.1.3

K-Means (Kmeans) . . . . . . . . . . . . . . . . . . . . .

30

2.5.1.4

LavaMD (LavaMD) . . . . . . . . . . . . . . . . . . . . .

30

2.5.1.5

Sparse Matrix-Vector Multiplication (SpMV) . . . . . .

31

2.5.2

Self-Scheduling Methods . . . . . . . . . . . . . . . . . . . . . . .

32

2.5.3

Test system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

33

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

34

2.6.1

Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

34

2.6.1.1

Synth . . . . . . . . . . . . . . . . . . . . . . . . . . . .

34

2.6.1.2

Breadth-first search . . . . . . . . . . . . . . . . . . . .

36

2.6.1.3

K-Means . . . . . . . . . . . . . . . . . . . . . . . . . .

37

2.6.1.4

LavaMD . . . . . . . . . . . . . . . . . . . . . . . . . . .

38

2.6.1.5

Sparse Matrix-Vector Multiplication . . . . . . . . . . .

39

2.6.1.6

Insight from all applications . . . . . . . . . . . . . . . .

40

Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

41

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

44

2.6.2
2.7

3 Heterogeneous Sparse Matrix-Vector Multiplication

viii

47

3.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47

3.2

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

50

3.2.1

Formats for CPU . . . . . . . . . . . . . . . . . . . . . . . . . . .

51

3.2.2

CSR-k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

53

3.2.3

Formats for GPU . . . . . . . . . . . . . . . . . . . . . . . . . . .

55

3.2.4

Heterogeneous Formats . . . . . . . . . . . . . . . . . . . . . . . .

56

3.2.5

Autotuning SpMV . . . . . . . . . . . . . . . . . . . . . . . . . .

57

3.3

CSR-k for Exploiting Multidimensional Block Structure in NVIDIA GPUs 58
3.3.1

Banding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

64

Tuning Optimal Structure . . . . . . . . . . . . . . . . . . . . . . . . . .

64

3.4.1

GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

64

3.4.2

CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

69

Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

70

3.5.1

Test Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

70

3.5.2

Tested Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . .

71

3.5.3

Test Suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

72

3.5.4

Test Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . .

74

GPU Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

74

3.6.1

Banding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

78

3.7

CPU Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

79

3.8

Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

84

3.9

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

86

3.4

3.5

3.6

ix

4 Conclusion

87

References

89

x

LIST OF FIGURES

FIGURE

PAGE

2.1

Representations of irregular inputs. . . . . . . . . . . . . . . . . . . . . .

14

2.2

Time step analysis of iCh for 3 threads. . . . . . . . . . . . . . . . . . . .

20

2.3

Examples of distributions for synth and LavaMD. . . . . . . . . . . . . . .

29

2.4

Synth Speedup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

35

2.5

Speedup for Breadth-first Search (a) and K-Means (b). . . . . . . . . . .

37

2.6

Speedup for LavaMD and SpMV. . . . . . . . . . . . . . . . . . . . . . .

38

2.7

Sensitivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

42

3.1

An example roofline model for GPUs for the NVIDIA Ampere 100. . . .

48

3.2

Example of the CSR-3 data structure . . . . . . . . . . . . . . . . . . . .

52

3.3

Diagram of the NVIDIA GPU memory hierarchy. . . . . . . . . . . . . .

59

3.4

CUDA grid and block layout for GPU SpMV . . . . . . . . . . . . . . . .

61

3.5

Performance on Volta. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

76

3.6

Performance on Ampere. . . . . . . . . . . . . . . . . . . . . . . . . . . .

77

3.7

Banding analysis of KokkosKernels and CSR-k with miscellaneous banding
combinations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

79

3.8

Performance on Ice Lake. . . . . . . . . . . . . . . . . . . . . . . . . . . .

80

3.9

Performance on Rome. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

80

3.10 Scalability study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81

3.11 Constant super-row size on Rome, with SR fixed to 96. . . . . . . . . . .

83

xi

3.12 Storage overhead percentage of CSR-3 versus base CSR. . . . . . . . . .

xii

85

LIST OF TABLES

TABLE

PAGE

2.1

Input Graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

2.2

Scheduling Methods used to test. . . . . . . . . . . . . . . . . . . . . . .

32

3.1

All systems we use for testing. . . . . . . . . . . . . . . . . . . . . . . . .

71

3.2

Test suite of matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

73

xiii

CHAPTER 1

INTRODUCTION

Research is to see what everybody else has
seen, and to think what nobody else has
thought.
—Albert Szent-Gyorgyi
1.1

Motivation

Most standard algorithms are iterative, meaning they consist of nested for or
while loops. However, we can characterize the workload of these iterative algorithms as
either dense or irregular. In dense workloads, the amount of computation per iteration
is normally fixed, and memory accesses are made in a straightforward, strided pattern.
In contrast, irregular workloads tend to have a variable amount of computation per iteration, and the memory accesses tend to be more random. While many key kernels,
e.g., dense matrix-matrix multiplication, are of the first kind, many more interesting algorithms, e.g., many graph algorithms and sparse linear algebra, are irregular workloads.
As an example of an irregular algorithm, take for example, a circuit simulation. One
of the core kernels in a circuit simulation problem is sparse matrix-vector multiplication
(SpMV), the operation of computing y ← Ax, where A is a sparse matrix (a matrix
with mostly zeros) and x, y are dense vectors. SpMV is usually written with a two-level

1

nested loop. The outermost loop of SpMV, the one usually parallelized when considering
a CPU implementation, iterates across rows of the matrix. Each iteration must then
further iterate through the nonzeros of a matrix, which may vary from row to row. In
a highly irregular matrix, certain denser regions may take far more computational time
than other, sparser regions. Therefore, each iteration of the outermost loop may take
varying amounts of time, meaning that if trivial parallelization is used (e.g. distribute
iterations of rows evenly amongst cores), there may be a workload imbalance. Alternatively, consider an N -body simulation computing interactions between particles. If
parallelized, each iteration will be responsible for a region of space in which particles
may reside. If particles are clustered in one or two regions, e.g., grouped by gravitational
forces, then the threads responsible for those iterations may take much longer to execute
a time-step than iterations responsible for few or no particles.
In general, these irregular workloads are an NP-hard problem to schedule when
the objective is to minimize execution time [1,2]. Therefore, approaches which are able to
efficiently approximate a good solution are important. Many such approaches have been
explored in literature [3–9], but all are approximations of an optimal solution. Certain
ahead-of-time approaches have been explored, such as workload-aware self scheduling [3]
and workload-balancing data structures [6]. Alternatively, many runtime approaches have
been explored, which dynamically adapt to the workload, such as OpenMP’s dynamic or
guided scheduling.
Equally as important as the theoretical approximation of this difficult optimization
problem is an efficient, hardware-aware implementation. Many times, a base implementation exists for a given problem but requires specific tuning by the user to obtain maximal
2

performance, such as the formerly widely used GotoBLAS library [10, 11]. Despite these
users being high-performance engineers themselves, most find no pleasure in tuning, as
it is a time-consuming and tedious process. Therefore, many ideas for autotuning have
been explored. For instance, related to SpMV, some recent work showcases deep neural
networks for format selection [12], as certain matrices prefer certain storage formats with
little clear correlation. However, the cited work discloses a 75-hour training time that
must be adapted to each new system. Such a costly training time is difficult or impossible
to amortize.

I believe that simpler, scaled-down heuristics are still very capable of good performance and I present evidence of this in this thesis.

To support this claim, the evidence includes the following. In chapter 2, I explore
cheap runtime tuning for a novel loop self -scheduler called iCh, which has been published
in Concurrency and Computation: Practice and Experience [4]. Self-schedulers are schedulers where the scheduling of work items onto workers is done by the application, and not
by the operating system. Many self-schedulers like OpenMP dynamic require optimal
“chunk size” to perform well. Optimal chunk sizes vary depending from application to
application, or even loop to loop. This is a highly empirical and experimental process
that often requires simply testing a wide array of chunk sizes. Too high and you get
workload imbalance; too low and you get scheduler overhead. iCh is capable of adapting
its own chunk size based upon simple ratios, which are computationally cheap. A followup to that work is in progress which expands iCh to NUMA-complex CPUs, though it
3

is not included in this thesis. iCh takes the idea of scaled-down, cheap heuristics and
shows good performance on a wide array of irregular applications. This is in opposition
to other novel OpenMP schedulers like BinLPT [3] which require the workload patterns
to be known ahead-of-time and performs complex and expensive calculations to achieve
workload balancing.
In chapter 3, I explore constant-time tuning of the CSR-k data structure for GPU
computation of SpMV. The process is very cheap: a small set of representative matrices
are measured against a set of tuning parameters, and the optimal parameters are run
through a logarithmic regression to obtain a closed-form heuristic that only needs the
matrix row density. With this O(1) tuning, we are able to outperform leading SpMV
implementations on GPUs by 17 − 19%.
The next two sections give a brief introduction to each of these two topics in more
detail.

1.2

The Loop Scheduling Problem

Large multicore systems dominate parallel computing platforms. One common
technique to improve the performance of an application on these systems is loop-level
parallelism, and the popularity of this loop-level parallelism is seen in the number of
built-in parallel-for (par for) implementations in various languages and packages [5, 7].
Therefore, the scalability of these applications depends on scheduling loop iterations
among threads so that the workload, i.e., the amount of computations and memory
accesses, is balanced. This problem is commonly referred to as the loop scheduling
(LS) problem. When the workload is uniformly distributed among the loop iterations,
4

the issue of scheduling iterations becomes fairly trivial. However, irregular applications,
which are common in high-performance computing (HPC) (e.g., sparse linear algebra
and graph algorithms) [13–16], do not have uniform workload distributions, and thus
the assignment of iterations to cores or threads becomes extremely difficult. In fact,
distributing iterations with a nonuniform workload is an NP-hard problem [1]. The
common solution to this problem is self-scheduling, i.e., the threads and cores determine
the task size or the number of iterations themselves rather than by the operating system
or a global control unit. Within this solution exists numerous methods and heuristics [3,
9, 17–19] requiring fine-tuning of parameters and expert knowledge. The scalability of
even a simple par for on nonuniformly distributed workloads depends on selecting the
best self-scheduling method and fine-tuning of parameters, such as chunk size. If the
selection and tuning are not done, the performance may be magnitudes different from
the tuned solutions or may not speed up performance at all.

1.3

Sparse Matrix-Vector Multiplication

Sparse matrix-vector multiplication (SpMV) has remained one of the most important kernels in HPC due to its use in applications such as machine learning and iterative
solvers (e.g., Conjugate Gradient (CG) and GMRES for partial differential equations
(PDEs)) [20]. Modern systems are composed of numerous heterogeneous compute units,
e.g., many-core CPUs, GPUs, and FPGAs. Each of these compute units have different
microarchitecture properties that influence how they access and process data. Due to
SpMV’s importance and shift in system design, a need exists for a heterogeneous format, i.e., a data structure that can be efficiently utilized by different compute units, for
5

SpMV. This SpMV format needs to exist so that it requires only minor adjustments and
minimal storage overheads, so as to not incur unnecessary and further demand on the
memory subsystem. Overall, the problem of constructing such a format is difficult. Even
on many-core systems, optimizing the performance of SpMV requires tuning given the
sparse matrix, microarchitecture of the hardware, and the number of SpMV operations
the application will execute [6,21,22]. Optimizing on accelerators, such as GPUs, can be
even more difficult and many of the optimization choices utilized for many-core systems,
such as blocking and base format, are completely different.

6

CHAPTER 2

ICH FOR IRREGULAR LOOP SCHEDULING

2.1

Introduction

This work was published in Concurrency and Computation: Practice and Experience in 2022 with my advisor, Dr. Joshua Booth, as the first author and me as the
second author. My contributions included cleaning up the initial code written by my
advisor in libGOMP, running the test suite, analysis of data, and helping with writing
and figures.
This work presents a new self-scheduling method that achieves good scalable performance across a wide range of applications with various loop workload distributions,
and thus, this work removes the importance of self-scheduling method selection with minimal fine-tuning. In general, self-scheduling methods can be classified as: on-demand (i.e.,
those that assign tasks at runtime to improve imbalance), chunk size tuning (i.e., those
that break tasks into chunks that will minimize scheduling overheads), or workload aware
(i.e., those that use the known workload distribution to assign tasks to threads). Methods in each classification have unique benefits related to the classification that matches
different applications and loop workloads, but none work equally well on all applications
and workloads. The self-scheduling method in this work bridges the shortcomings of
7

methods in each of these classifications by combining design techniques from the first
two to produce one self-scheduling method that does well for an array of different applications and does not need to be explicitly informed of the workload distribution like
workload aware. As such, this method saves time in both implementations (i.e., not
having to precalculate workload distribution for the scheduling method) and tuning (i.e.,
other scheduling methods do not need to be considered).
In order to achieve this flexibility, this new method uses distributed work queues
and work-stealing [7, 23] from on-demand methods to balance workloads, uses a newly
developed adaptive chunk size tuning to reduce scheduling overheads of both per-chunk
assignment and locks while work-stealing, and uses a new inexpensive calculation of the
distribution of iteration throughput during runtime to update the per-thread chunk size
and inform stealing threads about local queues. As this method depends on automatically
adapting a per-thread chunk size for the irregular application, we name the method iCh
(i rregular Chunk). This work is the first to use this form of adaptive chunk size, and it
is the first to apply this type of technique to improve work stealing based self-scheduling.
This thesis aims to present iCh as a low-cost and efficient loop self-scheduler, by
giving the following contributions:
• A brief introduction to LS is given in Section 2.2, along with an introduction to
workload distribution patterns found in irregular applications
• A review of related work for LS and their connection to iCh is provided in Section 2.3
• An overview of the algorithm and mechanics used by iCh in Section 2.4
8

• A comprehensive evaluation of iCh across several different benchmarks in Sections 2.5 and 2.6

2.2

Background

This section describes the background related to loop scheduling (i.e., how the
problem is defined and common methods), and irregular applications (i.e., the characteristics of irregular applications that make them difficult to schedule and insight that
motivated iCh).

2.2.1

Loop Scheduling
The LS problem requires scheduling a set of n independent iterations, xi where

i ∈ {1, . . . , n}, onto p threads, tj where j ∈ {1, . . . , p}, in a manner that minimizes
the total parallel execution time. Though this problem is NP-hard [1], the importance
of the problem has inspired numerous self-scheduling methods built around heuristics,
theoretical insight, and anecdotal evidence. Most heuristics and anecdotal evidence make
some assumptions about the workload distribution. For example, many methods assume
that the iteration workload does not vary much or assume workloads are pulled from a
normal distribution [24]. Additionally, they focus on improving overall load balance while
minimizing overheads related to utilizing queues or bookkeeping variables that estimate
balance or chunk size. Many self-scheduling methods try to keep the method itself as
simple as possible to not increase overheads, and will normally chunk loop iterations
linearly (i.e., loop iteration xi+1 will usually be assigned to the same thread as loop
iteration xi ) in order to aid memory reuse and locality. Most of the self-scheduling
9

methods built around theoretical insight stem from work-stealing. In work-stealing, each
thread works on its portion of the loop iterations, and when a thread runs out of work it
randomly steals one-half of the remaining tasks from some other thread (called the victim)
that has iterations remaining. Work-stealing is shown to be a 2-approximation [1] of the
optimum solution and is very popular in task-based systems like CILK [7]. Despite being
popular, many self-scheduling systems do not rely on work-stealing because of issues
related to the overheads in stealing and preserving locality in memory accesses [23].
However, many of the self-scheduling methods being implemented in the KMP (i.e., a
popular OpenMP library that many compilers utilize, such as Clang and Intel) 1 OpenMP
library of LLVM 2 have reserved flags for possible work-stealing implementations. In more
detail, the three approaches (on-demand, chunk size tuning, and workload aware) from
the previous section can be further classified in the following manner.
• Pure Dynamic Self-Scheduling and Chunk Self-Scheduling assign iterations to threads
from a central, shared queue in unit-sized chunks that do not change during execution [25]. A small chunk size allows for better load balance as iterations are assigned
at a finer granularity but comes with a high assignment overhead. A larger chunk
size assigns iterations with a more coarse granularity leading to worse load balancing but comes with a much lower assignment overhead. Most of these methods are
workload-unaware, but expert knowledge of the application, input, and system can
help tune the chunk sizes, thus making it workload-aware. Additionally, when the
chunk size is fine-tuned, the application can achieve near-optimal load balancing.
1
2

https://github.com/llvm-mirror/openmp/blob/master/runtime/src/kmp.h
https://llvm.org/

10

An example of this is the dynamic scheduling already implemented into OpenMP
(dynamic) and is commonly used in irregular applications [14, 16].
• Guided Self-Scheduling assigns chunks to threads from a central queue, much like
Chunk Self-Scheduling, but dynamically changes chunk size at runtime. The size
change is normally only based on the remaining number of iterations and threads
making the method semi-workload-learning, which is sometimes called the Load
Imbalance Amortization Principle [8].
• Factoring Self-Scheduling determines the number of iterations to be assigned not
via a chunk size, but by a fraction of the number of remaining iterations. Both
Guided Self-Scheduling and Factoring Self-Scheduling work well when loop iteration
workloads are normally or uniformly randomly distributed, and it has an almost
averaging effect of workloads in these situations (i.e., they can use the amortization
principle). An example of a commonly used method related to these last two is
guided in OpenMP.
• Workload-Aware Self-Scheduling uses expert information, i.e., known information
about the workload distribution that must be precalculated, to distribute the iterations among threads [3]. A second-level method, such as work stealing, is commonly
used to deal with variations caused by the system, such as frequency scaling and
memory access latency.
• History-Based Self-Scheduling tries to build up information about the workload
during execution [9]. In practice, most require a nested-loop structure, where the

11

outermost loop serially executes the parallel innermost loop. During the first few
iterations of the outermost loop, a history is built that helps distribute the parallel
iterations. However, if the time does not exist to train or the workload changes for
the innermost loop, then the method may have difficulties achieving good speedups.

2.2.2

Irregular Applications
Many common HPC applications have subsections of code that make irregular

accesses to memory in parallel, such as methods that deal with sparse matrices or graph
algorithms. These types of applications are commonly called irregular applications or
Type-I applications [26]. Some common examples include graph mining and web crawls.
However, these applications require a great deal of tuning based on both the computer
system and algorithm input to perform optimally. Many different programming models are used to implement these kernels, but one of the most common is a fork-join
model. Additionally, many of these kernels are memory-bound even when optimally programmed [13]. This means that many memory requests are already waiting to be fulfilled,
and additional requests will have high latency on an already busy memory system. In
order to explore this, we will look at two important irregular applications: Breadth-First
Search and Sparse Matrix-Vector Multiplication.

2.2.2.1

Breadth-First Search

The breadth-first search application is commonly used inside many graph-analytic
metrics, such as building spanning trees or calculating betweenness of centrality. Additionally, many search-space problems that are solved using branching-and-bounding will

12

explore the tree in a breadth-first search manner, and many scheduling algorithms that
rely on task-levels for thread assignment and synchronization points explore tasks in a
breadth-first search manner [27]. As such, the computation pattern of breadth-first search
is critical to much of HPC. However, the execution of breadth-first search is highly dependent on the graph input, as seen in Section 2.6. As such, selecting the right self-scheduling
method would be difficult, and built-in scheduler methods have been constructed for several HPC breadth-first search implementations that are application, hardware, and input
specific [14, 27].

2.2.2.2

Sparse Matrix-Vector Multiplication

The SpMV application is a highly studied and optimized kernel due to its importance in many applications [16, 28–30]. However, the irregular structure of the sparse
coefficient matrix makes this difficult. If a one-dimensional layout is applied, the smallest
task of work is the multiplications of all nonzeros in a matrix row by the corresponding
vector entries that are summed together at the end. Figure 2.1a presents the nonzero
structure (i.e., blue representing nonzero entries and white representing zero entries) of
the input matrix arabic-2005, which represents the web crawl of sites written in Arabic, in its natural order. The natural order is the one provided by the input file, and
many times this ordering has some relationship to how elements are normally processed
or the layout of the system. From afar, a static assignment of rows may seem like a
logical choice. To investigate, we bin rows based on nonzero counts with increments
of 50 together, such that the first bin counts the rows with 0-49 nonzero elements and
the second bin counts the total number of rows with 50-99 nonzero elements. In Fig-

13

(a) arabic-2005 in
natural ordering

(b) arabic-2005 in
RCM ordering

(c) Number of rows binned together
based on nonzero count in increments
of 50 for arabic-2005 (y-axis in logscale)

Figure 2.1: Representations of irregular inputs. Figure (a) presents the sparsity pattern
of arabic-2005 in its natural ordering. The blue represents nonzero elements and white
represents zero elements in the matrix. The natural ordering is the order the comes
with the matrix. Figure (b) presents the sparsity pattern of arabic-2005 in its RCM
ordering. The RCM ordering is commonly used in sparse linear algebra to permute
nonzero elements towards the diagonal in order to improve cache performance. Figure (c)
provides the distribution of nonzero elements in the rows of the arabic-2005 matrix. In
order to make the chart more readable, the distribution has been binned into increments
of 50 (e.g., the first dot represents the number of rows with nonzero counts being between
0 and 49).

ure 2.1c, we provide the tally of the number of rows in each bin (provided in logarithmic
scale) for the first 50 bins. For example, the first dot shows that there exists close to
21,000,000 rows (e.g., ∼ 95% of rows in the matrix) with nonzero counts between 0 and
49. We note how much variation exists (σ 2 = 3.0 × 105 ) and the imbalance of work,
such as the last two dots representing bin 49 and 50. Additionally, matrices are often
preordered based on application to provide some structure, such as permuting nonzero
elements towards the diagonal or into a block structure. One such common permutation is the reverse Cuthill-McKee (RCM) [31]. This little structure can provide some
benefits to hand-tuned codes [16, 29, 32] that can use the newly found structure to better load balance work. However, Figure 2.1b shows that this could make balancing even

14

harder if rows were assigned linearly. Though orderings like RCM may improve execution
time [14, 16, 31], these orderings may make tuning for chunk size more important.

2.2.2.3

Irregular Application Insight

Despite the irregular nature based on input, there does exist some local structure
normally. For example, rows or subblocks within a matrix that have a large number
of nonzero elements could be grouped based on some ordering. The same can also be
applied to graph algorithms like breadth-first search. Even if the given input does not
come with this structure, the input can be permuted to have it. This type of permutation
or reordering is commonly done to improve performance [14, 16, 29, 30]. Therefore, a
thread could, in fact, adapt its own chunk size to fit the local task length. Moreover, if
a thread is finished with its own work, it could be intelligent in stealing work based on
the workload of others. This does require some computation overheads such as keeping
track of workload and communicating this workload to nearby neighbors. Since most of
these applications are memory-bound, there does exist a certain amount of availability of
computational resources and time during computing. In particular, Aupy, Benoit, Dai,
et al. [33] show that through intelligent co-scheduling of applications performance can be
seen not to degrade due to cache overuse, and total performance (i.e., the time to execute
all applications) can be improved. This observation allows us to do some small amount
of intelligent record keeping and make smart decisions, and also makes auto-tuning for
chunk size even more important (e.g., co-scheduling multiple applications).

15

2.3

Related Work

Subramanian and Eager [24] introduce an affinity loop scheduler for unbalanced
workloads for a series of parallel loops if “execution time of any particular iteration does
not vary widely from one execution of the loop to the next”. Their work uses multiple
loop executions to learn to balance the workload, and later work builds around this
fundamental work. However, iCh tries to forge in the new direction of not using multiple
loops and recognizes the variable nature of modern systems. Work by Yan, Jin, and
Zhang [34] adds a dynamic history to decide about load-balance on distributed sharedmemory systems. The adaptive algorithm in iCh is inspired by their work. They both
use running sums of iterations and increase chunk size by a factor of 2. However, they
attempt to optimize for load-balance, and most of their classification end up being the
opposite of those in iCh (i.e., because iCh tries to optimize for differences in iteration
throughput that might affect stealing). Their work considers how to keep and update
a history on distributed shared-memory systems, such as KSR-1 and Convex up to 16
CPUs, that could have high delays and a different memory architecture compared to
today’s modern systems. They test their method on the applications of Jacobi iteration
method, transitive closure, and successive over-relaxation, matrix-matrix multiplication,
and adjoint convolution. However, both Jacobi iteration method and successive overrelaxation reduce to SpMV, and the input sparse matrix they use is a banded matrix
with a somewhat regular structure. Yan does not compare against any modern selfscheduling method but demonstrates low overhead and about a 4× ∼ 5.5× speedup
using their method on 16 CPUs.

16

Additionally, more modern irregular loop methods are studied [3, 9, 17, 35]. Banicescu, Velusamy, and Davaprasd [17] schedule loops of irregular applications in a distributed fashion with MPI. The chunk size is determined by a direct fraction (i.e., a
factoring self-scheduling method) of the cumulative number of tasks complete and the
processor speed, called Adaptive Weighted Factoring (AWF). However, AWF suffers from
needing to know the processor speed, and chunk size is updated based on its history rather
than by a fixed amount in iCh. To evaluate AWF, they use a Laplace Equation Solver
using Jacobi iterations and an N-Body Simulation and show up to a 45% improvement
over standard static scheduling. Wang, Ji, Shi et al. [35] introduce the KASS system
that considers adaptively chunking together in the initialization phase using some prior
knowledge (i.e., workload-aware self-scheduling). Chunks in the second phase (i.e., the
dynamic or adaption phase) are reduced fixedly based on information from past iteration
runs but are not adapted (i.e., chunk size can not increase, reduce, or stay the same)
within an iteration, unlike the chunk size in iCh. Chunks are stolen if a queue runs out of
its own work. To evaluate KASS, they use in-house versions of successive over-relaxation,
Jacobi iteration method, and transitive closure. Like the work by Yan, both Jacobi iteration method and successive over-relaxation reduce to SpMV, and the input sparse
matrix they use is a banded matrix with a somewhat regular structure. However, KASS
is only able to obtain 21% improvement to their baseline of guided self-scheduling (i.e., a
method similar to OpenMP guided ). Kejariwal and Nicolau [9] introduce a history-aware
self-scheduling method, called HSS, that changes chunk size from past iterations and the
number of times the task will run using a complex and expensive “best-fit” approximation fit. Compared to HSS, the iCh method uses a simple and cheap method to adapt
17

the chunk size. HSS is evaluated on irregular loops extracted from the Standard Performance Evaluation Corporation (SPEC) Benchmarks

3

and is tested against AWF. Using

a simulator (i.e., not a real system), HSS outperformed the baseline by up to 18%. Their
method does demonstrate benefits for loops that are repeated, but iCh does not consider this as the loops may not vary within an irregular application. Lastly, BinLPT [3]
schedules irregular tasks from a loop using an estimate of the work in each loop and a
maximal number of chunks provided by the user. The method uses this information to
distribute chunks of iterations to local queues that are theoretically balanced. If there is
unbalance a simple chunk self-scheduling method is used to balance and provides little
scheduling overhead. A multiple loop analysis has been added to BinLPT so that it may
better distribute loop iterations when a loop is repeatedly called. This method is one
of the newest and provides good performance in publication. In evaluating BinLPT, a
synthetic benchmark 4 , SpMV, minimal spanning tree built around Prim’s algorithm,
and an N-Body simulation for computational fluid dynamics are used. BinLPT compares against in-house versions of KASS and HSS. On relatively few threads (e.g., < 48),
the performance of BinLPT is about as good as guided . However, BinLPT is able to
outperform guided when using more threads. As such, iCh is tested against BinLPT.
In contrast to BinLPT, the iCh method provides an easier choice that does not require
estimates of loop work and more user input. No other self-scheduler work directly attacks
the problem of providing a method that works well on all irregular applications.
3
4

Available at www.spec.org
Available at www.github.com/lapese/libgomp-benchmarks

18

2.4

Adaptive Runtime Chunk for Irregular Applications

This section provides details related to the algorithms and data structures used by
iCh. The iCh method uses work-stealing as its primary load-balancing mechanic. Despite
the theoretical ability of work-stealing to achieve good load balance, most work-stealing
methods require highly optimized queues, and many implementations utilize multiple
queues mapped to either processing cores or non-uniform memory access (NUMA) regions [18, 19, 23]. These distributed queues help to reduce memory contention when
multiple threads try to access a queue simultaneously. However, the same contention
can happen with distributed queues when the initial partitioning of the workload is uneven. The iCh method makes this observation about contention at distribution iteration
queues and tries to alleviate issues with an adaptive chunk size for each queue. Therefore,
our new method extends the traditional work-stealing method that uses fixed chunk size
by adding an adaptive algorithm for chunk size. The adaptive algorithm for chunk size
allows for a large enough chunk size to limit the number of times a thread has to access
its local queue to dispatch the next active chunk of work but makes the chunk size small
enough so that other threads can steal from its local queue without failure because of
iterations already being assigned to an active chunk. As such, iCh tries to adjust the
chunk size based on a running estimate of the distribution of iteration throughput (i.e.,
informally, the width that observations span away from the mean). The following subsections detail the above based on three phases of execution: initialization, local adaption,
and remote work-stealing. Figure 2.2 provides a visual overview iCh used in the following
subsections.

19

Figure 2.2: Time step analysis of iCh for 3 threads. Each time label is a time step in
the scheduler where new chunks of iterations are issued. To the right of each time, each
thread contains a local queue that is represented by blocks to the right of the thread
number. Each block represents a loop iteration, and each number in the block represents
the amount of work in that iteration. The blocks shaded yellow are actively being worked
on by that thread’s chunk. We note that the starting workload balance is: Thread 0 =
18, Thread 1 = 16, and Thread 2 = 12. The variable k keeps track of the number of
iterations successfully executed. A thread that finishes at a given time step will make a
decision about its own chunk size relative to a range around the mean (µ). This range
(i.e., µ − δ ≤ µ ≤ µ + δ where δ is an estimate of standard deviation) provides insight
into the common variance in workload seen by all threads, and a thread can judge if its
own iteration count is high, low, or normal compared to this range. A decision to steal
will result if the local queue is out of work. A thread that makes this decision during a
time step provides its decision to the right of its own k value in the figure. The running
range that is used for the decision is provided to the right of each time step. Time-Breaks
are provided between Time=8 and Time=12 indicating that other time steps exist where
threads will finish and make decisions, but these time steps are not included in the figure.

20

2.4.1

Initialization
Standard methods like dynamic scheduling in libgomp use a centralized queue

and a single chunk size for all threads, but do not scale well with the number of tasks
and threads needed to service modern many-core systems. Therefore, iCh uses local
queues constructed for each thread that we denote as qi where i ∈ {0, 1, . . . , p − 1} is the
thread ID for p threads. The local structure is memory aligned and is allocated using a
first-touch allocation policy containing a pointer to the local queue, local counter (ki ),
and a variable used to calculate chunk size (di ). The tasks are evenly distributed to
tasking queues such that |qi | = n/p where n is the total number of tasks. Additionally,
ki = 0 and di = p such that the initial chunk size is n/p2 , i.e., chunk size = |qi |/di . The
rationale for this choice is that the scheduler wants to allow a chunk size small enough
that p − 1 threads could steal from the queue later. Moreover, the chunk size becomes
smaller as p increases and allows for the variation of tasks that come with more threads.
Figure 2.2 shows the initial setup of iCh at Time=0. We note the iterations, i.e., the
blocks, are evenly distributed to the local queues, i.e., the list of blocks to the right of
the thread name. Moreover, the initial chunk size is set to 3 ≈ n/p2 as shown by the
blocks colored yellow.

2.4.2

Local Adaption
In traditional work-stealing methods [7, 23], the chunk size is fixed, and any load

imbalance is mitigated through work-stealing once all the tasks in the initial queue have
been executed. However, a thread can only steal work that is not already being actively

21

processed, i.e., not in the active chunk of the victim. Therefore, making the chunk size
too large to start will result in a load imbalance that the self-scheduling method may
not be able to recover from using work-stealing. Additionally, making the chunk size too
small would result in added overhead and possibly more time to converge.
The iCh method improves upon traditional work-stealing by locally adapting
chunk size to better fit the running distribution of iterations completed across threads.
In particular, a thread is classified in regards to its computational load relative to other
threads as

low: ki <

p−1
X

kj /p − δ,

(2.1)

j=0
p−1

normal:

X

kj /p − δ ≤ ki ≤

j=0

high: ki >

p−1
X

kj /p + δ,

(2.2)

j=0
p−1
X

kj /p + δ.

(2.3)

j=0

This classification into the distribution is based off the standard notation of intervals
around the mean (µ), i.e.,
µ−δ ≤µ≤µ+δ

where δ is some multiple of the standard deviation (σ). In our case, µ =

(2.4)
Pp−1
j=0

kj /p,

which is the mean number of iterations complete per thread or otherwise called mean
iteration throughput.
However, the question of selecting δ is of great importance as this will try to
capture the standard deviation of the iteration throughput that exists in the irregular

22

application. For instance, µ ± 2σ would provide an interval of two standard deviations
around the mean, and 95% of the iterations per thread should be in this interval if
they follow a normal distribution. This variance or standard deviation is important as
iterations may vary greatly in the number of float-point operations and memory requests.
Additionally, a single computational core that is mapped to a local thread can vary in
voltage, frequency, and memory bandwidth due to load on the system [36]. As such, even
if the workload was uniform among threads, the difference in execution rates between
threads may be significantly different. Therefore, irregular applications rarely have a
normal distribution of iterations per thread. Additionally, though δ is normally a multiple
of standard deviation, i.e.,
sP
(ki − µ)2
,
σ=
p

(2.5)

the standard calculation would require previous knowledge of the workload. Though
workload-aware scheduling makes use of previous knowledge, information related to voltage and frequency could change from run to run. A running approximation exists [37]:

µt+1 = µt + (kt − µt )/p

(2.6)

2
σt+1
= σt2 + (kt − µt )(kt − µt+1 ).

(2.7)

where t represents the time step. However, keeping past values of standard deviations
and means would be expensive for a lightweight loop scheduler. As such, iCh estimates

23

this value with a fractional multiplier (ϵ) of the running mean, i.e.,

δ=ϵ

p−1
X

kj /p.

(2.8)

j=0

The idea is that variation or standard deviation would be a multiple of the running
mean, and this multiplier would provide a way to either tighten (i.e., make the interval
smaller) or loosen (i.e., make the interval larger) for the user. Additionally, δ will grow
with the number of iterations completed. As a result of this relationship between δ and
the number of completed iterations, iCh is more likely to only adapt chunk size in the
beginning due to extremely large variance in the number of completed iterations, and
the possibility of adapting due to smaller variance increases with execution. This makes
sense as iCh would have a better global understanding of the workload the longer the
application runs, and it would allow more flexibility of chunk size as the local queue
become smaller and work-stealing more important.
In iCh, the only parameter that the user must select is this ϵ in equation 2.8
Additionally, we demonstrate in Section 2.6 that iCh is not that sensitive to ϵ, allowing
for the user to easily pick an ϵ and still have good performance. This observation allows
iCh to be used across different applications, systems, and inputs without hand-tuning by
the user to achieve “good” speedups.
Once the thread is classified into its category of low, normal, or high, the chunk
size of the thread can be adjusted. As noted previously, di is used to directly adjust the
chunk size, i.e., chunk size = |qi |/di . After classification, di is adjusted as follows. If
the thread is classified as low, then di = di /2, and the chunk size would increase (i.e.,

24

chunk size = |qi |/(di /2) = 2|qi |/di ). If the thread is classified as high, the chunk size
would decrease by allowing di = 2 × di . Lastly, if the thread is classed as normal, di
would remain the same. The logic for the direction of updating the values of di is as
follows. If a thread is classified as low, then it is completing fewer iterations than the
mean across all threads. This could be for several reasons, but one likely reason is that
the chunk size is already too small and the thread is spending too much time assigning
the next chunk. One other reason is that the thread is processing iterations slower,
e.g., lower frequency or processing memory requests. We note that even though the
workload is normally thought of in metrics such as the number of float-point operations,
the time of irregular applications is dominated by memory operations, communication,
and scheduling overheads, and the time is not dominated by the irregular proportion
of float-point operations. Therefore, iCh will assign a larger chunk size so that this
thread will be less likely to have to deal with interruptions due to scheduling or stealing.
On the other hand, a thread classified as high is completing iterations faster than the
thread. This thread has time to spare in updating chunk size, scheduling, or dealing with
stealing operations, and therefore, iCh makes the thread chunk smaller. We note that
the direction in which chunk size is updated is in opposition to the logic that one might
have if optimizing to balance the average amount of work assigned to each thread in a
chunk, where a thread classified as high would have its chunk size increased and a thread
classified as low would have its chunk size decreased to attempt to approach the mean
time spent processing a chunk across all cores.
In Figure 2.2, the chunk size adaptation can be observed. At Time=5, only Thread
2 is done with its work, thus is running faster than the other threads. Thread 2 reduces
25

its chunk size by half. This is possible since it is faster, and therefore, more time can be
spent dealing with coming back to the thread queue to leave more iterations that can
be stolen by other threads. By Time=6, Thread 1 observes that it is running within the
appropriate amount of time, i.e, the bound around µ on the far right-hand side, and so
it does not change its chunk size. This trend continues throughout the time steps.

2.4.3

Remote Work Stealing
At some point, the local queues will start to run out of work, and then work-

stealing is used, as in Time=12 in Figure 2.2. Many implementations of work-stealing
methods will fix a chunk size and use the THE protocol [7,23]. In Listing 2.1, we provide
the THE protocol’s stealing algorithm updated with iCh’s needed additional calculations.
The THE protocol tries to steal from a random victim (victim) half of its queue (line 4).
If stealing is unsuccessful (e.g., the victim’s beginning (victim->begin) is updated before
stealing is finished by the victim (line 12) or some other stealing thread) then it rolls back
its attempt while trying to minimize the number of locks required. Note that the victim
is only locked after queue size and other important variables are calculated (line 9), and
the victim is unlocked as soon as possible (line 15 or 18). Additionally, the thief does
not have to lock itself at all, and all this is set up to try to minimize contention dealing
with locks. Unlike the traditional method that has a fixed chunk size, iCh must deal
with the issue of assigning a chunk size to the stealing thread and does not know if the
victim’s thread and stealing thread’s current chunk sizes were due to characterizing the
workload or performance of the thread on the particular computational core. Therefore,
the stealing thread’s di and ki are updated based on the victim’s dj and kj by taking the
26

average of both, i.e., di = (di + dj )/2 and ki = (ki + kj )/2 (line 6 and 7). The reasoning
for this choice is as follows: the stealing thread knows some information from the victim;
however, the stealing thread does not know the accuracy of the information, so it tries
to average out the uncertainty with its own knowledge.
Listing 2.1: Remote stealing from victim with the THE protocol. Given the victim and
thief structures containing the queue’s current begin, end, ki (ki ), di (di ), and thread
lock.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

// Test i f i t e r a t i o n s s t i l l e x i s t t o s t e a l from v i c t i m
i f ( v i c t i m −>end − v i c t i m −>b e g i n > 0 ) {
// C a l c u l a t e h a l f o f t h e r e m a i n i n g i t e r a t i o n s from t h e v i c t i m
l e t h a l f s i z e = ( v i c t i m −>end − v i c t i m −>b e g i n ) / 2 ;
// C a l c u l a t e t h e a v e r a g e k i and d i o f v i c t i m and t h i e f
l e t l o c a l i t e r = ( v i c t i m −>k i + t h i e f −>k i ) / 2 ;
l e t l o c a l c h u n k = ( v i c t i m −>d i + t h i e f −>d i ) / 2 ;
// lock v i c t i m
lock ( v i c t i m )
l e t end = v i c t i m −>end − h a l f s i z e ;
v i c t i m −>end = end ;
i f ( end <= v i c t i m −>b e g i n ) {
// Abort −−− R o l l b a c k i f cannot s t e a l h a l f t h e r e m a i n i n g
v i c t i m −>end = end + c h u n k s i z e ;
unlock ( v i c t i m ) ;
return f a l s e ;
}
unlock ( v i c t i m ) ;
//Make s u r e new chunk s i z e i s v i a b l e
i f ( h a l f s i z e <= l o c a l c h u n k ) {
localchunk = h a l f s i z e ;
}
t h i e f −>b e g i n = end ;
t h i e f −>end = end + h a l f s i z e ;
t h i e f −>k i = l o c a l i t e r ;
t h i e f −>d i = l o c a l c h u n k ;
return true ;
}
return f a l s e ;

27

2.5

Experimental Setup

2.5.1

Test Applications
Test applications are selected that reflect the applications used by other self-

scheduling methods to evaluate. Additionally, other applications are introduced for testing that better define the limits and multiple behaviors that can be found in irregular
applications.

2.5.1.1

Synth (synth)

This application is a synthetic benchmark used by BinLPT [3] to evaluate their
performance. This benchmark allows the user to input a custom workload distribution.
In BinLPT [3], the benchmark is ran with linear, logarithmic, quadratic, and cubic workload distributions, but none of their testings demonstrates any significant performance
variation between these distributions. Therefore, we use two distributions built around
the exponential distribution to demonstrate exponential increase and decay in workloads.
This distribution is more representative of workloads that are highly imbalanced when
the loop either starts or ends. An example of this type of distribution would be tracking
particles in an aerosol as they enter a simulation domain that is evenly distributed, such
as coughing in a room. The iterations that compute simulation work near the entry
location will complete more computations than iterations tracing particles on the far
side of the simulation domain, e.g., the particles in the two-dimensional simulation in
Figure 2.3a. In particular, 1,000,000 samples from the the exponential distributions (i.e.,
pdf (x, β1 ) =

1
exp( −x
))
β
β

are generated with β = 1, 000, 000 and sorted. Entries sorted

28

in increasing order are labeled Exp-Increasing, and those sorted in descending order are
labeled Exp-Decreasing. We note that the range of loop workload is therefore 1,000,000
to 1. Figure 2.3b provides a histogram of the values in the exponential distribution.

(a) Uneven distribution in a particle simulation.

(b) Exponential distribution used for ExpDecreasing (pictured) and Exp-Increasing
(sorted in opposite order as pictured).

Figure 2.3: Examples of distributions for synth and LavaMD. These applications are
highly irregular and imbalanced with most of the work at the beginning of the loop.

2.5.1.2

Breadth-first search (BF)

The BF in the Rodinia [38] benchmark suite for heterogeneous computing is used.
We use two different inputs. The first input is generated using the graph generator
provided that uses the uniform distribution to randomly generate the number of neighborhood vertices (Uniform). The second input is generated using a modified graph generator that uses a power-law distribution to randomly generate the number of neighborhood vertices (Scale-Free) in order to construct a randomly generated scale-free network.
Scale-free networks are those with the fraction of P (k) nodes in the network having k
connections is P (k) ∼ k γ where γ = 2.3 in our tests. Scale-free networks are of particular importance as many common networks, such as social networks, computer networks,
29

protein-protein interactions, etc., are scale-free. Additionally, BF tends to be the building
block kernel in many scalable implementations of graph analytic [14] commonly used on
these networks and is the building block of other graph benchmarks used by others, such
as minimum spanning tree with Prim’s algorithm tested in BinLPT.

2.5.1.3

K-Means (Kmeans)

A Kmeans benchmark, as used in machine learning applications, is used from the
Rodinia Benchmark. In particular, this version uses the KDD Cups dataset related to
network packets, and this dataset is the standard in most Kmeans benchmarks. As a
popular method in machine learning, this application is used by many in HPC. This
benchmark is highly irregular, and the iteration workload distribution in the innermost
loop changes per outermost loop iteration. This particular benchmark should thus be
difficult for history-aware methods.

2.5.1.4

LavaMD (LavaMD)

This application is a Computational Fluid Dynamics (CFD) code that performs NBody Simulations from the Rodinia Benchmark. LavaMD simulates the interactions of solidification of molten tantalum and quenched uranium atoms in a finite three-dimensional
domain. We use an input size of 8 × 8 × 8 to construct the domain. At each step, force
calculations are done for particles within the same box. However, the simulation is set
up to have a cut-off radios of particles interaction at about the size of a box. As such,
little to no calculations are done between boxes. Additionally, particles move relatively

30

small distances, unlike our cough example above in Figure 2.3a. This benchmark is also
used by BinLPT in their evaluation.

2.5.1.5

Sparse Matrix-Vector Multiplication (SpMV)

Table 2.1: Input Graphs. Vertex and edge counts in millions. x̄: average number
of outgoing edges per vertex. ratio: maximal number of outgoing edges over minimal
number of outgoing edges. σ 2 : variance of the number of outgoing edges.
Input
I1: FullChip
I2: circuit5M dc
I3: wikipedia
I4: patents
I5: AS365
I6: delaunay n23
I7: wb-edu
I8: hugebubbles-10
I9: arabic-2005
I10: road usa
I11: nlpkkt240
I12: uk-2005
I13: kmer P1a
I14: kmer A2a
I15: kmer V1r

Area
Freescale
Freescale
Gleich
Pajek
DIMACS
DIMACS
Gleich
DIMACS
LAW
DIMACS
Schenk
LAW
GenBank
GenBank
GenBank

|V |
2.9
3.5
3.5
3.7
3.7
8.3
9.8
19.4
22.7
23.9
27.9
39.4
139.3
170.7
214

|E|
26.6
14.8
45
14.9
22.7
50.3
57.1
58.3
639.9
57.7
760.6
936.3
297.8
360.5
465.4

x̄
8.9
4.2
12.6
3.9
5.9
5.9
5.8
2.9
28.1
2.4
27.1
23.7
2.1
2.1
2.1

ratio
1.1e6
12
1.8e5
762
4.6
7
2.5e4
1
5.7e5
4.5
4.6
1.7e6
20
20
4

σ2
3.2e6
1
6.2e4
31.5
0.7
1.7
2.0e3
0
3.0e5
0.8
4.8
2.7e6
0.4
0.3
0.3

This algorithm is commonly used by sparse linear solver methods (e.g., successive
over-relaxation and Jacobi iterative method), and a great deal of work goes into optimizing it for specific hardware and inputs [16]. Most other self-scheduling evaluations use
SpMV in some form within a solver. The choice of using it within a solver allows there
to be a nested loop to add history. Additionally, other evaluations tend to focus on a
single input matrix with a regular structure. For our inputs, we use a collection of sparse
matrices from the SuiteSparse Collection [39]. Table 2.1 contains the sparse matrices,

31

where the number of vertices and edges are reported in millions (i.e., 106 ). Inputs are
picked due to their size, variation of density, and application areas. In particular, four
application areas are of particular interest. These areas are: Freescale: a collection from
circuit simulation of semiconductors; DIMACS: a collection from the DIMAC challenge
that is designed to further the development of large graph algorithms; LAW: a collection
of laboratory for web algorithms of web crawls to research data compression techniques;
and GenBank: a collection of protein k-mer graphs. Furthermore, we report the average
row density (x̄), the ratio of the maximal number of outgoing edges for a vertex over the
minimal number of outgoing edges for a vertex (ratio), and the variance of the number
of outgoing edges (σ 2 ) for each input. These numbers provide a sense of how sparse and
how uneven the work is distributed per vertex. Some inputs are very balanced, such as
input I8 (hugebubbles). Others have more variance like input I12 (uk-2005).

Table 2.2: Scheduling Methods used to test. Parameters are based on those used by
BinLPT [3]
Scheduling Method
guided
dynamic
taskloop
binlpt
stealing
iCh

2.5.2

Parameters
chunk size = {1, 2, 3}
chunk size = {1, 2, 3}
num task = num threads
chunk size = {128, 384, 576}
chunk size = {1, 2, 3, 64}
ϵ = 25%, 33%, 50%

Self-Scheduling Methods
Table 2.2 provides a summary of the scheduling methods used in the result section.

Two commonly used self-scheduling methods in OpenMP, i.e., guided and dynamic, are

32

tested. Additionally, the newly added taskloop

5

scheduling method is tested as it has

provided good results in several cases, and is an active research area for irregular applications. The workload-aware method from BinLPT [3] (binlpt) is also compared against.
We do not compare against HSS and KASS as no source is available, but an in-house version of it was tested against binlpt, and HSS and KASS were shown to perform worst than
binlpt. The chunk sizes for dynamic, guided , and binlpt are the same used by BinLPT.
Additionally, the work-stealing method (stealing) that iCh is based on is evaluated to
demonstrate the improvement that an adaptive chunk size makes to the base algorithm.
Lastly, the iCh method is evaluated with three levels of ϵ (i.e., 25%, 33%, 50%).

2.5.3

Test system
Bridges-RM at Pittsburgh SuperComputing Center [40] is used for testing. The

system contains two Intel Xeon E5-2695 v3 (Haswell) processors each with 14 cores and
128GB DDR4-2133. Other microarchitectures, such as Intel Skylake, are also tested,
but results do not vary much. We implement iCh inside of GNU libgomp under the
GPL v3 License. All codes are compiled with GCC 8.2.0 except taskloop. All taskloop
applications are compiled with GCC 9.2.0 as the synth application requires a reduction
operator that is not implemented in GCC 8.2.0. OpenMP threads are bound to cores
with OMP PROC BIND=true and OMP PLACES=cores.
5

https://developers.redhat.com/blog/2016/03/22/what-is-new-in-openmp-4-5-3/

33

2.6

Results

In this section, we break the results into two subsections. The first subsection
focuses on the best general speedup that is obtainable by our test methods. The goal of
the first subsection is to provide a high-level view of how well each of the self-scheduling
methods could speed up the wide range of applications in this paper if some minor tuning
of parameters, such as chunk size, is done. In the first subsection, we report our metrics
using the best time over the tuning parameters, such as chunk size, listed in Table 2.2.
The second subsection will focus on parameter selection of iCh. In particular, it will
show the sensitivity of iCh to parameter selection and compare the impact of parameter
selection.

2.6.1

Speedup
We define T (app, schedule, p) as the best time of running application (app) using

the self-scheduling method (schedule) on p threads across all method parameters in Table
2. Additionally, we define speedup as

speedup(app, schedule, p) =

2.6.1.1

T (app, guided, 1)
.
T (app, schedule, p)

(2.9)

Synth

Figure 2.4 presents the speedup of the synth application. The first subplot
presents the linear workload, which is used in BinLPT [3]. In this workload, we see
that all self-scheduling methods perform about equally as well, and this finding matches

34

Figure 2.4: Synth Speedup. The speedup for a linear distribution (Linear), exponential
increasing distribution (Exp-Increasing), and exponential decreasing distribution (ExpDecreasing). We observe that iCh performs at about the same as the best method for all
three inputs. However, we notice that no method (besides dynamic, stealing, and iCh)
does well across all three inputs.

that by BinLPT. Using 28 threads, we notice that taskloop is the worst-performing. The
next subplot presents the exponential increasing order (Exp-Increasing). This means
that the iteration workload increases, and therefore, the first threads that are scheduled
will have the lightest workloads. In this case, guided , dynamic, and stealing are the best
self-scheduling methods, and they are about the same in terms of speedup. However,
we notice that binlpt no longer has good performance and taskloop struggles. The iCh
method is able to achieve a speedup close to the guided , dynamic, and stealing. Since
iCh is built around stealing, and stealing is one of the best self-scheduling methods for
this application and input workload, this difference in the speedup of stealing and iCh
provides insight into how much overhead the chunk size adjustment can have, which is
∼ 8.9% in terms of speedup. In the last subplot (Exp-Decreasing), the iteration workload
decreases, and therefore, the first threads that are scheduled will have the heaviest workloads. The self-scheduling methods of dynamic and stealing still perform well. However,
guided no longer does well and is even worse than taskloop. The performance of guided

35

is due to how guided assigns the largest chunk sizes of work to the first threads, and the
small chunk sizes later will not be able to rebalance the workload among the threads.
The iCh method again is not the best but is close (i.e., ∼ 6.5% difference in speedup
from stealing) to the best, indicating that the overhead of adapting the chunk size has
an overhead, but does not dramatically change performance.

2.6.1.2

Breadth-first search

Figure 2.5a presents the speedup of the BF application. The first subplot (Uniform) presents the speedup of BF applied to the graph when the number of neighboring
vertices generated from a uniform distribution. In this case, the best scheduling methods
are dynamic, iCh, and guided , in that order. We notice that stealing does not perform
as well as iCh even though it is the base algorithm for iCh, i.e., iCh is ∼ 9.6% better
in terms of speedup than stealing. This observation helps to demonstrate this work’s
conjecture that work-stealing based self-scheduling could be improved by having a chunk
size that will not interfere with stealing. The binlpt method does about as well as stealing, and taskloop does the worst. The second subplot (Scale-Free) presents the speedup
when BF is applied to a scale-free graph, i.e., the number of neighboring vertices follows
the power-law distribution. In this case, binlpt and iCh have the best speedups. We note
that the speedup for dynamic and guided did not generally go down, but binlpt and iCh
were just better able to improve the speedup. For binlpt, the performance increase seems
to be due to the secondary self-scheduling method, which is based on guided , performing better. For iCh, the performance was due to less overhead in chunk size adaption

36

(a) Breadth-first Search Speedup

(b) K-Means Speedup

Figure 2.5: Speedup for Breadth-first Search (a) and K-Means (b). Breath-first search
is tested with two inputs: Uniform and Scale-Free. We observe in (a) that iCh does
better than the stealing method alone that is used by iCh to help with imbalance. In
(b), we observe that only stealing and iCh can scale past 8 threads, and that iCh is again
able to outperform stealing alone.

(i.e., any overhead helped to yield better performance). More importantly, we observed
that iCh does much better than stealing (i.e., iCh is ∼ 54% better than stealing), and
demonstrates that the addition of adaptively changing chunk size pays off in performance.

2.6.1.3

K-Means

Figure 2.5b presents the speedup of the Kmeans application. This application is
known for not scaling well as the iteration workload is very uneven, and the workload
is changed for every outermost iteration. This makes any history-based self-scheduling
method that tries to learn the workload from a past iteration not useful, and the only
information that can be used is the information from that iteration itself. Because of
its highly irregular nature, the only self-scheduling methods that can obtain any real
speedup are iCh and stealing. For thread counts up to 8, all self-scheduling methods

37

perform equally as well. Based on experimenting with different thread placements, this
is due to memory pressure when threads are placed close and share cache space. However,
when a larger number of threads are used, such as 14 and 28, the memory pressure is
reduced and iCh and stealing can continue to speed up while the other methods do not
continue to scale. In this application, we again see that iCh is able to outperform the
standard stealing method by ∼ 12.3% in terms of speedup.

(a) LavaMD Speedup

(b) Spmv Speedup

Figure 2.6: Speedup for LavaMD and SpMV. For LavaMD (a), most self-scheduling
methods are about equal except taskloop and stealing. This demonstrates the improvement that iCh’s adaptive chunk size has over the base algorithm of stealing.

2.6.1.4

LavaMD

Figure 2.6a presents the speedup of the LavaMD application. Though this application is a molecular-dynamics application, the iteration workload is relatively well
balanced. We observe that almost all self-scheduling methods do about the same except
for stealing and taskloop. Both stealing and taskloop are unable to scale at all. For
38

stealing, this seems to be related to the few numbers of loops (i.e., 512 = 8 × 8 × 8) and
the relatively few chances stealing has to make up for: (a) trying to steal from a thread
that is locked up due to chunk size and (b) randomly selecting from nonoptimal choices
once the stealing thread tries to recover from the issue (a). However, the iCh method can
obtain speedup matching that of guided and dynamic. This is due to iCh quickly setting
a chunk size that would allow threads with small workloads to steal in order to rebalance.
This again demonstrates iCh’s adaptive chunk size improving the base implementation
of stealing.

2.6.1.5

Sparse Matrix-Vector Multiplication

The performance of the SpMV application varies greatly based on input. As
pointed out in the previous sections, a large variety of spare input matrices are tested
in order to obtain the wide range of performance of SpMV, and most SpMV like applications used to evaluate other self-scheduling methods tend to only focus on one sparse
matrix with a somewhat regular structure. Figure 2.6b presents the geometric mean of
speedups as bars. Additionally, the best and worst speedup for each scheduling method
on a particular thread count across all sparse matrix inputs is presented as upper and
lower whiskers of the bars. Over the whole test set, we observe that guided has the
best speedup with taskloop being the second best. Examining the best speedup over the
test set, stealing is able to obtain the best speedup (i.e., ∼ 21× on 28 threads) for the
wikipedia sparse matrix. On the other hand, guided obtains its maximal speedup on
28 threads (∼ 14.4×) on the patents sparse matrix. We noticed a similar trend for the

39

worst speedup. Overall, dynamic and binlpt perform very poorly. However, stealing also
does very poorly in the case of the sparse matrix hugebubble, and iCh can do better
than stealing in most cases. When comparing the sparse matrix variance (i.e., σ 2 in
Table 2.1) to the performance of iCh, a pattern starts to emerge. For sparse matrices
where variance is high (i.e., σ 2 > 4.8), iCh tends to do very well, i.e., it is either the
best self-scheduling method or very close. However, iCh does not do as well when the
variance is low. This provides some insight into why the overall geometric mean for iCh
is not as good for SpMV as in other applications as about half (i.e., 8/15 ∼ 53%) of the
sparse matrices have low variance, and thus, iCh does not perform well. It also provides
insight that iCh is not the best method if the user knows the workload has low variance,
as the overhead in iCh is too high.

2.6.1.6

Insight from all applications

Throughout these five different applications, iCh has performed well and been
one of the top self-scheduling methods using 28 threads. In particular, the speedup for
iCh is never more than ∼ 10% away from the best performing self-scheduling methods
for the first four applications. In the fifth application (SpMV), iCh is ∼ 17.5% worse
than the best self-scheduling method (i.e., guided ). However, iCh is still in the top three
best self-scheduling methods. Overall tested applications (including SpMV), the average
speedup difference of SpMV to the best performing self-scheduling method is ∼ 5.4%.
As such, iCh does accomplish its goal of being one of the best self-scheduling methods
over a wide range of irregular applications. However, if the application workload does

40

not have much variance (e.g., the work per iteration is almost uniformly distributed),
then iCh may not be the best choice.

2.6.2

Sensitivity
The previous subsection demonstrated that iCh could consistently provide scalable

performance for a large variety of irregular applications. However, many self-scheduling
methods have some parameters such as chunk size, and iCh has the parameter ϵ. Some
self-scheduling methods and application combinations are very sensitive to their parameters, e.g., selecting too small of a chunk size for dynamic may result in no speedup at
all. One such example of this is dynamic’s poor performance for the SpMV application.
In this subsection, we go deeper into determining the sensitivity of iCh to the parameter
ϵ and understanding how a bad choice of ϵ may result in poor performance of iCh. In
testing, iCh considers three values of ϵ, which are 25%, 33%, and 50%. We define the
metric:
max
ϵ sensitivity(app, p) =

T (app, iCh(ϵ), p)

ϵ∈{25%,33%,50%}

min

T (app, iCh(ϵ), p)

.

(2.10)

ϵ∈{25%,33%,50%}

This metric provides the ratio of the worst time of iCh over the best time of iCh over all
three ϵ choices. In particular, this metric will always be > 1, and a larger value indicates
a larger difference between the worst and best time. Note that the range of 25% to 50%
is a relatively small range when the user only considers whole number representations of ϵ
compared to the chunk size range that could exist for many other self-scheduling methods.
Additionally, we note that initial experiments with ϵ outside this range generally yields

41

Figure 2.7: Sensitivity. For each application, the ϵ sensitivity and worst stealing
ratios are provided to help analyze the sensitivity ofiCh to the ϵ parameter and measure
how ”bad” iCh could perform if a not using the ideal ϵ. The ϵ sensitivity (light blue)
is a ratio of the worst and best time for all tested ϵ. Therefore, the larger the difference
between the worst and best, the larger the ratio will be. The worst stealing is the ratio
of the time for iCh with the worst choice of ϵ over the best time for stealing. Therefore,
a value > 1 would indicate that the best stealing method is doing better, and a value
< 1 would indicate the worst iCh method is doing better.

bad results. In a more general sense, one can view ϵ as a parameter of the confidence
interval around the running average of loop iterations completed as in Figure 2.2.
Figure 2.7 presents ϵ sensitivity (i.e., light blue) across all the applications. For
synth with a linear iteration (Synth (Lin)) workload distribution, there is little to no
sensitivity in the time related to the choice of ϵ. However, this is not the case for the
other two distributions. In these two cases, the right choice of ϵ could result in up
to a 1.28× (e.g, ∼ 28%) increase of time on a single socket. Additionally, we notice
that the selection of ϵ becomes more important as the thread count increases. However,
even with a poor selection of ϵ, iCh is still able to be close to the best choice. For
example, ϵ sensitivity for synth on a decreasing exponential iteration workload (Synth
(Exp-Dec)) is ∼ 1.26× on 28 threads. However, the worst performing choice of iCh is

42

still performing ∼ 2.88× faster than guided . Overall, the best ϵ out of the three tested
for synth for all the workloads is 25%. BF has similar results as synth. That means
there is little difference between the performance of iCh due to ϵ when the workload
is somewhat evenly distributed (e.g., BF (Uniform)), and a sizable difference (∼ 1.22×
on 28 threads) when the workload is far less evenly distributed. For BF (Uniform), the
best ϵ overall thread counts tends to be 25%. For BF (Scale-free), the best ϵ overall
thread counts tends to be 33%. For Kmeans, LavaMD, and SpMV, we observe relatively
small differences in performance, i.e., under 1.10× difference. The best ϵ for Kmeans is
33%, and the best ϵ for LavaMD is 25%. The best ϵ for SpMV is highly dependent on
the sparse matrix input but tends towards 33%. Based on the above observations and
understanding of the application, we recommend that an ϵ of 25% be used on irregular
applications have somewhat uniform workload distribution (and relatively small amount
of work per iteration) and to increase ϵ towards 33% as the workload distribution become
more irregular (and relatively large amount of work per iteration).
As a self-scheduling method built around stealing, an important measure would
be how different is the performance of iCh when used with a poor selection for ϵ and
the performance of stealing if we knew a good chunk size. This type of analysis provides
insight into if iCh is worth using over stealing if a good parameter of ϵ is unknown. We
define the metric:
max
worst stealing(app, p) =

T (app, iCh(ϵ), p)

ϵ∈{25%,33%,50%}

min

T (app, stealing(chunk), p)

chunk∈{1,2,3,64}

43

.

(2.11)

Therefore, a value > 1 would mean that the best stealing method is doing better, and a
value < 1 would mean that the worst iCh method is doing better. Figure 2.7 presents this
metric (dark blue) for each of the applications. For synth, we observe a similar behavior
to that of the previous metric. This behavior indicates that iCh and stealing are doing
about the same when the loop workload is somewhat evenly distributed, but iCh with the
poorest choice of ϵ is not as good as the best stealing when the loop workload is not evenly
distributed. In more detail though, this behavior is not too big of a concern for synth as
stealing does very well on this application. For BF (Scale-free), the stealing method with
the best performance does much better than iCh with the worst performance on 8 and 14
threads (i.e., ∼ 1.48× and ∼ 1.32×), but the case is reversed on 28 threads (i.e., ∼ .56×).
Part of the reason for this turnaround is moving off the socket. At 28 threads, two NUMA
regions exist, and failure to steal from a queue on the same socket (particularly, across
the NUMA region) has a much larger penalty. For the other applications and inputs, iCh
with a poor choice of ϵ does about the same or better than stealing with the best choice
of the chunk size, which is a welcomed surprise and somewhat unexpected. This means
that iCh is the right choice, even if the user does not have an idea of the optimal ϵ value
as for many applications it can still outperform a tuned version of stealing.

2.7

Conclusion

In this chapter, we introduce a new self-scheduling method (called iCh) that aims
to provide adequate performance across a wide range of applications with as little tuning
as possible. As such, this method would be one that a user could select if they do not
have expert knowledge of the application and input, or they do not have time to fine-tune
44

parameters. The iCh method uses local queues to keep locally assigned loop iterations
and adapts the chunk size used to assign the active chunk of iterations a thread is working
on by using a cheap running approximation of iteration throughput. When a local thread
runs out of work, it steals work randomly. The adaptive chunk size allows the thread
to pace itself and tries to allow for the opportunity of other threads to steal from them
without failing due to too large of an active chunk. The iCh method is implemented into
the OpenMP runtime system of GCC.
In evaluating, iCh, five different applications are used, and several of these with
inputs that have very different workloads. These applications include: Synth (synth),
Breadth-first Search (BF), K-Means (Kmeans), LavaMD (LavaMD), and Sparse MatrixVector Multiplication (SpMV). The speedup of iCh is compared to that of the OpenMP
self-scheduling methods of dynamic, guided , and taskloop. Additionally, iCh is compared against the work-aware method known as binlpt, and compared against a generic
work-stealing method (stealing) used as the base algorithm for iCh. Across the various
applications, iCh demonstrates to be the only self-scheduling method that is constantly
good, as all other self-scheduling methods have at least one case where their tuned performance is worse than all the other methods. In particular, iCh is always within the
top three self-scheduler methods in all applications and is always within ∼ 10% of the
best speedup when using 28 threads (except for SpMV). Additionally, iCh is on average
∼ 5.4% away from the best speedup for all applications using 28 threads. In analyzing
the sensitivity to iCh’s only parameter ϵ, it is noted that ϵ could vary the performance
by as much as ∼ 1.26×. However, in most cases, the choice of ϵ is not important to
achieve good performance. Additionally, insight is provided on how to adjust ϵ if the
45

user has some basic understanding of the workload distribution (e.g., if the distribution
is somewhat uniform or very irregular).

46

CHAPTER 3

HETEROGENEOUS SPARSE MATRIX-VECTOR MULTIPLICATION

3.1

Introduction

This work has been submitted to the Journal of Parallel Computing with me as the
first author and my advisor, Dr. Joshua Booth, as the second author. My contributions
included writing the SpMV kernels for all CPU and GPU testing, devising the tuning
methodology, running the test suite, data analysis, and helping with writing and figures.
SpMV, i.e., y ← Ax where A is a sparse coefficient matrix and x, y are dense
vectors, is well-known to be a notoriously difficult kernel to optimize because of its
memory bandwidth requirements [22,41,42]. The primary challenge in optimizing SpMV
is its relatively poor performance on devices due to low computational intensity, i.e.,
the number of floating-point operations per memory load is low. Recalling the general
algorithm for SpMV, for each row (i) of A the nonzero elements are multiplied by their
corresponding elements in x and added together for a single element in y (yi ). Each of
these nonzeros in a row results in three loads (i.e., ai,j , xj , and yi ) and one store must
occur per multiply-add operation. Additionally, there exists little data reuse as each ai,j
is only used once and yi is only used for each row. Making matters even more difficult, the
one array that does see reuse (i.e., x) might be accessed effectively at random, depending
47

on the structure of the matrix. This makes it extremely difficult for cache structures
to consistently hold relevant data, and places SpMV firmly on the bandwidth-limited
roofline of the roofline model [13]. Figure 3.1 sketches the roofline model for the GPU
NVIDIA Ampere A100, following the observations of [43] and our data in Section 3.6.
Oftentimes, SpMV may only see a small fraction of the peak performance of a compute
device (often around 10%) because a very proportionally large amount of time is spent
waiting for memory access [44].

Figure 3.1: An example roofline model for GPUs for the NVIDIA Ampere 100. SpMV
is restricted by its low arithmetic intensity, and hence low FLOP/s.

Currently, SpMV formats such as coordinate list (COO) and compressed sparse
row (CSR) [45, 46] are used to reduce the memory overhead of storing sparse data. This
reduction aids large sparse matrices to better fit into the main memory and may reduce
the number of loads needed to keep data in higher cache levels. However, they are not
designed in consideration of parallel processing and the complex microarchitecture of
modern devices, such as nonuniform memory accesses (NUMA) on many-core systems.
A second challenge is the complexity of the optimization. This complexity comes in the

48

forms of the time to optimize and overhead in the size of the data structure. For example,
the process of autotuning SpMV was popular in the early 2000s [21, 47], and the cost for
tuning SpMV on many-core systems by constructing complex multilevel block structures
could be amortized over the length of the application. However, both the computational
device and the number of concurrent threads could theoretically change per iteration of
the application in modern heterogeneous systems, and the cost of autotuning would be
too high on such systems.
Therefore in this paper, we extend CSR-k for heterogeneous systems. This extension has the following significant benefits. The first is that CSR-k has a very small storage
overhead compared to CSR. Since CSR-k is an extension of the CSR format, we analyze
the overhead of additional storage and find that the overhead is always < 2.5% Secondly,
the format is easy to understand by application scientists as it does not require bit-level
indexing or specialized hash tables (unlike other heterogeneous SpMV formats [22, 48]).
As such, it is easy to integrate into existing codes and visualization software. This is
important for computational scientists looking to convert their CSR-based kernels to
CSR-k-based kernels. Lastly, the tuning time is far smaller than other methods (e.g.,
pOSKI [21,47]) as the parameter space is much smaller, and, as we observe in the paper,
the tuning can be done in constant time after training a simple model.
This thesis provides the extension of CSR-k for heterogeneous systems, in particular CPUs and GPUs, in the following manner:
• Background and related work is reviewed in Section 3.2

49

• An overview of the implementation changes needed for CSR-k on NVIDIA GPUs
is given in Section 3.3
• The first model-driven selection of tuning parameters for the CSR-k format to
provide constant-time parameter selection is given in Section 3.4
• A comparison of CSR-k to KokkosKernels [49] and NVIDIA’s cuSPARSE on Volta
V100 and Ampere A100 is showcased in Section 3.6
• A comparison of the impact of sparse matrix band reordering on performance is
shown in Section 3.6.1
• A comparison of CSR-k to Intel MKL on Xeon Platinum 8380 and Epyc 7742 is
given in Section 3.7
• An analysis of overheads for CSR-k is given in Section 3.8

3.2

Background

In this section, we present some of the most common implementation formats for
SpMV on CPU and CPU/GPU heterogeneous formats. We also present some current
trends in autotuning, mainly through automated format selection, and why existing
techniques fall short. Through this presentation, we outline the strengths and weaknesses
of each and the current state.

50

3.2.1

Formats for CPU
Though many SpMV formats have been studied [6, 45, 46, 50–52], two standard

formats have become popular for SpMV due to reducing memory overheads and their
relative ease of understanding, i.e., Coordinate list (COO) and Compressed Sparse Row
(CSR). COO is the most straightforward format for storing a sparse matrix. The COO
format contains three arrays (i.e., col idx, row idx, and vals) each array is the length
of the number of nonzeros (NNZ) in the sparse matrix. However, several drawbacks exist
for utilizing COO on many-core systems. These drawbacks include the overall storage
required is 3 × N N Z × 32 bits if we consider 32 bit integers and single-precision floatingpoint values, and the format itself does not provide any ordering of how the nonzero values
should be processed. As a result, a SpMV of a sparse matrix that is stored in COO based
on a random permutation of col idx and/or row idx would have poor performance, and
lock or atomic operations would be needed for parallel implementations.
CSR is designed to be a more space-efficient data structure than COO. It does
this by blocking nonzero elements into shared rows. The format contains three arrays
(i.e, row ptr, col idx, and vals). The row ptr array provides the running cumulative
sum of the nonzero elements in each row and hence has size m + 1 for an m × n sparse
matrix. The col idx provides the column index and the vals provides the values for
each of the nonzero elements. As such, the CSR requires (2 × N N Z + m + 1) × 32 bits
of data. Figure 3.2 provides an example of this format in black (ignoring SR and SSR).
Sparse blocked based formats, such as Block Compressed Sparse Row (BCSR) [50,
51] or Unaligned Block Compressed Sparse Row [52], take the idea of CSR into a second

51

Figure 3.2: Example of the CSR-3 data structure with the super-row pointer (sr ptr)
and super-super-row pointer (ssr ptr). Super-rows and super-super-rows are shown to
the right of the matrix.

dimension. Nonzero elements are grouped together in two-dimensional blocks that are
normally dense. Many sparse matrices from finite-element analysis often exhibit this
dense block substructure. These blocks are next grouped together in some outer blocking
structure, such as by row in BCSR. These can be highly optimized due to the number
of different parameters they provide, e.g., the number of rows and columns in a block.
However, the performance of BCSR depends on grouping together nonzero elements
into a block. Though these formats can reduce the memory needed to store the sparse
matrix, the true advantage of these formats is having smaller structures and memory
access patterns that better fit the hierarchical cache memory structures in many-core
systems. In particular, these small dense blocks can better fit into L1 or L2 caches.
In the application space, SpMV formats are hard to separate from reorderings.
Sparse reorderings permute nonzero elements, and these reorderings can provide better

52

memory accesses [31], improve iteration counts of sparse iterative solvers [53], and can
provide better groupings of nonzeros, such as those taken advantage of by many block
based formats and supernodes in sparse direct methods. An example of these orderings include reorderings that reduce the band around the diagonal, i.e., orderings that
pull nonzero elements toward the diagonal, such as spectral orderings [54] and Reverse
Chuthill-McKee (RCM) [31]. Therefore, it is almost always the case that some reordering
is applied to the sparse matrix in conjunction with selecting a storage format.

3.2.2

CSR-k
One CSR based format that has shown to greatly improve parallel performance

and be more aware of the hierarchical cache structure is CSR-k [6, 55]. CSR-k is a
multilevel data structure for storing sparse matrices based on CSR. This format utilizes
both multilevel structures that better fit the cache structure of many-core systems and
multiple levels of reorderings to reduce the envelope or band size for a level. In particular,
the k represents a small integer ≥ 2 that defines the number of additional arrays that store
data about rows (i.e., the number of additional arrays equals k-1). Figure 3.2 provides
an example of CSR-3. In addition to the arrays for CSR, CSR-3 has two additional
arrays. These arrays are sr ptr and ssr ptr that contain pointers to the super-rows
and the super-super-rows. Super-rows are groups of contiguous rows, and the array is
compressed in a similar manner to how row ptr compresses information on the number of
contiguous nonzeros in a row. Therefore, sr ptr=[1, 3, 6] represents a sparse matrix
with two super-rows: one spanning rows 1-2, and one spanning rows 3-5. This method is
extended up to super-super-rows that group together contiguous super-rows. As a result,

53

the CSR-k format’s only memory overhead is the storage of these additional arrays.
Moreover, applications and libraries that have not been fitted to utilize CSR-k, such as a
visualization tools or checkpointing, can still process the format as they would a normal
CSR format without the overhead of storing both. As such, this makes it an ideal base
format for heterogeneous systems as most have support for CSR, even if not for CSR-k.
For completeness, we provide the pseudocode for SpMV with CSR-3 on a manycore system in Listing 3.1. Lines 4-5 parallelize the outermost for loop, which iterates
over super-super rows (e.g., num ssr = 3 in Fig 3.2). In lines 6-9, the bounds for one
super-super-row are fetched and iterated across (e.g., ssr ptr in Fig 3.2). Similarly, lines
10-13 fetch the bounds for one super-row and are iterated across (e.g., sr prt in Fig 3.2).
Finally, lines 14-21 implement the actual work in a similar manner to a CSR-based kernel.
Listing 3.1: SpMV-3 CPU kernel
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

function SpMV 3 ( num ssr , s s r p t r [ ] , s r p t r [ ] , r p t r [ ] ,
col idx [ ] , vals [ ] , x [ ] , y [ ] ) {
#pragma omp p a r a l l e l f o r
f o r i = 0 to num ssr {
let s s r s t a r t = s s r p t r [ i ]
let ssr end = s s r p t r [ i + 1]
fo r j = s s r s t a r t to s s r e n d {
let s r s t a r t = sr ptr [ j ]
let sr end = sr ptr [ j + 1]
for k
let
let
let

= s r s t a r t to s r e n d {
r start = r ptr [k ]
r end = r p t r [ k + 1]
temp = 0 . 0

f o r l = r s t a r t to r e n d {
temp += v a l s [ l ] ∗ x [ c o l i d x [ l ] ]
}
y [ k ] = temp
}
}
}
}

54

CSR-k supports multiple orderings. In particular, one for SpMV and one for
sparse triangular solve as both require very different access patterns and have different
structures (i.e., the ordering for SpMV tries to reduce access distance and the ordering
for sparse triangular solve tires to minimize dependencies in the elimination graph). The
ordering used by CSR-k for SpMV is called Band-k. This ordering applies band-reducing
orderings (e.g., RCM) to the multiple levels the super-super-row and super-row structure
provides. In order to apply these band reducing orderings, a graph representation of the
rows (i.e., each row is a vertex and each nonzero is an edge) is generated. The vertices
are coarsened into super-rows and super-super-rows, and the band reducing ordering is
applied to each level of the coarsened row from the highest level (i.e., super-super-row for
CSR-3) to the lowest level (i.e., row). This process normally produces an ordering with
a band wider than that produced by RCM, but better fits the format structure. As such,
we utilize the standard Band-k format for our extension as well. Moreover, the multiple
levels of the coarsened-band ordering aims to try and help with load imbalance.

3.2.3

Formats for GPU
One notable format has a long history of being pushed on GPU, namely ELLPACK

(ELL) format [56] and its derivatives. ELL format is a blocking format, which takes an
m × n sparse matrix and stores it as two m × k dense matrices, where k is the number of
nonzeros in the densest row of the sparse matrix. One matrix stores the nonzeros shifted
left, padded on the right with zeros. The other matrix stores their corresponding columns
also padded with zeros. ELL and its many variants were used heavily for GPU-based
SpMV in the early days of GPGPU, due to its friendliness to vector architectures [57,58],

55

but suffer from the issue of excess memory overhead. For instance, if an irregular sparse
matrix had the densest row containing 40 nonzeros, but an average nonzero count of 10,
the ELL format would incur a 300% memory overhead. Due to the fact that GPUs often
have much less memory than is allotted to the CPU, this means that many large irregular
matrices that would fit with CSR format cannot fit using ELL format. Because of the
potentially astronomical memory overhead incurred by ELL and its derivatives, we find
ELL to be a failure on GPU.

3.2.4

Heterogeneous Formats
Heterogeneous formats are those that are designed to fit multiple different com-

putational devices. These have become popular due to the importance of heterogeneous
computing systems. Several popular formats stand out in this area. The first is the use
of blocked sparse format on GPU devices. Eberhadt and Hoemmen [51] demonstrate
that BCSR is a reasonable format for both many-core CPUs and GPUs when the sparse
matrix contains some block substructure. However, a specialized algorithm for SpMV on
GPU must be reworked for the particular device such as their by-column algorithm in
order to out-compete NVIDIA’s built-in sparse matrix library cuSPARSE. On the other
hand, Liu and Vinter [22]’s CSR5 introduces a CSR based format that utilizes a number
of tiles to improve performance on both many-core CPUs and GPUs. These tiles are of
size σ by ω and are autotuned to better fit the SIMD nature of the device. The elements
in the col idx are ordered in a way to fit these tiles and additional vectors (i.e., a title
pointer and a title descriptor with lengths that are based on the choice of σ and ω) are
kept for title information along with a bit-flag the length of the number of nonzeros in

56

the sparse matrix. Therefore, the format is tuned to the device, and storing the spare
matrix for additional devices would require keeping additional title pointers and title descriptors. Utilizing this format, CSR5 can obtain on average the maximum performance
of either standard CSR or the autotuned pOSKI [47] on many-core CPU systems and is
able to obtain on average the maximum performance of all tests kernels on GPUs. As
such, the overheads for storing in CSR5 are somewhat similar to that of CSR-k, however, it is arguably a much more complex structure to understand and implement with
so many tile level arrays, title pointer, title descriptors, and requiring bit-wise computations. Lastly, work by Aliaga et al. [48] examines the use of a compressed COO format.
They demonstrate a truly heterogeneous format that can be stored in the same format
for both many-core CPUs and GPUs. This format’s performance on CPU is on average
better than Intel MKL’s CSR utilizing 16 AMD Epyc cores and is better on NVIDIA
V100 and A100 than cuSPARSE’s COO. However, this format compresses values into
integer values using a look-up table (LUT) and requires bit-wise computation for determining blocking. As such, both these methods are difficult for application scientists that
may want to tune or modify an application.

3.2.5

Autotuning SpMV
Many methods for autotuning SpMV have been explored. For instance, recent

literature has been investigating automatic format selection for SpMV. Zhao, et al. have
proposed using deep neural networks for automatic SpMV format selection [12]. While
their method shows good performance, the start-up cost is astronomical, with the first
suite of tests and training taking about 75 hours. While the training itself only takes

57

about 27 minutes, they use a dataset of 9200 matrices, each one needing to be run several
times with many different formats. A couple other works have been proposed that do
not require a deep learning model for automatic format selection [59, 60], though their
accuracy is slightly lower than Zhao, et al.’s work. These works use decision-based models
in if-then format. However, all aforementioned models require running thousands of
matrices under different formats, taking a very large amount of time. Many current
vendor-provided sparse linear algebra libraries, such as NVIDIA cuSPARSE and Intel
MKL perform some sort of runtime tuning, though because details are not available to
the public, we cannot comment on the techniques they use. Evidence for this is explored
in Section 3.5.4. pOSKI [47] also shows some runtime tuning, though the original CSR-k
paper demonstrates how cheap CSR-k’s tuning is in comparison [6]. We improve on this
tuning further in this work.

3.3

CSR-k for Exploiting Multidimensional Block Structure in NVIDIA
GPUs

In this section, we provide an overview of how to extend CSR-k for NVIDIA
GPUs utilizing CUDA. We note that there is nothing different in the storage format of
CSR-3 for a GPU and the storage format of CSR-3 for a CPU that is introduced in the
previous section. The difference is how we interpret the hierarchical levels within CSR-k
and how this optimally gets implemented in the GPU algorithm. As noted before, the
multiple levels in the CSR-k for a CPU can be interpreted as blocking sets of rows into
super-super-rows and super-rows that better fit the hierarchical level cache structure on
many-core systems. The algorithm implementation of this is straightforward with each
58

level providing an additional level of for-loops over the super-super-rows and super-rows.
Listing 3.1 provides an example of this code for CSR-3 on a CPU.

Figure 3.3: Diagram of the NVIDIA GPU memory hierarchy. A global shared memory
(e.g., HBM2E) is at the bottom. A more standard L2 is shared by streaming multiprocessors (SM). Each SM has its own L1 that can act as a private data cache or shared
memory. Each FP is a floating-point unit.

However, NVIDIA GPUs utilizing CUDA do not have the same types of hierarchical level cache structure as a CPU. Figure 3.3 provides a representation of the memory
and execution structure of a NVIDIA GPU. The GPU has a large (e.g., 80GB) shared
global main memory consisting of HBM2 or HBM2E high-bandwidth memory for higher
end systems, a shared L2 cache, and a memory layer that can be partitioned as a private
L1 data cache or shared memory for each streaming multiprocessor (SM). We also note
that these caches, such as the L1 data cache are much smaller per thread than a traditional CPU cache. For instance, AMD Rome (e.g., AMD Gen2 Epyc) has 32 KB per
core of private L1 data cache, while the NVIDIA Ampere A100 has 192 KB per SM of
L1 data cache. Since each SM has 64 32-bit floating-point units (FP), that equates to an
59

average of 3 KB/thread, though this cache is shared across all FP execution units in an
SM. Additionally, how these resources are scheduled are fundamentally different than a
CPU. Each thread on a CPU can be fixed or pinned to a particular computational core
and remain there until it has completed its work. In the CSR-k case, this work could be
the execution of a particular set of super-rows. However, 32 threads are blocked together
to form a warp on these GPUs. It is more convenient to think of this warp as a single
SIMD intrinsic on CPUs. Groups of these threads/warps combine to form a thread block,
and a thread block supports at most 1024 threads. These thread blocks are assigned to
an SM and will not migrate to other SMs (i.e., they are fixed or pinned). Therefore, it
is easier to think of each these blocks as an assignment of a core to a super-row. This
mapping only goes so far as each SM may have multiple thread blocks and the order of
execution of these thread blocks is unknown. These blocks are further coarsened into a
grid. Due to how the scheduler is implemented, most efficient CUDA kernels are mapped
into a computational grid or cube. This allows better exploitation of the 3D nature of
CUDA blocks and the grid. Utilizing this structure often aids in better locality of data,
since the CUDA runtime is able to schedule threads close in a block closer on the die.
Using this cubic environment, we can map a multilevel data structure efficiently to a 2D
or 3D block.
Therefore, we can utilize the hierarchical structure of the GPU even if not directly
related to the cache level size for CSR-k, and in the next section we justify super-superrow and super-row sizes based on this argument if we utilize CSR-3 format. As such, we
will only focus on a k = 3 for these NVIDIA GPUs. We start by assuming that an SM is
a core or shared-memory computational area in a standard CPU with a shared memory
60

Figure 3.4: CUDA grid and block layout for GPU SpMV algorithms GPUSpMV-3 and
GPUSpMV-3.5 in relationship to CSR-3.

cache, i.e., the L1 data cache, even though each SM may have multiple blocks mapped to
it and contains multiple dispatch units. This rationality is reasonable as a core may work
on multiple super-super-rows and these super-super-rows should be issued as a chunk.
Therefore, we can view a single block as a super-super-row each being assigned to an
SM. As such, we build a grid of these blocks, i.e., a 1D grid. We note that a single block
can only support up to 1024 threads, which will put a cap on the maximum size of our
super-super-rows. Within a block, the next level (i.e., super-row) loop can be mapped
to the y-dimension and the row can be mapped the x-dimension for form 2D blocks. We
note that the order that x- and y-dimensions are structured in the code is important for
GPU dispatch. Warps are grouped first by thread adjacency in the block x-dimension,
then the y-dimension, then the z-dimension. Therefore, organizing the threads that work
on the inner loops (which work on spatially adjacent data) along the x-dimension first is
important for data locality. We call this base code GPUSpMV-3.
Figure 3.4 provides a diagram of the CSR-3 format’s layout in terms of CUDA
thread blocks. As such, super-super-row 0 (SSR0) will map onto Block 0. The y-

61

dimension of the block will be mapped on the the super-row (SR), e.g., y0 maps to
SR0 and y1 maps to SR1. The accompanying pseudocode is provided in Listing 3.2. We
note that this code is relatively simple and easy to implement in CUDA as it relies on
the simple CSR-3 structure and CUDA block parallel functions.
Listing 3.2: GPUSpMV-3 GPU kernel
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

function cuSpMV 3 ( n u m c o a r s e s t r o w s , s s r p t r [ ] , s r p t r [ ] ,
r ptr [ ] , col idx [ ] , vals [ ] , x [ ] , y [ ] ) {
let block = blockIdx . x
let s s r s t a r t = s s r p t r [ block ]
let sr end = s s r p t r [ block + 1]
f o r i = s s r s t a r t to s s r e n d
p a r a l l e l i z e across blockDim . y {
let s r s t a r t = sr ptr [ i ]
let sr end = sr ptr [ i + 1]
fo r j = s r s t a r t to s r e n d
p a r a l l e l i z e across blockDim . x {
let r s t a r t = r ptr [ j ]
let r end = r p t r [ j + 1]
l e t temp = 0 . 0
f o r k = r s t a r t to r e n d
no p a r a l l e l i z a t i o n {
temp += v a l s [ k ] ∗ x [ c o l i d x [ k ] ]
}
y [ k ] = temp
}
}
}

The last for-loop, i.e., the inner product of the sparse matrix row and the vector, (Line 19-21 in Listing 3.2) in the algorithm provides an additional level of possible
parallelism. The original CSR-k paper did not consider this level of parallelism because
of overheads with aligning and data movement for SIMD intrinsics in the test hardware.
As noted in the last section, CSR5 explicitly lays out its tiles for these SIMD operations
but at the cost of complexity of the format. This level of parallelism is important and
62

needs to be addressed in GPU codes. However, if the number of nonzeros per row is
relatively small (e.g., < 8), utilizing this level of parallelism would reduce the available
number of overall threads to a block and would require the overhead of utilizing part of
the L1 data cache for shared memory. Through experimentation, we discovered that 8
nonzero elements per row is what is required to improve perform with parallelization in
this level. We denote the algorithm with the parallelism of the last level as GPUSpMV3.5, and pseudocode is provided in Listing 3.3. Figure 3.4 also provides the structure for
GPUSpMV-3.5. However, SSR are now the blocks, the SR are the z-dimension, the rows
are the y-dimension, and nonzeros in the rows are the x-dimension.
Listing 3.3: GPUSpMV-3.5 GPU kernel
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

function cuSpMV 3 . 5 ( n u m c o a r s e s t r o w s , s s r p t r [ ] , s r p t r [ ] ,
r ptr [ ] , col idx [ ] , vals [ ] , x [ ] , y [ ] ) {
let block = blockIdx . x
let s s r s t a r t = s s r p t r [ block ]
let ssr end = s s r p t r [ block + 1]
f o r i = s s r s t a r t to s s r e n d
p a r a l l e l i z e across blockDim . z {
let s r s t a r t = sr ptr [ i ]
let sr end = sr ptr [ i + 1]
fo r j = s r s t a r t to s r e n d
p a r a l l e l i z e across blockDim . y {
let r s t a r t = r ptr [ j ]
let r end = r p t r [ j + 1]
l e t temp [ blockDim . x ] = f i l l ( 0 . 0 )
f o r k = r s t a r t to r e n d
p a r a l l e l i z e across blockDim . x {
temp [ t h r e a d I d x . x ] +=
vals [ k ] ∗ x [ col idx [ k ] ]
}
y [ k ] = p a r a l l e l r e d u c t i o n ( temp )
}
}
}

63

For optimization purposes, we do a very aggressive loop unrolling for the innermost
loop of GPUSpMV-3. The temp array in GPUSpMV-3.5 is allocated as a block-local
shared memory space for fast read and write access.

3.3.1

Banding
The GPU algorithm still utilizes the banding of Band-k provided in [6]. While

debate may exist on if ordering is important for GPU as it is with CPU, we find that it
does. On GPU, banding can provide a significant advantage as opposed to a naturally
ordered matrix. In Section 3.6, we provide evidence that ordering matters for GPU
SpMV for CSR-k and one competitor SpMV library.

3.4

3.4.1

Tuning Optimal Structure

GPU
When tuning CSR-k for CPUs normally a low cost autotuning method is needed,

and the original paper notes how inexpensive this autotuning is compared to pOSKI.
Autotuning on a GPU opens up many new parameters, such as block dimensions and if
the inner product should be parallelized (e.g., GPUSpMV-3 vs GPUSpMV-3.5). However,
there exist some standards that can help guide this tuning, and these standards can even
be reduced to a closed form heuristic formula that can determine the super-row and
super-super-row sizes in constant time for a given sparse matrix after some initial tuning
on a given device.

64

The first of these standards is based on the relationship between the number of
threads and block size. Remember that only 1024 threads can be used per block and that
32 threads are launched at a time in a warp. Therefore, the dimensions of the block should
be based off of multiples of 32. The second of these standard is having enough work to
keep each thread busy, but limit the working memory size of each thread so accesses are
not hitting slower caches or main memory. Additionally, making more threads is more
likely to amortize any imbalance in work each thread might have. The average amount of
work a single row must compute (in GPUSpMV-3) is based on the average row density,
i.e., rdensity = N N Z/N where N N Z is the number of nonzeros and N is the number
of rows. Additionally, we have experimentally determined as noted before that serial
computation of the inner product per row only makes sense when rdensity < 8. Based
on these two facts and common block dimensions used in GPUs, we only need to consider
the following cases:
Case 1: rdensity ≤ 8
The block dimensions are set as 8 × 12.
Case 2: 8 < rdensity ≤ 16
The block dimensions are set as 4 × 8 × 12.
Case 3: 16 < rdensity ≤ 32
The block dimensions are set as 8 × 8 × 8.
Case 4: 32 < rdensity ≤ 64
The block dimensions are set as 16 × 8 × 4.

65

Case 5: 64 < rdensity
The block dimensions are set as 32 × 8 × 2.
We can interpret these values as follows. In Case 1, four rows along the y-dimension (i.e.,
8 × 4 threads) would make up a warp. Since the y-dimension corresponds to a super-row
in GPUSpMV-3, a single warp would have four super-rows. This may seem like a lot of
work, but the rdensity is relatively small in this case. On the other hand, in Case 5,
one row along the y-dimension would make up a warp due to the number of number of
nonzeros in a single row needing to be computed.
Once these standard block sizes are set, empirical runs can be completed with
different sizes of super-rows and super-super-rows over a suite of sparse matrices. We can
use these runs with the following modeling method to determine a closed form heuristic
for selection super-super-row and super-row sizes in the future. Using these test runs, we
perform a logarithmic regression over the dataset, with the x-values being rdensity and
the y values being the optimal super-super-row or super-row sizes. The SSRS (supersuper-row size) and SRS (super-row size) parameters are tuned independently. That is,
during testing, all reasonable combinations of SSRS and SRS are tested (some set of
representative sizes between 4 and 48 for GPU) and the regression is performed twice:
one on the optimal SSRS parameters, and one on the optimal SRS parameters. The
specific set of combinations for SSRS and SRS tested are described as follows:

(SSRS, SRS) ∈

5
[


!2
2i , 1.5 · 2i

i=2

where ϕ2 is shorthand for the set Cartesian product ϕ × ϕ.
66

Finally, since the logarithmic distribution tends to yield a formula that drops much
below optimal when rdensity becomes large, the coefficient of the natural logarithm was
lowered by hand to better fit the optimal SSRS and SRS with high rdensity.
From this method, we achieve the following formula for tuning on the NVIDIA
Volta V100:
SSRS = ⌊8.900 − 1.25 · ln (rdensity)⌉

and
SRS = ⌊10.146 − 1.50 · ln (rdensity)⌉.

where ⌊x⌉ represents a round-to-nearest, half towards positive infinity operation.
These numbers would have to be derived for each different machine that has considerable microarchitecture differences. For NVIDIA Ampere, a slight tuning is required.
The same process is used:

SSRS = ⌊9.175 − 1.32 · ln (rdensity)⌉

and
SRS = ⌊20.500 − 3.50 · ln (rdensity)⌉.

As with Ampere, the coefficient of the natural logarithm is lowered so that the
SSRS and SRS do not drop too low on the higher row densities. The constant term is
unchanged in both instances.
After the initial SSRS and SRS are determined, they can be further tuned based
on rdensity. In particular, the SSRS and SRS need to be updated to account from their
67

derivation from the original assumption of utilizing GPUSpMV-3 to GPUSpMV-3.5, in
addition to accounting for the changing block dimensions as rdensity grows. As such,
they are modified as follows on NVIDIA Volta:
Case 1: rdensity ≤ 8
Tune SSRS and SRS no further.
Case 2: 8 < rdensity ≤ 16
SSRS = ⌊SSRS × 1.5⌉
SRS = SRSS × 2.
Case 3: 16 < rdensity ≤ 32
SSRS = SSRS × 4
SRS = ⌊SSRS/2⌋.
Case 4: 32 < rdensity
SSRS = SSRS × 5
SRS = ⌊SSRS/2⌋.

68

And as follows on NVIDIA Ampere:
Case 1: rdensity ≤ 8
Tune SSRS and SRS no further.
Case 2: 8 < rdensity ≤ 16
Tune SSRS no further.
SRS = SRSS × 4.
Case 3: 16 < rdensity ≤ 32
SSRS = ⌊SSRS × 2.5⌉
SRS = SSRS × 3.
Case 4: 32 < rdensity ≤ 64
SSRS = SSRS × 2
SRS = SSRS × 2.
Case 5: 64 < rdensity
SSRS = ⌊SSRS × 2.7⌉
SRS = ⌊SSRS/4⌉.

3.4.2

CPU
On CPU, CSR-2 is used following the findings of [6]. When running CSR-2 on

a CPU, the kernel is much less dependent on having optimal SSRS and SRS within a
narrow window for good performance. Additionally, the added complexity of the cache
hierarchy in CPUs as opposed to GPUs makes tuning much more difficult to do optimally,

69

as there is no clear correlation between optimal SSRS/SRS and known matrix attributes
without analyzing matrix structure. Due to these combinations of factors, we find that in
an ideal scenario, each matrix would be tuned individually, using a set of representative
SRS between 8 and 3072, which are specified as the set:

SRS ∈

11
[


2i , 1.5 · 2i .

i=3

The combined effect of larger caches on CPU and using CSR-2 instead of CSR-3 means
that much higher SRS are often preferred. Namely, most matrices on CPU using CSR-2
prefer SRS in the range of 40-1000, as opposed to most matrices on GPU using CSR-3
preferring SSRS/SRS in the range of 4-12.
If constant-time tuning is preferred at the expense of performance, we have found
that taking the geometric mean of optimal SRS across a representative dataset yields
decent performance. For instance, one might use a SRS = 96 for all matrices on CPU to
yield performance that is often “good enough.” This is explored empirically in Section 3.7.

3.5

Test Setup

In this section, we provide the test setup.

3.5.1

Test Systems
The test setup consists of four systems. System 1 is used for GPU testing and

has two Xeon E5-2650v4 CPUs (“Broadwell”) with 12 cores each, and contains several
NVIDIA V100 GPUs (“Volta”). These GPUs contain 32 GB of main memory, and peak

70

Table 3.1: All systems we use for testing.
System
1
2
3
4

Label
Volta
Ampere
Rome
Ice Lake

CPU
Intel Xeon E5-2650v4
AMD Epyc 7713
AMD Epyc 7742
Intel Xeon Platinum 8380

Physical Cores
24
128
128
80

Memory
128 GB
1 TB
256 GB
256 GB

GPU
NVIDIA V100
NVIDIA A100
None
None

GPU Memory
32 GB
40 GB
None
None

memory bandwidth of 900 GB/s. System 2 is also used for GPU testing and has two Epyc
7713 CPUs (“Milan” or “Zen 3”) with 64 cores each, and contains NVIDIA A100 GPUs
(“Ampere”). These GPUs contain 40 GB of memory, and a peak memory bandwidth
of 1555 GB/s. On both GPU systems, only one GPU is used. System 3 is for CPU
testing. It contains two Epyc 7742 CPUs (“Rome” or “Zen 2”) and 256 GB of memory
in 8 memory channels. Simultaneous multi-threading (SMT) is disabled for system 3.
System 4 is also for CPU testing, and contains two Intel Xeon Platinum 8380 CPUs (“Ice
Lake Server”) and 256 GB of memory. SMT is enabled for system 4, but we opt not to
use it in tests. Table 3.1 provides details on each of these systems.

3.5.2

Tested Libraries
For GPU experiments, GPUSpMV-3 and GPUSpMV-3.5 are both implemented in

C++ with CUDA and compiled with the NVIDIA’s NVHPC compiler using compilation
tools version 11.4. Our CSR-k SpMV implementations are compared against cuSPARSE
v11.4 and KokkosKernels v3.4.1 [49]. NVIDIA’s cuSPARSE is an optimized sparse linear
algebra library written for use on NVIDIA GPUs. It provides SpMV implementations in
a number of formats, and we compare against the library’s CSR implementation. NVIDIA’s cuSPARSE is part of the CUDA runtime, so all runs are performed using CUDA

71

runtime 11.4, which has support for both Volta and Ampere GPUs. KokkosKernels
is a computational mathematics library published by Sandia National Laboratories. It
supports sparse and dense linear algebra, among other kernels, on both CPUs and GPUs.
The goal of KokkosKernels is to provide performance portable code to a number of parallel
runtime libraries without requiring the user to code in the different libraries. KokkosKernels is compiled for the SM 70 compute architecture, which is supported by our
Volta GPUs. The tested version of KokkosKernels does not support a build parameter
for SM 80, i.e., the compute architecture used by newer Ampere GPUs, and therefore
will only be tested on Volta.
For CPU experiments, we utilize Intel MKL v19.1.3 on system 3 and Intel MKL
v19.1.1 on system 4. Intel MKL (Math Kernel Library) is a computational mathematics
library that supports highly-optimized sparse linear algebra on CPUs. CSR-k is compiled
on system 3 with the AMD Optimizing C Compiler (AOCC) v2.3.0 with -O3 optimization
and -march=znver2. CSR-k is compiled on system 4 with the Intel v19.1.1 compiler with
-O3 -ipo optimization and -march=icelake-server. On both system 3 and system 4,
compiler pragmas are used to enable the maximum vectorization width for the innermost
loop (8-way SIMD on system 3 and 16-way SIMD on system 4) for CSR-k. Additionally,
OpenMP scheduling parameters are set to schedule(static) for CSR-k on systems 3
and 4.

3.5.3

Test Suite
We use 16 representative sparse matrices from the SuiteSparse matrix collec-

tion [61]. Table 3.2 provides a list of all sparse matrices along with their outermost

72

dimension (N), number of nonzeros (NNZ), and row density (rdensity). Sparse matrices
are listed in order of increasing rdensity. These sparse matrices range from N as small
as 62,631 (brack2) to as large as 18,318,143 (hugebubbles-00000) and rdensity from as
sparse as 2.758 (roadNet-TX) to as dense as 71.53 (bmwcra 1).

Table 3.2: Test suite of matrices including their name, (N), number of nonzeros (NNZ),
and row densities (rdensity).
ID
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

Matrix
roadNet-TX
hugetrace-00000
hugetric-00000
hugebubbles-00000
wi2010
G3 circuit
fl2010
ecology1
cont-300
delaunay n20
thermal2
brack2
wave
packing-500x100x100
Emilia 923
bmwcra 1

N
1,393,383
4,588,484
5,824,554
18,318,143
253,096
1,585,478
484,481
1,000,000
180,895
1,048,576
1,228,045
62,631
156,317
2,145,852
923,136
148,770

NNZ rdensity
3,843,320
2.76
13,758,266
2.99
17,467,046
2.99
54,940,162
2.99
1,209,404
4.77
7,660,826
4.83
2,346,294
4.84
4,996,000
4.99
988,195
5.46
6,291,372
6.00
8,580,313
6.98
733,118
11.71
2,118,662
13.55
34,976,486
16.30
40,373,538
43.74
10,641,602
71.53

Problem Type
Undirected Graph
DIMACS
DIMACS
DIMACS
DIMACS
Circuit Simulation
DIMACS
2D/3D Problem
Optimization Problem
DIMACS
Thermal Problem
2D/3D Problem
2D/3D Problem
DIMACS
Structural Problem
Structural Problem

For KokkosKernels, cuSPARSE, and MKL, all sparse matrices are first reordered
utilizing RCM obtained from GNU Octave’s symrcm. As noted before, RCM improves
convergence for some iterative solvers (e.g., CG) and improves performance due to reducing the stride of irregular accesses. We opt to feed CSR-k with matrices in their natural
ordering to demonstrate the efficacy of the Band-k algorithm [53]. This will demonstrate
even when competing algorithms are fed with bandwidth-reducing reorderings, Band-k

73

and CSR-k are able to match and outperform existing highly-optimized implementations
of SpMV.

3.5.4

Test Methodology
For GPU tests, data is copied ahead of time to the GPU, and only the SpMV

kernel execution time is measured. This is done to accurately model the behavior of
iterative solvers, which should not re-copy the data each iteration. For both CPU and
GPU tests, 20 runs are performed and the results are averaged via arithmetic mean.
On CPU tests, 5 untimed warmup runs are performed, because MKL appears to take
1-2 iterations before reaching maximum performance. For instance, when the matrix
hugebubbles in RCM ordering is run 20 times under MKL without a warmup, the first
iteration takes ∼2.8× longer than subsequent iterations. CSR-k is given the same 5
warmup runs for fairness.
cuSPARSE appears to show the same behavior, but only for poorly ordered or
badly reordered matrices. For example, the same hugebubbles matrix in its natural
ordering takes ∼37× longer the first iteration than subsequent iterations, but its RCM
ordering only takes ∼1.5× longer the first time than subsequent runs. Since we are giving
cuSPARSE RCM-ordered matrices, we opt to not perform any warmup runs on GPU.
KokkosKernels does not appear to display any similar behavior.

3.6

GPU Performance

In this section, we analyze the performance of CSR-3 on Volta and Ampere GPUs.
We tune the super-super-row and the super-row sizes based on the automatic analysis base
74

tuning in Section 3.4. We compare these results to both the CSR representation format
in cuSPARSE and KokkosKernels for the Volta and only cuSPARSE on the Ampere due
to KokkosKernels’s current compiler limitations. We utilize two forms of measurement
for performance. The first measures GFlop/s which is the standard method in other
works [22, 48] related to SpMV on GPUs to provide raw performance numbers. The
second measure is relative performance compare to cuSPARSE defined as:

Relative Perform(m) =

time (cuSP ARSE, m) − time (CSR-3, m)
× 100
max (time (cuSP ARSE, m) , time (CSR-3, m))

where m is the given sparse matrix.
Relative performance is one of the main performance metrics we use in this paper.
We choose this metric because scaling is mirrored across 0. In essence, if CSR-3 is twice
as fast as cuSPARSE, this metric will show 50% relative performance. If CSR-3 is half
as fast as cuSPARSE, this metric will show −50% relative performance. This metric is
not a linear scaling, but rather a reciprocal scaling. So, if CSR-3 is three times as fast as
cuSPARSE, this metric will show 67% relative performance, and if CSR-3 is four times as
fast, this metric will show 75% relative performance. Assuming some fixed cuSPARSE
time, as the time to compute a CSR-3 kernel approaches 0, this metric will approach
100%, and as the time to compute CSR-3 approaches ∞, this metric will approach
−100%. Finally, this metric shows 0% if CSR-3 and cuSPARSE are equally as fast.
Figure 3.5 provides the performance observations for Volta. The top figure (i.e.,
3.5a), provides the GFlop/s as bars, and the dashed lines are the average GFlop/s for the
test suite. The average GFlop/s for this test suit is 79.6 GFlop/s for cuSPARSE, 80.9

75

(a) GFlop/s on Volta

(b) Relative perform on Volta

Figure 3.5: Performance on Volta. With a couple exceptions, CSR-k tends to outperform cuSPARSE and KokkosKernels with higher relative perform and GFlop/s.

GFlop/s for KokkosKernels, and 87.7 GFlop/s for CSR-3. We note that the matrices are
ordered based on their rdensity from left to right. In theory, the more dense sparse matrices increase the potential computational intensity and should achieve a higher GFlop/s.
We do observe this trend with the last three matrices performing over 100 GFlop/s. Overall CSR-3 does outperform the other two formats except in a couple of cases. These two
cases are the matrices generated from graphs in the DIMAC challenge (i.e., matrices 2-4)
and the matrices with higher rdensity (i.e., matrices 14-16). In the first case, KokkosKernels is the best. We believe that this is due to KokkosKernels being highly tuned for
this set of matrices as observed by numerous publications from the KokkosKernels group
on DIMAC. As for the more dense matrices, CSR-3 is outperformed by cuSPARSE even
with CSR-3 utilizing our GPUSpMV-3.5 method. However, we note that most sparse
matrices from partial differential equations (PDEs) that would utilize an iterative solve
method (e.g., CG and GMRES) have rdensity < 40. In the lower figure, the relative

76

performance is reported. The average relative performance improvement is 17.3% over
cuSPARSE.
Figure 3.6 provides the performance on Ampere. The Ampere card is a big improvement relative to Volta. Ampere has a maximum peak 32-bit floating-point performance of 19.5 teraflop/s compared to the 15.7 teraflop/s of the Volta card. However,
the major difference comes in the form of the memory hierarchy. The main memory of
Ampere is much larger, the L2 cache is 7× larger, and a new asynchronous copy directly
from global memory to shared memory has been introduced. The left figure (i.e., 3.6a)
provides the GFlops/s for the test suite on Ampere when autotuning parameters for Ampere are found. We observe the average GFlop/s for cuSPARSE is 131.72 GFlop/s and
for CSR-3 is 142.93 GFlop/s. Again, CSR-3 outperforms cuSPARSE on almost all sparse
matrices except the last three very dense ones. The right figure (i.e., Figure 3.6b) pro-

(a) GFlop/s on Ampere

(b) Relative perform on Ampere

Figure 3.6: Performance on Ampere. As with Volta, CSR-k tends to outperform cuSPARSE with a couple exceptions.

77

vides the relative performance, and we observe the average relative performance increase
of approximately 18.9%.

3.6.1

Banding
We also perform a banding evaluation to ensure that the performance increase

from CSR-k is not due to a superior banding algorithm. We assert that the used implementation of Band-k is rather poor when compared to RCM, and thus puts CSR-k at a
disadvantage.
To demonstrate this assertion, we test KokkosKernels with matrices in their natural ordering, Band-k reordered matrices reduced to CSR, and RCM ordering. Additionally, we test CSR-k with Band-k matrices, and with RCM matrices that are then
fed into Band-k for a second reordering. We do not test MKL or cuSPARSE as it is
unknown from documentation and observations of warmup cost (Section 3.5.4) if either
of these does any reordering of their own. This last test is to simulate the potential
performance gains from a more intelligent Band-k implementation. For comparison, we
take the arithmetic mean of relative performance compared to KokkosKernels with RCM
ordering. Thus, the KokkosKernels (RCM) bar will show as 0 on the graph, with less
performant configurations < 0 and more performant configurations > 0.
Results are shown in Figure 3.7. All CSR-k configurations have a positive relative performance, demonstrating that they are outperforming KokkosKernels with RCM
ordering. KokkosKernels with Band-k ordering is the worst performing in the suite, underperforming even KokkosKernels with no ordering whatsoever. Thus, we may conclude

78

Figure 3.7: Banding analysis of KokkosKernels and CSR-k with miscellaneous banding
combinations. Band-k ordering performs worse than RCM or natural ordering when
using Kokkos.

that the performance gains are not due to a superior banding algorithm and that in fact,
CSR-k leaves performance on the table with a sub-par banding implementation.

3.7

CPU Performance

In this section, we verify the ability of CSR-k to improve performance on CPUs.
This is necessary as the original paper was printed over eight years ago (2014), and the
design of CPUs has changed. In particular, the three design changes have been increased
core count on a single socket, different caching structures (e.g., multiple segmented shared
caches as in Rome), and different hardware reuse techniques (e.g., newer Intel systems).
In order to keep the same style as the previous section, we report the GFlop/s for
a thread count matching the core count on one socket (e.g. 64 threads on Rome and 40
threads on Ice Lake). We further provide a scalability study for a handful of representative
thread counts. Finally, we demonstrate the performance penalty incurred by a constanttime tuning via fixed super-row size. We utilize CSR-2 for these experiments and test

79

(a) GFlop/s on Ice Lake

(b) Relative perform on Ice Lake

Figure 3.8: Performance on Ice Lake. MKL tends to perform better, but not by a huge
margin.

(a) GFlop/s on Rome

(b) Relative perform on Rome

Figure 3.9: Performance on Rome. MKL and CSR-k perform roughly identically.

against Intel Math Kernel Library. The average GFlop/s is reported as a dashed line in
figures 3.8a and 3.9a.

For these average cases, we notice that CSR-k is usually on-par

with MKL, falling slightly behind in GFlop/s. On Ice Lake, MKL had an average of 52.3
GFlop/s, while CSR-k had an average of 49.3 GFlop/s. On Rome, the numbers are even
closer, where MKL had an average of 75.1 GFlop/s and CSR-k had an average of 72.5
GFlop/s.
80

We use the same relative performance metric from the GPU section, defined here
as
Relative Perform(m) =

time (MKL, m) − time (CSR-3, m)
× 100
max (time (MKL, m) , time (CSR-3, m))

where m is the given sparse matrix. Figures 3.8b and 3.9b provide the relative performance of CSR-k versus MKL. We notice on Ice Lake that CSR-k is about −5.4%
relative performance compared to MKL, while on Rome, CSR-k is about 1.3% relative
performance compared to MKL.
Figures 3.10a and 3.10b presents our scalability study. For Ice Lake, we choose
the core counts c ∈ {1, 4, 16, 40}, and on Rome we choose c ∈ {1, 4, 16, 64}. Results are
the geometric mean of speedup across all matrices, and results are normalized to MKL
on 1 core for both systems. We notice that on Ice Lake, MKL tends to outperform CSR-2
on all core counts. However, both algorithms scale very well, with a maximum speedup
of ∼28.5× on 40 cores for MKL, and ∼25.5× on 40 cores for CSR-2. On Rome, MKL

(a) Scalability on Ice Lake

(b) Scalability on Rome

Figure 3.10: Scalability study. CSR-k is able to scale nearly equally to MKL on Rome,
though less than optimally on Ice Lake.

81

scales better up to 4 cores, where it is then surpassed by a thin margin by CSR-2 for
higher core counts. On 64 cores, MKL achieves a maximum speedup of ∼31.7×, and
CSR-2 achieves its maximum of ∼32.7×.
Compared to the GPU implementation of CSR-3, the CPU implementation of
CSR-2 is relatively unoptimized and naive. We use simple OpenMP statically-scheduled
parallelism on the outermost loop and compiler-specific vectorization pragmas on the
innermost loop. Finally, on Rome, we found that prefetching some values via compiler
intrinsics can help for larger matrices with N N Z > 12, 500, 000. However, we do not
consider the CPU implementation to be particularly optimized when compared to the
GPU implementation. We suspect that more aggressive optimizations could be beneficial
in closing the performance gap on both CPU systems. Several routes for optimization
were explored, such as unrolling-and-jamming the innermost 2 loops [62], aggressive
AVX intrinsics, and a miscellany of compiler-specific pragmas. However, we have not
extensively exhausted all possible routes for optimization, and leave this as a future
avenue of exploration.
Figure 3.11 shows the results of our best attempt at a constant-time tuning for
CPU-based CSR-k. We take the geometric mean of optimal super-row sizes across the
dataset, which is 81. We round this up to 96, which was in our super-row test set as
mentioned in Section 3.4.2. Finally, all matrices are tested with a fixed super-row size
of 96. The relative performance metric is used once again. The majority of the matrices
perform very well, with about −5% relative performance or better compared to their
optimal tunings. However, there are a handful of outliers like hugetrace or Emilia 923
which are much more sensitive to an optimal super-row size. Many of these matrices prefer
82

Figure 3.11: Constant super-row size on Rome, with SR fixed to 96. Most matrices
tend to perform only slightly less optimally, with a few outliers being very sensitive.

a super-row size in a very narrow window and are largely agnostic to changes outside
that window. We are unsure what causes this. Overall, we note that with the outliers
in the dataset, using a fixed SR = 96 results in a performance hit of −10.2% relative
performance. With the four most significant outliers removed from the dataset (those
with relative performance < −20%), the performance hit is −3.5% relative performance.
Finally, we note that the optimal super-row size can vary greatly, but usually
doesn’t significantly impact performance, aside from some outliers. Consistent with the
findings in [6], usually a super-row size between 40 and 1000 works well. However, there
were several matrices that preferred very small super-row sizes (such as brack2 which
performed optimally with a SRS of 8), and one that preferred a very large super-row size
(hugebubbles which performed optimally with a SRS of 2048).

83

3.8

Analysis

We lastly want to analyze the cost of utilizing CSR-k as a heterogeneous format.
There are two costs to consider, namely: step up cost and memory overhead for storage.
For the former, we argue that the original work and add-on triangular solution work
demonstrates that the setup cost could be amortized over the multiple uses in the application. Moreover, these works demonstrate that CSR-k’s setup cost is lower than an
autotuning method such as pOSKI. Our additional use of a formula-driven autotuning
for GPUs in Section 3.4 reduces the cost of tuning a particular sparse matrix for a GPU
once autotuning derives the parameters of the formula. For a test suite of our size, this
autotuning to derive the needed parameters can be done in under four hours. Therefore,
we claim that the tuning time is more than acceptable since you only need to derive the
parameters once for a new device and additional sparse matrices can be tuned using the
derived formula in constant time.
The second measure to consider is overhead and is one of the advantages of CSR-k.
Figure 3.12 provides the overhead percentage of utilizing just CSR-3 for GPU processing and CSR-3 + CSR-2 for GPU and CPU processing over CSR. In this diagram for
CSR-3, we use the SSRS and SRS determined by the closed-form heuristic explored in
Section 3.4. For CSR-2, we use an SRS = 96, as this is close to the geometric mean of
optimal SRS across the dataset, also mentioned in Section 3.4.
We observe that the cost of keeping both CSR-3 and CSR-2 data only requires a
couple of additional arrays of super-row pointers and does not add much to the overhead.
In the worst case (i.e., roadNet-TX), the overhead is just slightly over 2% for the combined

84

Figure 3.12: Storage overhead percentage of CSR-3 versus base CSR.

format. We also notice a trend that as the rdensity becomes very large, the overhead
decreases. Two factors contribute to this trend. The first factor is that the Band-k
ordering will more aggressively combine nodes in the coarsen phase due to the number of
heavy edges. The second factor is due to our tuning and utilization of GPUSpMV-3.5.
In the tuning of GPUSpMV-3.5, we try to exploit as many SIMD operations in a single
row by adding a new dimension. However, since the max number of threads is fixed in
a block (i.e, 1024), the number of threads that can be assigned to the outer dimension
decreases and so these blocks are normally made larger. The final factor is that the vals
and col idx matrix take up proportionally more space when compared to the compressed
row pointers as row density increases.
The low storage overhead for CSR-k is important in two ways. First, it demonstrates that CSR-k should not put excess demand on the memory subsystem compared
to CSR, which is important for sparse matrix formats since SpMV is limited by memory
bandwidth. Second, it demonstrates that a matrix in CSR-k format should run on most
or all devices that are capable of running the same matrix in CSR format. This is a tes-

85

tament to the portability and heterogeneity of the format, as a matrix in CSR-k format
has sufficiently small overhead to conceivably run on the vast majority of computational
devices that can run equivalent CSR matrices.

3.9

Conclusion

This chapter presents CSR-k as a heterogeneous format solution for SpMV. This
CSR based format is easy to understand, can be tuned quickly, and can be used as-is
by library calls that require a standard CSR format. Moreover, we present a method
to tune CSR-3 in constant time for a particular sparse matrix utilizing an autotuned
formula model using a relatively small dataset. Using tuned CSR-k, the format method
outperforms NVIDIA’s cuSPARSE by ∼17.3% on Volta and ∼18.9% on Ampere GPUs.
On AMD Rome and Intel Ice Lake CPUs, CSR-k maintains comparable performance to
Intel MKL, though falls slightly behind on Ice Lake. We suspect further hand-tuning
of the implementation could close this gap. Moreover, the additional memory overhead
to keep this heterogeneous format for both GPU and CPU devices was < 2.5% over
standard CSR. Therefore, we demonstrated that CSR-k is an ideal heterogeneous format
for many-core CPUs and NVIDIA GPUs.

86

CHAPTER 4

CONCLUSION

A requirement for tuning exits in high-performance computing at a fundamental
level. For SpMV, recent works have shown a trend towards big data, leading to expensive
and costly autotuning. Moreover, where autotuning is not possible, hand-tuning is often
a requirement, which is a long and tedious process. We present a case study of two
applications with low-cost heuristics for autotuning: loop scheduling, and SpMV.
For the first case study, we note that many large-scale scientific applications depend on writing code in a fork-join manner that uses parallel-for loops. When the application is regular, i.e., the workload is evenly distributed among the iterations, a simple
static scheduling that evenly distributes iterations to threads may be enough to achieve
good performance. However, many application codes in HPC are highly irregular, i.e.,
the workload is not evenly distributed among iterations, and static scheduling methods
do not work well. As a result, a wide range of research has centered on constructing
self-scheduling methods that help to better balance the workload among threads when
assigned. Most of these self-scheduling methods require the user to have expert knowledge
of what the workload distribution is for the particular input and to fine-tune a parameter
called chunk size. Adding to the difficulty for the user is that no self-scheduling method

87

is good over a wide range of applications without this fine-tuning. We propose iCh as a
performant loop scheduler that does not require expert tuning by the high-performance
engineer. It provides an adaptive chunk size based on simple, cheap-to-compute ratios
and achieves performance near expertly tuned schedulers. Finally, it is not very sensitive
to its one tuning parameter, ϵ, and we provide guidance for tuning this parameter based
on the workload.
For the second case study, we observe that an efficient heterogeneous format for
SpMV is required by modern high-performance computing. This format needs to be easy
to understand, easy to tune, and provide portable performance across devices without
using too much additional memory overhead or time to tune. We propose CSR-k as a
performant sparse matrix format which, like iCh, does not require expert tuning by the
high-performance engineer. We provide an O(1) tuning heuristic for tuning super-row
sizes of CSR-k on GPU and demonstrate that good performance gains can be realized
with this cheap autotuning.
Between these two case studies, we have proposed alternatives to modern heavyweight, time-consuming tuning. These alternatives provide lightweight, low-cost autotuning which is able to meet or exceed the performance of popular tuned approaches.
We hope that these case studies demonstrate that expensive tuning is not necessary to
achieve good performance for common scientific codes.

88

REFERENCES

[1] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on
Applied Mathematics, 17(2):416–429, 1969.
[2] Thomas Cormen, Charles Leinerson, Ronald Rivest, and Clifford Stein. Introduction
to Algorithms, Third Edition. The MIT Press, 2009.
[3] Pedro Henrique Penna, Antônio Tadeu A. Gomes, Márcio Castro, Patricia
D.M. Plentz, Henrique C. Freitas, François Broquedis, and Jean-François Méhaut. A
comprehensive performance evaluation of the binlpt workload-aware loop scheduler.
Concurrency and Computation: Practice and Experience, 31(18):e5170, 2019. e5170
cpe.5170.
[4] Joshua Dennis Booth and Phillip Allen Lane. An adaptive self-scheduling loop
scheduler. Concurrency and Computation: Practice and Experience, page e6750.
[5] Leonardo Dagum and Ramesh Menon. Openmp: an industry standard api
for shared-memory programming. Computational Science & Engineering, IEEE,
5(1):46–55, 1998.
[6] Humayun Kabir, Joshua Dennis Booth, and Padma Raghavan. A multilevel compressed sparse row format for efficient sparse computations on multicore processors.
In 2014 21st International Conference on High Performance Computing (HiPC),
pages 1–10, 2014.
[7] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation
of the cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998
conference on Programming language design and implementation - PLDI '98. ACM,
ACM Press, 1998.
[8] C. D. Polychronopoulos and D. J. Kuck. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Transactions on Computers, C36(12):1425–1439, 1987.
[9] A. Kejariwal, A. Nicolau, and C. D. Polychronopoulos. History-aware selfscheduling. In 2006 International Conference on Parallel Processing (ICPP’06),
pages 185–192. ACM, 2006.
[10] Kazushige Goto and Robert A. van de Geijn. Anatomy of high-performance matrix
multiplication. ACM Trans. Math. Softw., 34(3), may 2008.
89

[11] Kent Milfeld. Past project: Gotoblas 2.
[12] Yue Zhao, Jiajia Li, Chunhua Liao, and Xipeng Shen. Bridging the gap between
deep learning and sparse matrix format selection. SIGPLAN Not., 53(1):94–108, feb
2018.
[13] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful
visual performance model for multicore architectures. Commun. ACM, 52(4):65–76,
apr 2009.
[14] Michael Frasca, Kamesh Madduri, and Padma Raghavan. NUMA-aware graph mining techniques for performance and energy efficiency. In 2012 International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE,
IEEE, nov 2012.
[15] Eduardo F. D’Azevedo, Mark R. Fahey, and Richard T. Mills. Vectorized sparse
matrix multiply for compressed row storage format. In Proceedings of the 5th International Conference on Computational Science - Volume Part I, ICCS’05, pages
99–106, Berlin, Heidelberg, 2005. Springer, Springer-Verlag.
[16] Humayun Kabir, Joshua Dennis Booth, and Padma Raghavan. A multilevel compressed sparse row format for efficient sparse computations on multicore processors.
In 2014 21st International Conference on High Performance Computing (HiPC).
IEEE, IEEE, dec 2014.
[17] Ioana Banicescu, Vijay Velusamy, and Johnny Devaprasad. On the scalability of
dynamic scheduling scientific applications with adaptive weighted factoring. Cluster
Computing, 6(3):215–226, 2003.
[18] Dirk Schmidl, Peter Philippen, Daniel Lorenz, Christian Rössel, Markus Geimer,
Dieter an Mey, Bernd Mohr, and Felix Wolf. Performance analysis techniques for
task-based openmp applications. In Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World, IWOMP’12, pages 196–209, Berlin,
Heidelberg, 2012. Springer, Springer-Verlag.
[19] Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S.
Müller. Assessing the performance of openmp programs on the intel xeon phi. In Proceedings of the 19th International Conference on Parallel Processing, Euro-Par’13,
pages 547–558, Berlin, Heidelberg, 2013. Springer, Springer-Verlag.
[20] Michael A. Heroux, Padma Raghavan, and Horst D. Simon. Parallel Processing for
Scientific Computing. Society for Industrial and Applied Mathematics, 2006.
[21] Richard Vuduc, James Demmel, and Katherine Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics Conference Series, 16:521–530,
01 2005.

90

[22] Weifeng Liu and Brian Vinter. CSR5: An efficient storage format for cross-platform
sparse matrix-vector multiplication. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, page 339–350, New York, NY, USA,
2015. Association for Computing Machinery.
[23] Marie Durand, François Broquedis, Thierry Gautier, and Bruno Raffin. An efficient
OpenMP loop scheduler for irregular applications on large-scale NUMA machines. In
OpenMP in the Era of Low Power Devices and Accelerators, pages 141–155. Springer
Berlin Heidelberg, 2013.
[24] Srikant Subramaniam and Derek L. Eager. Affinity scheduling of unbalanced workloads. In Proceedings of the 1994 ACM/IEEE conference on Supercomputing - Supercomputing '94. ACM, ACM Press, 1994.
[25] Z. Fang, P. Tang, P. . Yew, and C. . Zhu. Dynamic processor self-scheduling for
general parallel nested loops. IEEE Transactions on Computers, 39(7):919–929,
1990.
[26] Jack Dongarra, Michael A. Heroux, and Piotr Luszczek. Hpcg benchmark: a new
metric for ranking high performance computing systems. Technical Report ut-eecs15-736, University of Tennessee, University of Tennessee, 2015-01 2015.
[27] Joshua Dennis Booth and Gregory Bolet. An on-node scalable sparse incomplete
LU factorization for a many-core iterative solver with javelin. Parallel Comput.,
94-95:102622, 2020.
[28] R W Vuduc and H Moon. Fast sparse matrix-vector multiplication by exploiting
variable block structure. Technical report, OSTI, address, jul 2005.
[29] S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6):711–725, nov 1997.
[30] Ali Pinar and Michael T. Heath. Improving performance of sparse matrix-vector
multiplication. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing
(CDROM) - Supercomputing '99. ACM, ACM Press, 1999.
[31] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices.
In Proceedings of the 1969 24th National Conference, ACM ’69, page 157–172, New
York, NY, USA, 1969. Association for Computing Machinery.
[32] Joshua Dennis Booth, Nathan D. Ellingwood, Heidi K. Thornquist, and
Sivasankaran Rajamanickam. Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts. Parallel Comput., 68:17–31, 2017.
[33] Guillaume Aupy, Anne Benoit, Sicheng Dai, Loı̈c Pottier, Padma Raghavan, Yves
Robert, and Manu Shantharam. Co-scheduling amdahl applications on cachepartitioned systems. The International Journal of High Performance Computing
Applications, 32(1):123–138, 2018.

91

[34] Yong Yan, Canming Jin, and Xiaodong Zhang. Adaptively scheduling parallel loops
in distributed shared-memory systems. IEEE Transactions on Parallel and Distributed Systems, 8(1):70–81, 1997.
[35] Yizhuo Wang, Weixing Ji, Feng Shi, Qi Zuo, and Ning Deng. Knowledge-based
adaptive self-scheduling. In James J. Park, Albert Y. Zomaya, Sang-Soo Yeo, and
Sartaj Sahni, editors, Network and Parallel Computing, 9th IFIP International Conference, NPC 2012, Gwangju, Korea, September 6-8, 2012. Proceedings, volume 7513
of Lecture Notes in Computer Science, pages 22–32. Springer, Springer, 2012.
[36] Joshua Dennis Booth, Jagadish Kotra, Hui Zhao, Mahmut Kandemir, and Padma
Raghavan. Phase detection with hidden markov models for DVFS on many-core
processors. In 2015 IEEE 35th International Conference on Distributed Computing
Systems. IEEE, IEEE, jun 2015.
[37] B. P. Welford. Note on a method for calculating corrected sums of squares and
products. Technometrics, 4:419–420, 1962.
[38] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron.
Rodinia: A benchmark suite for heterogeneous computing. In NA, editor, 2009
IEEE International Symposium on Workload Characterization IISWC, volume NA
of No Series, pages 44–54. IEEE, IEEE, 2009.
[39] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection.
ACM TOM, 38(1):1:1–1:25, 2011.
[40] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood,
S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr.
XSEDE: Accelerating scientific discovery. Computing in Science & Engineering,
16(5):62–74, Sept.-Oct. 2014.
[41] István R eguly and Mike Giles. Efficient sparse matrix-vector multiplication on
cache-based gpus. In 2012 Innovative Parallel Computing (InPar), pages 1–12, 2012.
[42] Shizhao Chen, Jianbin Fang, Donglin Chen, Chuanfu Xu, and Zheng Wang. Adaptive optimization of sparse matrix-vector multiplication on emerging many-core architectures. In 2018 IEEE 20th International Conference on High Performance
Computing and Communications; IEEE 16th International Conference on Smart
City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 649–658, 2018.
[43] JaeHyuk Kwack, John Tramm, Colleen Bertoni, Yasaman Ghadar, Brian Homerding, Esteban Rangel, Christopher Knight, and Scott Parker. Evaluation of performance portability of applications and mini-apps across amd, intel and nvidia gpus.
In 2021 International Workshop on Performance, Portability and Productivity in
HPC (P3HPC), pages 45–56, 2021.

92

[44] Orhan Kislal, Wei Ding, Mahmut Kandemir, and Ilteris Demirkiran. Optimizing
sparse matrix vector multiplication on emerging multicores. In 2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS), pages
1–10, 2013.
[45] Alan George and Joseph W. Liu. Computer Solution of Large Sparse Positive Definite. Prentice Hall Professional Technical Reference, 1981.
[46] Youcef Saad. SPARSKIT: a basic tool kit for sparse matrix computations - version
2, 1994.
[47] pOSKI: Parallel optmizaed sparse kernel interface.
[48] José I. Aliaga, Hartwig Anzt, Thomas Grützmacher, Enrique S. Quintana-Ortı́, and
Andrés E. Tomás. Compression and load balancing for efficient sparse matrix-vector
product on multicore processors and graphics processing units. Concurrency and
Computation: Practice and Experience, page e6515.
[49] H. Carter Edwards and Christian R. Trott. Kokkos: Enabling performance portability across manycore architectures. In 2013 Extreme Scaling Workshop (xsw 2013),
pages 18–24, 2013.
[50] Richard Wilson Vuduc and James W. Demmel. Automatic Performance Tuning of
Sparse Matrix Kernels. PhD thesis, 2003. AAI3121741.
[51] Ryan Eberhardt and Mark Hoemmen. Optimization of block sparse matrix-vector
multiplication on shared-memory parallel architectures. In 2016 IEEE International
Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 663–
672, 2016.
[52] Richard W. Vuduc and Hyun-Jin Moon. Fast sparse matrix-vector multiplication
by exploiting variable block structure. In Proceedings of the First International
Conference on High Performance Computing and Communications, HPCC’05, page
807–816, Berlin, Heidelberg, 2005. Springer-Verlag.
[53] The effect of ordering on preconditioned conjugate gradients. BIT Numerical Mathematics, 29:635–657, 1989.
[54] Stephen T. Barnard, Alex Pothen, and Horst Simon. A spectral algorithm for
envelope reduction of sparse matrices. Numerical Linear Algebra with Applications,
2(4):317–334, 1995.
[55] Humayun Kabir, Joshua Dennis Booth, Guillaume Aupy, Anne Benoit, Yves Robert,
and Padma Raghavan. STS-k: A multilevel sparse triangular solution scheme for
numa multicores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, New York, NY, USA,
2015. Association for Computing Machinery.
[56] David Kincaid, Thomas Oppe, and David Young. Itpackv 2d user’s guide.
93

[57] F. Vázquez, G. Ortega, J.J. Fernández, and E.M. Garzón. Improving the performance of the sparse matrix vector product with gpus. In 2010 10th IEEE International Conference on Computer and Information Technology, pages 1146–1151,
2010.
[58] Marco Maggioni and Tanya Berger-Wolf. Adell: An adaptive warp-balancing ell
format for efficient sparse matrix-vector multiplication on gpus. In 2013 42nd International Conference on Parallel Processing, pages 11–20, 2013.
[59] Naser Sedaghati, Te Mu, Louis-Noel Pouchet, Srinivasan Parthasarathy, and P. Sadayappan. Automatic selection of sparse matrix representation on gpus. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15,
page 99–108, New York, NY, USA, 2015. Association for Computing Machinery.
[60] Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. Smat: An input adaptive auto-tuner for sparse matrix-vector multiplication. SIGPLAN Not.,
48(6):117–126, jun 2013.
[61] Timothy A. Davis and Yifan Hu. The university of florida sparse matrix collection.
ACM Trans. Math. Softw., 38(1), dec 2011.
[62] S. Carr, Chen Ding, and P. Sweany. Improving software pipelining with unroll-andjam. In Proceedings of HICSS-29: 29th Hawaii International Conference on System
Sciences, volume 1, pages 183–192 vol.1, 1996.

94

