Data parallel algebraic multigrid by Dalton, Steven
c© 2014 by Steven T Dalton. All rights reserved.
DATA PARALLEL ALGEBRAIC MULTIGRID
BY
STEVEN T DALTON
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2014
Urbana, Illinois
Doctoral Committee:
Professor Luke Olson, Chair
Professor William Gropp
Professor Michael Heath
Dr. Jed Brown, Argonne National Laboratory
Abstract
Algebraic multigrid methods for large, sparse linear systems are central to many computational simula-
tions. Parallel algorithms for such solvers are generally decomposed into coarse-grain tasks suitable for
distributed computers with traditional processing cores. Accelerating multigrid methods on massively par-
allel throughput-oriented processors, on the other hand, demands algorithms with abundant fine-grained
parallelism. In this dissertation we analyze and decompose the smoothed aggregation algebraic multigrid
method into parallel primitives effectively mapped to emerging architectures. Our formulation is performed
in two phases, the first building on high-level generic abstractions to construct our solver in a architecture
agnostic manner. Though effective we show that performance is severely limited by irregular sparse ma-
trix operations, most notably sparse matrix-matrix multiplication. In the second phase, we address this
performance bottleneck using novel techniques to optimize irregular sparse matrix operations.
This dissertation is also concerned with irregular graph operations necessary to partition sparse matrices
into disjoint sets for parallel processing. We apply our solver to accelerate eigenvector computations necessary
during spectral partitioning methods and find that performance is limited by multigrid construction. We
propose and analyze a novel strategy to accelerate graph Laplacian eigenvector computations by combining
iterative methods, namely blocked Lanczos, with breadth first search operations on graphs. By basing our
partitioner on primitives with substantial parallelism we demonstrate notable performance improvement
compared with generic eigensolvers.
ii
To my loving wife, Kymberly, for all of her support and encouragement throughout this entire process and
my precious son Colin, my future children, grandchildren and great-grandchildren, I dedicate this thesis to
you. Always remember you can do anything you put your mind to. Be great!
iii
Acknowledgments
I am grateful for all the friends and family members who have influenced, encouraged, and supported me
throughout my entire college career as an undergraduate and graduate student.
• To my advisor, Professor Luke Olson, who introduced me to algebraic multigrid, encouraged my
curiosity, and continually provided constructive feedback on my progress throughout this long journey
from start to finish for five years, thank you so much for you support and guidance. I know I did
not make this process particularly easy at times! This dissertation would not have been possible with
anyone else.
• My unofficial co-advisor Nathan Bell, who taught me a great deal on algebraic multigrid, writing
programs, writing papers, and many other areas. It is impossible to express how much of an impact
working with you had on my graduate experience and my life.
• Michael Heath for encouraging me to attend UIUC during visit day and for serving as my mentor for
my first year of graduate school. I thank you for listening to me and giving me sound advice when all
I wanted to do was quit.
• Michael Garland for allowing me to spend two productive summers working in his research group
at Nvidia and supporting my education with a fellowship. I gained significantly by interacting with
members of your group, Duane Merrill, Sean Baxter, Bryan Catanzaro, and Jared Hoberock.
• My friends in scientific computing group – Elena Caraba, Jehanzeb Chaudhry, James Lai, Russ Hewett,
Amanda Bienz, Hormozd Gahvari, Mitya Yershov, Matt Wrobel, and Matthew Michelotti - for sharing
the graduate school experience with me. I would especially like to thank my qualifying exam study
buddy Kaushik Kalyanaraman for spending weeks with me discussing numerical methods in painstaking
detail.
• My undergraduate advisor Professor Umakishore Ramachandran at Georgia Tech for nurturing my
ideas and allowing me to feel like a contributing member of his research group. Professor George Riley
iv
for exposing me to my first real research project that set the course for my career. David Smith for
supporting my undergraduate education with endless teaching assistantships and witty conversation.
• My parents for all of their support both mentally and financially. There is no possible way I would
have made it without the both of you. My brother for being my first research collaborator and mentor
as we discovered everything we could about computers together as kids.
• Last but not least my loving wife Kym whose constant love and support throughout this process has
been the driving force towards the completion of my graduate studies and this dissertation. My son,
Colin, thank you for supplying me with daily inspiration and staying up late to help me write large
parts of this dissertation, although I had to remove many of your contributions, since you are still
learning to read and write, I will always cherish your enthusiasm and creativity.
v
Table of Contents
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2 Algebraic Multigrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Components of Algebraic Multigrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Parallel Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Parallel Multigrid Cycling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 3 Sparse Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 ESC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6 Contraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Chapter 4 Stable Sparse Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1 Sparse Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 SpMV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Chapter 5 Graph Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Spectral Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 Initial Guess . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4 SpMV Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Chapter 6 Summary and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Appendix A Cusp : A C++ Templated Sparse Matrix and Graph Library . . . . . . . . . 100
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
vi
Chapter 1
Introduction
Computational science utilizes computers and mathematics to enhance scientific discovery. This interplay
has led to accelerated advancements in a variety of disciplines throughout science. Contributing to this,
numerical simulations that were beyond the reach of supercomputers 30 years ago are routinely performed
on a single desktop today. This rapid growth in computational power is due to considerable advancements in
both computer engineering and algorithmic efficiency. The dependency of the latter property on the former
means that architectural shifts necessarily influence the evolution of algorithms and vice versa. In this work
we consider specifically numerical algorithms used to solve large systems of linear equations and highlight
that emerging highly parallel architectures require the reformulation of computationally optimal numerical
solvers.
The motivation for optimal numerical solvers is driven by the demand for increasingly accurate compu-
tational models to simulate real world processes. As the fidelity and complexity of these models increases
so to does the role of these models in science. Indeed, the ability to predict and analyze physical phe-
nomenon based solely on computational data has been solidified as the third modern pillar of the scientific
method. Though increased model detail yields more accurate simulations the increased complexity neces-
sitates additional degrees of freedom or variables, N , to fully describe the fine-grained processes governing
the corresponding system of linear equations, A ∈ RN×N . Therefore the cost of computing x given b in
Ax = b is heavily dependent on the selected solver and may be unreasonably expensive for direct methods,
such as Gaussian elimination which scale as O(N3) in both spatial and computational complexity, as shown
in Figure 1.1. The ideal complexity would depend only linearly on N , meaning the cost of computing the
solution is proportional to the number of unknowns, which is asymptotically optimal. In turn, the availabil-
ity of O(N) solvers motivates the development of more detailed models creating a circular dependence that
pushes the boundaries of both computing and algorithmic capabilities.
The overhead associated with direct methods, in terms of space and computation, is reduced significantly
by adopting a strategy based on iterative improvement. A well-known class of these iterative solvers is based
on Krylov subspaces and includes the conjugate gradient (CG) and generalized minimum residual (GMRES)
1
Increased
model
accuracy
Optimal
numerical
methods
(O(N))
Larger
sparse
system of
equations
(N)
Number of elements
C
om
p
u
ta
ti
on
al
w
or
k
p
er
va
ri
ab
le
O(N3) O(N2) O(N logN)
O(N1.5)
O(N1.2)
O(N)
Figure 1.1: Improved model accuracy increases the number of unknowns, N , in the resulting linear
system which in turn places additional pressure on numerical methods to minimize the costs, O(N), of
computing the solution. As N grows classical approaches that require quadratic or cubic work are no
longer feasible.
methods. Krylov methods begin with an initial subspace, usually a single vector, and proceed by repeatedly
minimizing some quantity-of-interest over a subspace. The total work required to compute the solution to
a prescribed level of accuracy therefore relies heavily on the ability of the method to adequately reduce
the error, the difference between the current approximation and true solution, during each iteration. More
precisely the convergence is often related to the spectral distribution of the eigenvalues and the modulus of
the largest and smallest generally grows with N . Operators with a large disparity in eigenvalues are termed
poorly conditioned and consequently require a large number of iterations to converge, which ultimately
reduces the algorithmic advantages of these methods.
To address this limitation iterative methods are coupled with preconditioners to improve the operator
spectrum and therefore the conditioning. Preconditioners based on multilevel methods are common since they
target computing the solution in optimal O(N) complexity. The central idea of multigrid is the hierarchical
representation and solution of the original problem at various granularities. Although numerous formulations
exist they are broadly grouped into two distinct families: geometric and algebraic. Geometric multigrid
(GMG) methods were the earliest approach and generate coarse representations by exploiting the geometric
regularity of structured problems. Although GMG is highly effective for many problems, the reliance on
structure and geometry limits the general applicability. To address this limitation algebraic multigrid (AMG)
methods reduce the dependence on problem geometry; in AMG coarse representations are determined by
applying a set of heuristics directly to the linear system. Decreased knowledge concerning the problem
instance places an increased burden on AMG methods in the form of a setup in order to construct a hierarchy
of levels and transfer operators between levels. Heuristics to guide construction of an algebraic hierarchy are
2
numerous but typically rely on a mixture of algebraic and graph theoretic operations to assess the salient
features of the underlying problem.
Although the existence of optimal solvers are interesting the true potential of any computational method
is only realized when algorithms are coupled with implementations that utilize a given architecture efficiently.
This algorithm-architecture coupling has been challenging since the evolution of modern numerical methods
has overlapped with dramatic architectural shifts in computer hardware stemming from performance limi-
tations of processor cores. The stagnation of single core performance at the end of twentieth century forced
the introduction of additional cores to increase performance ushering in a new era of parallel computing.
This performance oriented paradigm shift from vertical, increased clock rates, to horizontal, increased cores
per chip, has irrevocably impacted both the performance and development of numerical algorithms.
The departure from sequential processing introduces a significantly more complex set of design decisions
to facilitate parallel computing. As illustrated in Figure 1.2 multicore CPUs are the natural extension of
their unicore predecessors. A moderate increase in the number of cores per chip decreases the per core cache
memory size but the architecture design is fundamentally the same, specifically each core still processes a
separate stream of instructions on different data, multiple processor multiple data (MIMD). Though effective
for many workloads multicore CPUs do not fully exploit abundant fine-grained data-parallel aspects of some
computational workloads. For these workloads the number of operations per datum is small but the volume
of data is massive, a typical feature of workloads from graphics rendering, therefore graphics processing units
(GPUs) that were originally designed to support graphics rendering were repurposed to allow the execution
of generalized data-parallel workloads.
CR CR
CR CR
CR
Control Unit C
R
CR
CR
CR
CR
CR
CR
CR
CR
CR
CR
CR
CR
CR
CR
Increasing number of cores
Decreasing memory per core
Regs
+
Entire Cache Hierarchy
(L1 and L2)
Memory per core
Processor core
DRAM
Bandwidth
Regs
+
L1 Regs
Figure 1.2: Latency oriented architectures utilize large on-chip cache memory to reduce expensive
accesses to global DRAM. As the number of cores increases the dedicated private memory per core
decreases. Throughput oriented architectures substantially increase the number of cores at the expense
of a small amount of private memory per core.
3
In contrast to CPUs emerging throughput-oriented processors share a single instruction stream therefore
all operations are executed across multiple cores in a single instruction multiple data (SIMD) fashion. An-
other important distinction is that each core does not have dedicated cache to reduce the expense of slow
global memory accesses, instead the bandwidth to fetch data from global memory is considerably higher.
The local memory constraint severely limits the working set size of each core to the number of allocated
registers. The repercussion of these architecture differences is the need to redesign algorithms to express
substantial parallelism and cooperative processing at exceedingly small granularities, sometimes on the order
of individual datum. With respect to previous approaches to AMG the units of computational work are too
coarse-grained for direct implementation on GPUs.
In this dissertation we focus on the parallel decomposition of smoothed aggregation based AMG methods
to solve sparse linear systems, Ax = b, on GPU architectures. AMG is particularly challenging for modern
day architectures because hierarchy construction relies on analyzing data-structures with irregular and data-
driven access patterns. We analyze the performance of the operations during both the construction and
iterative cycling phases directly on the GPU. We primarily focus on GPUs for several reasons:
• Many of the operations involved in the setup phase are memory-bound therefore it may be beneficial
to take advantage of the much higher peak bandwidth capabilities of GPUs.
• Solve phase performance is dominated by sparse matrix-vector operations which are known to perform
better in bandwidth rich environments, such as GPUs. However, performing the setup phase on
the CPU followed by transferring data to GPU incurs a substantial performance penalty due to the
relatively low bandwidth PCI-express interconnect. By constructing the hierarchy completely on the
GPU memory copy operations are reduced significantly.
• The resource constraints of parallel processing on GPUs in many cases requires rethinking traditional
CPU algorithms. By studying the algorithmic requirements to perform all operations in a throughput-
oriented manner we design our methods to support massive fine-grained parallel processing which
follows the general trend in computer architectures.
AMG is particularly challenging for modern day architectures because hierarchical construction relies on
analyzing data-structures with irregular and data-driven access patterns. We analyze the performance of
performing all operations during both the construction and iterative cycling phases directly on the GPU.
Although the most common approach to constructing GPU algorithms begin by decomposing complex al-
gorithms into a small set of optimized programs, known as kernels, we propose a decomposition based on
parallel primitives that abstracts the architecture agnostic algorithmic specification from the architecture
4
specific implementation. The resulting formulation is architecture independent and, although we specifically
study the performance implications on GPUs, our formulation extends to any architecture exposing a small
set of optimized primitives. We study the performance limiting components of this initial decomposition
and subsequently introduce architecture specific optimizations to improve the performance of a small set of
key tasks, mainly sparse matrix-sparse matrix computations. The contributions of this dissertation are:
• Decomposition and analysis of AMG methods amenable for fine-grained execution on modern highly
parallel architectures;
• Strategies to improve the performance of irregular operations involving sparse matrices through regu-
larization of data-dependent irregular workloads;
• A systematic methodology to classify, compare, and analyze the multiplication of sparse matrices based
on histograms; and
• An efficient approach to partitioning sparse matrices based on coupling breadth-first search (BFS) type
methods with spectral partitioning.
In Chapter 2 we describe a parallel decomposition of AMG for GPU architectures using a small set of
parallel primitives. Chapter 3 presents an approach to accelerate the performance of sparse matrix operations
by combining analysis of the row-wise work of the output matrix with optimized hierarchical sorting routines
executing at various levels of parallelism. Chapter 4 introduces a structured scheme to discuss and analyze
the performance of row-wise sparse matrix multiplication based on the intermediate work generated. In
Chapter 5 we propose and analyze a novel strategy to accelerate spectral graph partitioning by coupling
well-understood operations on graphs, BFS with blocked Lanczos methods.
5
Chapter 2
Algebraic Multigrid
Multigrid methods precondition large, sparse linear systems of equations and in recent years have become
a robust approach for a wide range of problems. One reason for this increased utility is the trend toward
more algebraic approaches. In the classical, geometric form of multigrid, the performance relies largely on
specialized smoothers, and a hierarchy of grids and interpolation operators that are predefined through the
geometry and physics of the problem. In contrast, algebraic-based multigrid methods (AMG) attempt to
automatically construct a hierarchy of grids and intergrid transfer operators without explicit knowledge of
the underlying problem — i.e., directly from the linear system of equations [71, 82]. Figure 2.1 depicts the
contrasting nature of the two hierarchies generated by both approaches.
Ω3
Ω2
Ω1
Ω0
P2
P1
P0
R2
R1
R0
(a) Geometric
Ω3
Ω2
Ω1
Ω0
P2
P1
P0
R2
R1
R0
(b) Algebraic
Figure 2.1: Geometric methods construct all components of the multigrid hierarchy based on regularized
coarsening strategies. In contrast algebraic methods are able to operate on unstructured discretizations
by defining coarse levels algebraically based on matrix and graph operations.
Parallel approaches to multigrid are plentiful. Algebraic multigrid methods have been successfully par-
allelized on distributed-memory CPU clusters using MPI [20, 23] and more recently with a combination of
6
MPI and OpenMP [3], to better utilize multi-core CPU nodes. While such techniques have demonstrated
scalability to large numbers of processors, they are not immediately applicable to the GPU. In particular, ef-
fective use of GPUs requires substantial fine-grained parallelism at all stages of the computation. In contrast,
the parallelism exposed by existing methods for distributed-memory clusters of traditional cores is compa-
rably coarse-grained and cannot be scaled down to arbitrarily small subdomains. Indeed, coarse-grained
parallelization strategies are qualitatively different than fine-grained strategies.
For example, it is possible to construct a successful parallel coarse-grid selection algorithm by partitioning
a large problem into sub-domains and applying an effective, serial heuristic to select coarse-grid nodes on
the interior of each sub-domain, followed by a less-effective but parallel heuristic to the interfaces between
sub-domains [43]. An implicit assumption in this strategy is that the interiors of the partitions (collectively)
contain the vast majority of the entire domain, otherwise the serial heuristic has little impact on the output.
Although this method is scalable to arbitrarily fine-grained parallelism in principle, the result is qualitatively
different. In contrast, the methods we develop do not rely on partitioning and expose parallelism to the
finest granularity — i.e., one thread per matrix row or one thread per nonzero entry.
Geometric multigrid methods were the first to be parallelized on GPUs [13, 38, 79]. These “GPGPU” ap-
proaches, which preceded the introduction of the CUDA and OpenCL programming interfaces, programmed
the GPU through existing graphics application programming interfaces (APIs) such as OpenGL and Di-
rect3d. Subsequent works demonstrated GPU-accelerated geometric multigrid for image manipulation [49]
and CFD [25] problems. Previous works have implemented the cycling stage of algebraic multigrid on
GPUs [36, 41], however hierarchy construction remained on the CPU. A parallel aggregation scheme is
described in [80] that is similar to ours based on maximal independent sets, while in [1] the effectiveness
of parallel smoothers based on sparse matrix-vector products is demonstrated. Although these works were
implemented for distributed CPU clusters, they are amenable to fine-grained parallelism as well.
In the remainder of this chapter, we first outline the basic components of AMG in an aggregation con-
text [82] and highlight the necessary sparse matrix computations used in the process then we present a
data-parallel decomposition of the entire construction amenable to fine-grained parallelism necessary for
execution on emerging throughput-oriented processors. We restrict our algorithmic attention to aggregation
methods because of the flexibility in the construction, however our development also extends to classical
AMG methods based on coarse-fine splittings [71]. Furthermore, we demonstrate the effectiveness of our
decomposition on GPUs because of their abundant processing capabilities in terms of peak floating-point op-
erations and bandwidth and their strict reliance on cooperative fine-grained parallelism to achieve adequate
performance.
7
2.1 GPU Architecture
The emergence of “massively parallel” many-core processors has inspired interest in algorithms with abundant
fine-grained parallelism. Modern GPUs are representative of a class of processors that seek to maximize the
performance of massively data parallel workloads by employing hundreds of hardware-scheduled threads
organized into tens of processor cores. Such architectures are considered throughput-oriented in contrast to
traditional, latency-oriented, CPU cores which seek to minimize the time to complete individual tasks [33].
Commensurate with this throughput, GPUs exhibit the highest utilization when the computation is regularly-
structured and evenly distributed. While such architectures offer higher absolute performance, in terms of
theoretical peak FLOPs and bandwidth, than contemporary (latency-oriented) CPUs, existing algorithms
need to be reformulated to make effective use of the GPU [60, 61, 83].
GPUs are organized into tens of multiprocessors, each of which is capable of executing hundreds of
hardware-scheduled threads as depicted in Figure 2.2. Warps of threads represent the finest granularity of
scheduled computational units on each multiprocessor with the number of threads per warp defined by the
underlying hardware. Execution across a warp of threads follows a data parallel SIMD (single instruction,
multiple data) model and performance penalties occur when this model is violated as happens when threads
within a warp follow separate streams of execution — i.e., divergence — or when atomic operations are
executed in order — i.e., serialization. Warps within each multiprocessor are grouped into a hierarchy of
fixed-size execution units known as blocks or cooperative thread arrays (CTAs); intra-CTA computation and
communication may be routed through a shared memory region accessible by all threads within the CTA.
At the next level in the hierarchy CTAs are grouped into grids and grids are launched by a host thread with
instructions encapsulated in a specialized GPU programming construct known as a kernel.
DRAM(2-6 GBs) 
SM SM SM SM SM 
SM SM SM SM SM 
Memory 
Subsystem 
L2 Cache (768 KB) 
L1 Cache 
(48/16 KB) 
Shared Memory 
(16/48 KB) 
Register File 
(128KB) 
Per SM 
Figure 2.2: Depiction of the GPU memory hierarchy.
For computations over graphs or sparse matrix operations, which are typically bandwidth-bound, GPUs
8
represent an attractive architecture because of their potential peak bandwidth, roughly six times more than
traditional CPUs [10]. The caveat, however, is that sparse matrix computations are data-dependent and
therefore susceptible to highly irregular memory access patterns and substantial work imbalances.
2.2 Components of Algebraic Multigrid
The performance of AMG relies on a compatible collection of relaxation operators, coarse grid operators,
and interpolation operators as well as the efficient construction of these operations. In this section we outline
the components of aggregation-based AMG that we consider for construction on the GPU.
Aggregation-based AMG requires a a priori knowledge or prediction of the near-nullspace that represent
the low-energy error. For an n×n symmetric, positive-definite matrix problem Ax = b, these m modes
are denoted by the n×m column matrix B. Generally, the number of near-nullspace modes, m, is a small,
problem-dependent constant. For example, the scalar Poisson problem requires only a single near-nullspace
mode while 6 rigid body modes are needed to solve three-dimensional elasticity problems. We also denote
the n×n problem as the fine level and label the indices Ω0 = {0, . . . , n − 1} as the fine grid. From A, b,
and B, the components of the solver are defined through a setup phase, and include grids Ωk, interpolation
operators Pk, restriction operators Rk, relaxation error propagation operators, and coarse representations of
the matrix operator Ak, all for each level k. We denote index M as the maximum level — e.g., M = 1 is a
two-grid method.
We follow a setup phase that is outlined in Algorithm 1. The decomposition of the graph operations in
a manner amenable to fine-grained parallelism is challenging. The setup assumes input of a sparse matrix,
A, and user-supplied vectors, B, which may represent low eigenmodes of the problem; here, we consider the
constant vector, a common default for input. The following sections detail each of the successive routines in
the setup phase: strength, aggregate, tentative, prolongator, and the triple matrix Galerkin product
described by Lines 1-6.
2.2.1 Strength-of-connection
A vertex i of the fine matrix graph that strongly influences or strongly depends on a neighboring vertex
j typically has a large, relative edge weight. As a result, the traditional approach for aggregation schemes is
9
Algorithm 1: AMG Setup: setup
parameters: A, B
return: A0, . . . , AM , P0, . . . , PM−1
A0 ← A, B0 ← B
for k = 0, . . . ,M
1 Ck ← strength(Ak) {strength-of-connection}
2 Nk ← aggregate(Ck) {construct coarse aggregates}
3 Tk, Bk+1 ← tentative(Nk, Bk) {form tentative interpolation}
4 Pk ← prolongator(Ak, Tk) {improve interpolation}
5 Rk ← PTk {transpose}
6 Ak+1 ← RkAkPk {coarse matrix, triple-matrix product}
to identify two points i and j as strongly connected if they satisfy
|A(i, j)| > θ
√
|A(i, i)A(j, j)|. (2.1)
This yields a connectivity graph represented by sparse matrix Ck. Algorithm 2 describes the complete
strength-of-connection algorithm for the COO matrix format. A parallel implementation of this algorithm
is discussed in Section 2.3.1.
Algorithm 2: Strength of connection: strength
parameters: Ak ≡ (I, J, V ), COO sparse matrix
return: Ck ≡ (Iˆ , Jˆ , Vˆ ), COO sparse matrix
M = {0, . . . , nnz(A)− 1}
D ← 0
1 for n ∈M {extract diagonal}
if In = Jn
D(In)← Vn
2 for n ∈M {check strength}
if |Vn| > θ
√|D(In)| · |D(Jn)|
(Iˆnˆ, Jˆnˆ, Vˆnˆ)← (In, Jn, Vn)
2.2.2 Aggregation
An aggregate or grouping of nodes is defined by a root node i and its neighborhood — i.e., all points j,
for which C(i, j) 6= 0, where C is a strength matrix. The standard (serial) aggregation procedure consists of
two phases:
10
1. For each node i, if i and each of its strongly connected neighbors are not yet aggregated, then form a
new aggregate consisting of i and its neighbors.
2. For each remaining unaggregated node i, sweep i into an adjacent aggregate.
The first phase of the algorithm visits each node and attempts to create disjoint aggregates from the node
and its 1-ring neighbors. It is important to note that the first phase is a greedy approach and is therefore
sensitive to the order in which the nodes are visited. We revisit this artifact in Section 2.3.2, where we devise
a parallel aggregation scheme that mimics the standard sequential algorithm up to a reordering of the nodes.
Nodes that are not aggregated in the first phase are incorporated into an adjacent aggregate in the second
phase. By definition, each unaggregated node must have at least one aggregated neighbor (otherwise it could
be the root of a new aggregate) so all nodes are aggregated after the second phase. When an unaggregated
node is adjacent to two or more existing aggregates, an arbitrary choice is made. Alternatively, an aggregate
with the largest/smallest index or the aggregate with the fewest members, etc., selected. Figure 2.3 illustrates
a typical aggregation pattern for structured and unstructured meshes.
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
(a) Structured Mesh Aggregates
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1617
18
19
20
21
22
23
24
25
26 27
28
29
30
31
32
33
34
35
36
37
38
39
4041
42
43
44
45 46
(b) Unstructured Mesh Aggregates
Figure 2.3: Example of a mesh (gray) and aggregates (outlined in black). Nodes are labeled with the
order in which they are visited by the sequential aggregation algorithm and the root nodes, selected in
the first phase of the algorithm, are colored in gray. Nodes that are adjacent to a root node, such as
nodes 1 and 6 in 2.3a are aggregated in phase 1. Nodes that are not adjacent to a root node, such as
nodes 8, 16, and 34 in 2.3a are aggregated in second phase.
The aggregation process results in a sparse matrix N , which encodes the aggregates using the following
scheme,
N (i, j) =

1 if the ith node is contained in the jth aggregate
0 otherwise.
(2.2)
11
2.2.3 Tentative Prolongator
From an aggregation, N , and a set of coarse-grid candidate vectors, B, a tentative interpolation operator
is defined. The tentative interpolation operator or prolongator matrix T is constructed so that each row
corresponds to a grid point and each column corresponds to an aggregate. When there is only one candidate
vector, the sparsity pattern of the tentative prolongator is exactly the same as N .
B
(a) Initial graph and near nullspace B.
4
3
0
1
5
6
2
(b) Following the aggregation operation B is re-
stricted to being locally supported over each ag-
gregate.
4
4
4
4
4
4
4
4
3
3
3
3
3
3
3
3
3
3
3
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
5
5
5
5
5
5
5
5
5
5
5
5
5
5
6
6
6
6
6
6
6
6
6
2
2
2
2
2
2
(c) The locally restricted B is normalized over
each aggregate independently using QR.
Figure 2.4: Following the definition of the coarse nodes specified during aggregation the near nullspace
vector, B, is used to construct values of the tentative prolongator T .
Specifically, with the sparsity pattern induced by Nk, the numerical entries of Tk are defined by the
conditions,
Bk = TkBk+1, T
T
k Tk = I, (2.3)
which imply that (1) the near-nullspace candidates lie in the range of Tk, and (2) that columns of Tk are
orthonormal as illustrated in Figure 2.4. For instance, in an example with five fine-grid nodes (with two
12
coarse aggregates), we have matrices
Bk =

B1,1
B2,1
B3,1
B4,1
B5,1

, Tk =

B1,1/C1
B2,1/C1
B3,1/C2
B4,1/C2
B5,1/C2

, Bk+1 =
 C1
C2
 , (2.4)
that satisfy the interpolation (Bk = TkBk+1) and orthonormality conditions (T
T
k Tk = I), using the scaling
factors C1 = ||[B1,1, B2,1]|| and C2 = ||[B3,1, B4,1, B5,1]||.
Although we only consider the case of a single candidate vector, for completeness we note that when
Bk contains m > 1 low-energy vectors, the tentative prolongator takes on a block structure. An important
component of the setup phase is to express these operations efficiently on the GPU.
2.2.4 Prolongator Smoothing
The tentative prolongation operator is a direct attempt to enforce the range of interpolation to coincide
with the (user-provided) near null-space modes. This has limitations however, since the modes may not
accurately represent the “true” near null-space and since the interpolation is still only local, and thus
limited in accuracy. One approach to improving the properties of interpolation is to smooth the columns of
the tentative prolongation operator. With weighted-Jacobi smoothing, for example, the operation computes
a new sparse matrix P whose columns,
Pcolj = (I − ωD−1A)Tcolj , (2.5)
are the result of applying one-iteration of relaxation to each column of the tentative prolongator. In practice,
we compute all columns of P in a single operation using a specialized sparse matrix-matrix multiplication
algorithm. Figure 2.5 illustrates the impact of smoothing the tentative prolongator prior to performing the
Galerkin product. Smoothing the prolongator not only changes the values but also the number of nonzeros
per row on the coarse matrix.
13
Figure 2.5: Tentative set of direct coarse neighbor edges (black) are augmented with long range edges
(red) during the smoothing process. Adding additional edges increases the number of entries per row of
the accompanying coarse matrix.
2.2.5 Galerkin Product
The construction of the sparse Galerkin product Ak+1 = RkAkPk in Line 6 of Algorithm 1 is typically
implemented with two separate sparse matrix-matrix multiplies — i.e., (RA)P or R(AP ) as shown in
Figure 2.6. Either way, the first product is of the form [n×n] ∗ [n×nc] (or the transpose), while the second
product is of the form [nc × n] ∗ [n× nc].
P0Ω0
Ω1R0
R
0
×
Ω 0
×
P 0
Figure 2.6: The unstructured nature of P and A makes generation of the coarse operator challenging.
Efficient sequential sparse matrix-matrix multiplication algorithms are described in [6, 40], where the
Compressed Sparse Row (CSR) format is used, which provides O(1) indexing of the matrix rows. As a
14
result, the restriction matrix Rk = P
T
k is formed explicitly in CSR format before the Galerkin product is
computed. While the sequential algorithms for sparse matrix-matrix multiplication are efficient, they rely on
a large amount of (per thread) temporary storage, and are therefore not suitable for fine-grained parallelism.
Specifically, to compute the sparse product C = A ∗B, the sequential methods use O(N) additional storage,
where N is the number of columns in C. In contrast, our approach to sparse matrix-matrix multiplication,
detailed in Section 2.3.4, is formulated in terms of highly-scalable parallel primitives with no such limitations.
Indeed, our formulation exploits parallelism at the level of individual matrix entries.
2.2.6 Spectral radius
For the smoothers such as weighted Jacobi or Chebyshev, a calculation of the spectral radius of a matrix —
i.e., the eigenvalue of the matrix with maximum modulus — is often needed in order to yield effective
smoothing properties. These smoothers are central to the relaxation scheme in the cycling phase and the
prolongation in the setup phase, so we consider the computational impact of these calculations. Here, we
use an approximation of the spectral radius of D−1A where D is matrix of the diagonal of A.
To produce accurate estimates of the spectral radius we use an Arnoldi iteration which reduces A to
an upper Hessenberg matrix H by similarity transformations. The eigenvalues of the small fixed-size dense
matrix H are then computed directly. In Section 2.5, we see that computing the spectral radius has a
non-trivial role in the total cost of the setup phase.
2.2.7 Cycling Phase
The multigrid cycling or solve phase is detailed in Algorithm 3 and illustrated in Figure 2.7. Several
computations are executed at each level in the algorithm, but as we see in Lines 2-6, the operations are
largely sparse matrix-vector multiplications (SpMV). Consequently, on a per-level basis, we observe a strong
relationship between the performance of the SpMV and the computations in Lines 2-6. For example, the
smoothing sweeps on Lines 2 and 6 are both implemented as affine operations such as xk ← xk−ωD−1(Axk−
b), in the case of weighted Jacobi. This is a highly parallelized AXPY operation as well as a SpMV. This is
also the case for the residual on Line 3, and the restriction and interpolation operations on Lines 4 and 5.
Finally, we coarsen to only a few points so that the coarse solve on Line 1 is efficiently handled by either
relaxation or an arbitrary direct solver.
While the computations in the solve phase are straightforward at a high-level, they rely heavily on the
ability to perform unstructured SpMV operations efficiently. Indeed, even though the fine level matrix may
exhibit a well-defined structure, the unstructured nature of the aggregation process produces coarse-level
15
Solve
Presmooth
Restrict Prolongate
Postsmooth
Figure 2.7: Illustration of two-level V-cycle.
Algorithm 3: AMG Solve: solve
parameters: Ak, Rk, Pk, xk, bk
return: xk, solution vector
if k = M
1 solve Akxk = bk
else
2 xk ← presmooth(Ak, xk, bk, µ1) {smooth µ1 times on Akxk = bk}
3 rk ← bk −Akxk {compute residual}
4 rk+1 ← Rkrr {restrict residual}
ek+1 ← solve(Ak+1, Rk+1, Pk+1, ek+1, rk+1) {coarse-grid solve}
5 ek ← Pkek+1 {interpolate solution}
xk ← xk + ek {correct solution}
6 xk ← postsmooth(Ak, xk, bk, µ2) {smooth µ2 times on Akxk = bk}
matrices with less structure. In Section 2.5 we examine the cost of the solve phase in more detail.
2.2.8 Primitives
Our method for exposing fine-grained parallelism in AMG leverages data parallel primitives [12, 75]. We use
the term primitives to refer to a collection of fundamental algorithms that emerge in numerous contexts such
as reduction, parallel prefix-sum (or scan), and sorting. In short, these primitives are to general-purpose
computations what BLAS [54] is to computations in linear algebra. Like BLAS, the practice of distilling
complex algorithms into parallel primitives has several advantages versus authoring ad hoc codes:
productivity: programming with primitives requires less developer effort,
performance: primitives can be optimized for the target architecture independently,
portability: applications can be ported to a new architecture more easily.
Given the broad scope of their usage, special emphasis has been placed on the performance of primitives
and very highly-optimized implementations are readily available for the GPU [58, 61, 75]. By basing our
16
AMG solver on these primitives we were able to implement the necessary components exclusively with the
parallel primitives provided by the Thrust library [45]. This construction method is similar to the techniques
typically utilized in the construction of dense linear algebra routines which base a complex set of architecture
independent operations, such as those found in LAPACK, on a smaller set of reusable architecture dependent
primitives, such as BLAS. Decoupling the algorithm from the implementation allows our implementation to
achieve good performance while remaining flexible enough to execute on different architectures which expose
the Thrust primitives.
2.3 Parallel Construction
In this section we describe an implementation of the multigrid setup phase (cf. Algorithm 1) using parallel
primitives to expose the fine-grained parallelism required for GPU acceleration. Our primary contributions
are parallel algorithms for aggregation and sparse matrix-matrix multiplication.
The methods described in this section are designed for the coordinate (COO) sparse matrix format. The
COO format is comprised of three arrays I, J, and V, which store the row indices, column indices, and values,
respectively, of the matrix entries. We further assume that the COO entries are sorted by ascending row
index. Although the COO representation is not generally the most efficient sparse matrix storage format, it
is simple and easily manipulated.
2.3.1 Strength of Connection
The symmetric strength-of-connection algorithm (cf. Section 2.2.1) is straightforward to implement using
parallel primitives. Given a matrix A in coordinate format we first extract the matrix diagonal by comparing
the row index array, I, to the column index array, J,
D = [0, 0, 0, 0],
is diagonal = transform(I, J, equals),
= [true, false, false, true, false, false, true, false, false, true],
D = scatter if(V, I, is diagonal, D),
= [2, 2, 2, 2],
and scattering the corresponding entries in the values array, V, when they are equal.
The second loop in Algorithm 2 is implemented with stream compaction. First we obtain arrays containing
17
D[i] and D[j] for each index in I and J respectively,
Di = gather(I, D),
= [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
Dj = gather(J, D),
= [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
from which the strength-of-connection threshold — i.e., θ
√|D(In)| · |D(Jn)| in Algorithm 2 — of each entry
is computed:
threshold = transform(Di, Dj, soc threshold(theta)),
using a specially-defined functor soc threshold. Next, each entry in the values array, V, is tested against
the corresponding threshold to determine which entries are strong,
is strong = transform(V, threshold, greater),
= [true, true, true, true, true, true, true, true, true, true].
Finally, the matrix entries corresponding to strong connections are determined using stream compaction to
form a coordinate representation of the strength-of-connection matrix, namely
Ci = copy if(I, is strong, identity),
Cj = copy if(J, is strong, identity),
Cv = copy if(V, is strong, identity).
In practice we do not construct arrays such as Di and threshold explicitly in memory. Instead, the
gather and transform operations are fused with subsequent algorithms to conserve memory bandwidth
using Thrust’s permutation iterator and transform iterator. Similarly, the three calls to copy if are
combined into a single stream compaction operation using a zip iterator. The translation from the explicit
version of the algorithm (shown here) and the more efficient, fused version (not shown) is only a mechanical
transformation.
18
2.3.2 Aggregation
The sequential aggregation method (cf. Section 2.2.2) is designed to produce aggregates with a particular
structure. Unfortunately, the greedy nature of the algorithm introduces sequential dependencies that prevent
a direct parallelization of the algorithm’s first phase. In this section we describe a fine-grained parallel
analog of the sequential method based on a generalized maximal independent set algorithm which produces
aggregates with the same essential properties. A similar parallel aggregation strategy is described in [80],
albeit with a different implementation.
There are two observations regarding the aggregation depicted in Figure 2.3 that lead to the description of
our aggregation method. First, no two root nodes of the aggregates are within a distance of two edges apart.
Second, if any unaggregated node is separated from all existing root nodes by more than two edges then it is
free to become the root of a new aggregate. Together, these conditions define a collection of root nodes that
are a distance−2 maximal independent set, which we denote MIS(2) and formalize in Definition 2.3.1. The
first property ensures independence of the nodes — i.e., that the minimum distance between any pair of root
nodes exceeds a given threshold. The second property ensures maximality — i.e., no additional node can
be added to the set without violating the property of independence. The standard definition of a maximal
independent set, which we denote MIS(1), is consistent with this definition except with a distance of one.
We defer a complete description of the generalized maximal independent set algorithm to the Appendix. For
the remainder of this section, we assume that a valid MIS(2) is computed efficiently in parallel.
Definition 2.3.1 (MIS(k)) Given a graph G = (V,E), let Vroot ⊂ V be a set of root nodes, and denote by
dG(·, ·) the distance or shortest path between two vertices in the graph. Then, Vroot is a maximal independent
set of distance k, or MIS(k) if the following hold:
1. (independence) Given any two vertices u, v ∈ Vroot, then dG(u, v) > k.
2. (maximality) There does not exist u ∈ V \ Vroot such that dG(u, v) > k for all v ∈ Vroot.
Given a valid MIS(2) the construction of aggregates is straightforward. Assume that the MIS(2) is
specified by an array of values in {0, 1}, where a value of 1 indicates that corresponding node is a member
of the independent set and value of 0 indicates otherwise. We first enumerate the MIS(2) nodes with an
exclusive scan operation, giving each set node a unique index in [0, N − 1], where N is the number of
nodes in the set. For example, on a linear graph with ten nodes, a MIS(2) with four set nodes,
mis = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
19
is enumerated as
enum = exclusive scan(mis, 0),
= [0, 1, 1, 1, 2, 2, 2, 3, 3, 3].
Since the i-th node in the independent set serves as the root of the i-th aggregate, the only remaining task
it so propagate the root indices outward.
The root indices are communicated to their neighbors with two operations resembling a sparse matrix-
vector multiply, y = Ax. Conceptually, each unaggregated node looks at neighbors for the index of an
aggregate. In the first step, all neighbors of a root node receive the root node’s aggregate index — i.e., the
value resulting from the exclusive scan operation. In the second step, the remaining unaggregated nodes
receive the aggregate indices of their neighbors, at least one of which must belong to an aggregate. As before,
in the presence of multiple neighboring aggregates a selection is made arbitrarily. The two sparse matrix-
vector operations are analogous to the first and second phases of the sequential algorithm respectively — cf.
Section 2.2.2. Our implementation of the parallel aggregate propagation step closely follows the existing
techniques for sparse matrix-vector multiplication [10].
Figure 2.8 depicts a MIS(2) for a regular mesh and the corresponding aggregates rooted at each inde-
pendent set node. Although the root nodes are selected by a randomized process (see the Appendix) the
resulting aggregates are qualitatively similar to those chosen by the sequential algorithm in Figure 2.3a.
Indeed, with the appropriate permutation of graph nodes, the sequential aggregation algorithm selects the
same root nodes as the randomized MIS(2) method.
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1617
18
19
20
21
22
23
24
25
26 27
28
29
30
31
32
33
34
35
36
37
38
39
4041
42
43
44
45
46
(a) MIS(2)
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1617
18
19
20
21
22
23
24
25
26 27
28
29
30
31
32
33
34
35
36
37
38
39
4041
42
43
44
45
46
(b) Aggregates
Figure 2.8: Parallel aggregation begins with a MIS(2) set of nodes, colored in gray, and represent the
root of an aggregate — e.g., node 18. As in the sequential method, nodes adjacent to a root node are
incorporated into the root node’s aggregate in the first phase. In the second phase, unaggregated nodes
join an adjacent aggregate — e.g., nodes 12, 19, and 24 for root node 18.
20
2.3.3 Prolongation and Restriction
The tentative prolongation, prolongation smoothing, and restriction construction steps of Algorithm 1
(Lines 3, 4, and 5), are also expressed with parallel primitives. The tentative prolongation operation is
constructed by gathering the appropriate entries from the near-nullspace vectors stored in B according to
the sparsity pattern defined by N . Then, the columns are normalized, first by transposing the matrix, which
has the effect of sorting the matrix entries by column index, and then computing the norm of each column
using the reduce by key algorithm. Specifically, the transpose of a coordinate format matrix such as
I =
[
0 1 2 3 4 5
]
,
J =
[
0 1 1 0 1 0
]
,
V =
[
0 0 1 1 2 2
]
,
is computed with by reordering the column indices of the matrix, and applying the same permutation to the
rows and values
TransI, Permutation = stable sort by key(J, [0, 1, 2, 3, 4])
=
[
0 0 0 1 1 1
]
,
[
0 3 5 1 2 4
]
,
TransJ = gather(Permutation, I)
=
[
0 3 5 1 2 4
]
,
TransV = gather(Permutation, V)
=
[
0 1 2 0 1 2
]
.
Then the squares of the transposed values array are calculated, followed by row sums,
Squares = tranform(TransV, TransV, multiplies)
=
[
0 1 4 0 1 4
]
,
Rows, Sums = reduce by key(TransI, Squares, multiplies)
=
[
0 1
]
,
[
9 9
]
,
21
which correspond to column sums of the original matrix.
Next, the final prolongator P is obtained by smoothing the columns of T according to the formula in
Section 2.2.4. Here, we apply a specialized form of the general sparse matrix-matrix multiplication scheme
described in Section 2.3.4. Specifically, we exploit the special structure of the tentative prolongator, whose
rows contain at most one nonzero entry, when computing the expression AT .
Finally, the transpose of the prolongation operator is calculated explicitly R = PT , and Galerkin triple-
matrix product Ak+1 = Rk(AkPk) is computed with two separate sparse matrix-matrix multiplies. As
mentioned above, the transpose operation is fast, particularly for the COO format.
2.3.4 Sparse Matrix-Matrix Multiplication
Efficiently computing the product C = AB of sparse matrices A and B is challenging, especially in parallel.
The central difficulty is that, for irregular matrices, the structure of the output matrix C has a complex and
unpredictable dependency on the input matrices A and B.
The sequential sparse matrix-matrix multiplication algorithm mentioned in Section 2.2.5 do not admit
immediate, fine-grained parallelization. Specifically, for matrices A and B of size [k×m] and [m×n] the
method requires O(n) temporary storage to determine the entries of each sparse row in the output. As a
result, a straightforward parallelization of the sequential scheme requires O(n) storage per thread, which is
untenable when seeking to develop tens of thousands of independent threads of execution. While it is possible
to construct variations of the sequential method with lower per-thread storage requirements, any method
that operates on the granularity of matrix rows — i.e., distributing matrix rows over threads — requires a
non-trivial amount of per-thread state and suffers load imbalances for certain input. As a result, we have
designed an algorithm for sparse matrix-matrix multiplication based on sorting, segmented reduction, and
other operations which are well-suited to fine-grained parallelism as discussed in Section 2.2.8.
As an example, we demonstrate our algorithm for computing C = AB, where
A =
 5 10 0
15 0 20
 and B =

25 0 30
0 35 40
45 0 50
 , (2.6)
22
have 4 and 6 nonzero entries respectively. The matrices are stored in coordinate format as
A =

(0, 0, 5)
(0, 1, 10)
(1, 0, 15)
(1, 2, 20)

and B =

(0, 0, 25)
(0, 2, 30)
(1, 1, 35)
(1, 2, 40)
(2, 0, 45)
(2, 2, 50)

, (2.7)
where each (i, j, v) tuple of elements represents the row index, column index, and value of the matrix entry.
We note that the (i, j, v) tuples are only a logical construction used to explain the algorithm. In practice
the coordinate format is stored in a “structure of arrays” layout with three separate arrays.
To expose fine-grained parallelism, our parallel sparse matrix-matrix multiplication algorithm proceeds
in three stages:
1. Expansion of AB into a intermediate coordinate format T
2. Sorting of T by row and column indices to form Tˆ
3. Compression of Tˆ by summing duplicate values for each matrix entry
In the first stage, T is formed by multiplying each entry A(i, j) of the left matrix with every entry in row j
of B, B(j, k) for all k. Here, the intermediate format is
T =

(0, 0, 125)
(0, 2, 150)
(0, 1, 350)
(0, 2, 400)
(1, 0, 375)
(1, 2, 450)
(1, 0, 900)
(1, 2, 1000)

, (2.8)
The expansion stage is implemented in parallel using gather, scatter, and prefix-sum operations.
The result of the expansion phase is an intermediate coordinate format with possible duplicate entries that
is sorted by row index, but not by column index. The second stage of the sparse matrix-matrix multiplication
algorithm sorts the entries of T into lexicographical order. For example, sorting the entries of T by the (i, j)
23
coordinate yields
Tˆ =

(0, 0, 125)
(0, 1, 350)
(0, 2, 150)
(0, 2, 400)
(1, 0, 375)
(1, 0, 900)
(1, 2, 450)
(1, 2, 1000)

, (2.9)
from which the duplicate entries are easily identified. The lexicographical reordering is efficiently imple-
mented with two invocations of Thrust’s stable sort by key algorithm.
The final stage of the algorithm compresses duplicate coordinate entries while summing their corre-
sponding values. Since Tˆ is sorted by row and column, the duplicate entries are stored contiguously and are
compressed with a single application of the reduce by key algorithm. In our example Tˆ becomes
C =

(0, 0, 125)
(0, 1, 350)
(0, 2, 550)
(1, 0, 1275)
(1, 2, 1450)

, (2.10)
where each of the duplicate entries have been combined. The compressed result is now a valid, duplicate-free
coordinate representation of the matrix
AB =
 125 350 550
1275 0 1450
 .
All three stages of our sparse matrix-matrix multiplication algorithm expose ample fine-grained paral-
lelism. Indeed, even modest problem sizes generate enough independent threads of execution to fully saturate
the GPU. Furthermore, because we have completely flattened the computation into efficient data-parallel
algorithms — i.e., gather, scatter, scan, stable sort by key, etc. — our implementation is insensitive to
the (inherent) irregularity of sparse matrix-matrix multiplication. As a result, even pathological inputs do
not create substantial imbalances in the work distribution among threads.
Another benefit is that our parallel sparse matrix-matrix algorithm has the same computational com-
24
plexity as the sequential method, which is O(nnz(T )), the number of entries in our intermediate format.
Therefore our method is “work-efficient” [12], since (1) the complexity of the sequential method is propor-
tional to the size of the intermediate format T and (2) the work involved at each stage of the algorithm
is linear in the size of T . The latter claim is valid since Thrust employs work-efficient implementations of
parallel primitives such as scan, reduce by key and stable sort by key.
One practical limitation of the method as described above is that the memory required to store the
intermediate format is potentially large. For instance, if A and B are both square, n×n matrices with
exactly K entries per row, then O(nK2) bytes of memory are needed to store T . Since the input matrices
are generally large themselves (O(nK) bytes) it is not always possible to store a K-times larger intermediate
result in memory. In the limit, if A and B are dense matrices (stored in sparse format) then O(n3) storage
is required. As a result, our implementation allocates a workspace of bounded size and decomposes the
matrix-matrix multiplication C = AB into several smaller operations of the form C(slice, :) = A(slice, :)B,
where slice is a subset of the rows of A. The final result is obtained by simply concatenating the coordinate
representations of all the partial results together C = [C(slice 0, :), C(slice 1, :), . . .]. In practice this sub-
slicing technique introduces little overhead because the workspace is still large enough to fully utilize the
device.
2.4 Parallel Multigrid Cycling
After the multigrid hierarchy has been constructed using the techniques in Section 2.3, the cycling of
Algorithm 3 proceeds. In this section we describe the components of the multigrid cycling and how they are
parallelized on the GPU.
2.4.1 Vector Operations
In Algorithm 3, the residual vector computation and the coarse grid correction steps require vector-
vector subtraction and addition respectively. While these operations could be fused with the preceding
sparse matrix-vector products for potential efficiency, or could be carried out with DAXPY in CUBLAS [26],
we have implemented equivalent routines with Thrust’s transform algorithm for simplicity. Similarly, the
vector norms (DNRM2) and inner products (DDOT) that arise in multigrid cycling have been implemented with
reduce and inner product in Thrust respectively.
25
2.4.2 Sparse Matrix-Vector Multiplication
Sparse matrix-vector multiplication (SpMV), which involves (potentially) irregular data structures and
memory access patterns, is more challenging to implement than the aforementioned vector operations. Never-
theless efficient techniques exist for matrices with a wide variety of sparsity patterns [7, 10, 13, 21, 30, 85, 86].
Our implementations of sparse matrix-vector multiplication are described in [10]. In Algorithm 3 sparse
matrix-vector multiplication is used to compute the residual, to restrict the fine-level residual to the coarse
grid, to interpolate the coarse-level solution onto the finer grid, and in many cases, to implement the pre-
and post-smoother.
In Section 2.3 we describe a method for constructing the AMG hierarchy in parallel using the coordinate
(COO) sparse matrix format. The COO format is simple to construct and manipulate, and therefore is well-
suited for the computations that arise in the setup phase. However, the COO format is generally not the most
efficient for the SpMV operations [10]. Fortunately, once the hierarchy is constructed it remains unchanged
during the cycling phase. As a result, it is beneficial to convert the sparse matrices stored throughout the
hierarchy to an alternative format that achieves higher SpMV performance.
Conversions between COO and other sparse matrix formats, such as CSR, DIA, ELL, and HYB [10],
are inexpensive, as shown in Table 2.1. Here we see that the conversion from COO to CSR is trivial, while
to more structured formats such as ELL and HYB is of minimal expense (note: the conversion to itself
represents a straight copy). When reporting performance figures in Section 2.5 we include the COO to HYB
conversion time in the setup phase.
From\To COO CSR ELL HYB
COO 5.09 6.35 20.10 23.55
CSR 7.80 4.03 21.61 24.84
ELL 17.02 18.26 5.90 22.32
HYB 63.27 69.40 83.08 4.12
Table 2.1: Sparse matrix conversion times (ms) for an unstructured mesh with 1M vertices and 8M
edges.
2.4.3 Smoothing
Gauss-Seidel relaxation is a popular multigrid smoother with several attractive properties. For instance,
the method requires only O(1) temporary storage and converges for any symmetric, positive-definite ma-
trix. Unfortunately, the standard Gauss-Seidel scheme does not admit an efficient parallel implementation.
26
Jacobi relaxation is a simple and highly parallel alternative to Gauss-Seidel, albeit without the same com-
putational properties. Whereas Gauss-Seidel updates each unknown immediately, Jacobi’s method updates
all unknowns in parallel, and therefore requires a temporary vector. Additionally, a weighting or damping
parameter ω must be incorporated into Jacobi’s method to ensure convergence. The weighted Jacobi method
is written in matrix form as, I − ωρ(D−1A)D−1A, where D is a matrix containing the diagonal elements of
A. Since the expression is simply a vector operation and a sparse matrix-vector multiply, Jacobi’s method
exposes substantial fine-grained parallelism.
We note that sparse matrix-vector multiplication is the main workhorse for several other highly-parallel
smoothers. Indeed, so-called polynomial smoothers
x← x+ P (A)r, (2.11)
where P (A) is a polynomial in matrix A, are almost entirely implemented with sparse matrix-vector products.
We refer to [1] for a comprehensive treatment of parallel smoothers and their associated trade-offs.
2.5 Experiments
In this section we examine the performance of a GPU implementation of the proposed method. We investigate
both the setup phase of Algorithm 1 and the solve phase of Algorithm 3, and find tangible speed-ups in each.
2.5.1 Test Platforms
The specifications of our test system are listed in Table 2.2. Our system is configured with CUDA v4.0 [66]
and Thrust v1.4 [45]. As discussed in Section 2.2.8, Thrust provides many highly optimized GPU parallel
algorithms for reduction, sorting, etc. The entire setup phase, and most of the cycling phase, of our GPU
method is implemented with Thrust. As a basis for comparison, we also consider the Intel Math Kernel
Testbed
GPU NVIDIA Tesla C2050
CPU Intel Core i7 950
CLOCK 3.07 GHz
OS Ubuntu 10.10
Host Compiler GCC 4.4.5
Device Compiler NVCC 4.0
Table 2.2: Specifications of the test platform.
Library (MKL) version 10.3. MKL provides sparse matrix-matrix multiplication as well as sparse BLAS
27
subroutines such as sparse matrix-vector multiplication.
The Trilinos Project provides a smoothed aggregation-based AMG preconditioner for solving large, sparse
linear systems in the ML package [34]. In our comparison, we use Trilinos version 10.6 and specifically ML
version 5.0 for the solver. The ML results are presented in order to provide context for the performance of
our solver in comparison to a well-known software package.
2.5.2 Distance-k Maximal Independent Sets
In this section we describe an efficient parallel algorithm for computing distance-k maximal independent
sets, denoted MIS(k) and defined in Definition 2.3.1. We begin with a discussion of the standard distance-
1 maximal independent set — i.e., MIS(1) — and then detail the generalization to arbitrary distances.
Our primary interest is in computing a MIS(2) to be used in the parallel aggregation scheme discussed in
Section 2.3.2.
Computing a distance-1 maximal independent set is straightforward in a serial setting, as shown by
Algorithm 4. The algorithm is greedy and iterates over nodes, labeling some as MIS nodes and their
neighbors as non-MIS nodes. Specifically, all nodes are initially candidates for membership in the maximal
independent set s and labeled (with value 0) as undecided. When a candidate node is encountered it is
labeled (with value 1) as a member of the MIS, and all candidate neighbors of the MIS node are labeled
(with value −1) as being removed from the MIS. Upon completion, the candidate set is empty and all nodes
are labeled with either a −1 or 1.
Algorithm 4: MIS serial
parameters: A, n×n sparse matrix
return: s, independent set
I = {0, . . . , n− 1} {initial candidate index set}
s← 0 {initialize to undecided}
for i ∈ I {for each candidate}
if si = 0 {if unmarked. . .}
si = 1 {add to candidate set}
for j such that Aij 6= 0
sj = −1 {remove neighbors from candidate set}
s = {i : si = 1} {return a list of MIS nodes}
Computing maximal independent sets in parallel is challenging, but several methods exist. With k = 1,
our parallel version in Algorithm 5 can be considered a variant of Luby’s method [56] which has been
employed in many codes such as ParMETIS [48]. A common characteristic of such schemes is the use of
randomization to select independent set nodes in parallel.
28
Algorithm 5: MIS parallel
parameters: A, n×n sparse matrix; k, edge distance
return: s, independent set
I = {0, . . . , n− 1}
s← 0 {initialize state as undecided}
v ←random {initialize value}
while {i ∈ I : si = 0} 6= ∅
for i ∈ I {for each node in parallel}
Ti ← (si, vi, i) {set tuple (state,value,index)}
1 for r = 1, . . . , k {propagate distance k}
for i ∈ I {for each node in parallel}
t← Ti
for j such that Aij 6= 0
t← max(t, Tj) {maximal tuple among neighbors}
Tˆi ← t
T = Tˆ
2 for i ∈ I {for each node in parallel}
(smax, vmax, imax)← Ti
if si = 0 {if unmarked. . .}
if imax = i {if maximalldots}
si ← 1 {add to set}
else if smax = 1 {otherwise. . .}
si ← −1 {remove from set}
s = {i : si = 1} {return a list of MIS nodes}
29
As with the serial method, all nodes are initially labeled (with a 0) as a candidate for membership in the
MIS s. Additionally, each node is assigned a random value in the array v. The purpose of the random value
is to create disorder in the selection of nodes, allowing many nodes to be added to the independent set at
once. Specifically, the random values represent the precedence in which nodes are considered for membership
in the independent set. In the serial method this precedence is implicit in the graph ordering. Figure 2.9
illustrates a two-dimensional graph with random values values drawn from integers in [0, n = 36).
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
(a) Natural enumeration.
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
(b) Random enumeration.
Figure 2.9: A structured graph with a natural order of nodes and a randomized enumeration.
The algorithm iterates until all nodes have been labeled with a −1 or 1, classifying them as a non-MIS
node or MIS node respectively. In each iteration, an array T of 3-tuples is created, tying together the node
state — i.e., −1, 0, or 1 — the node’s random value, and the linear index of the node. In a second phase,
the nodes compute, in parallel, the maximum tuple among the neighbors. Given two tuples ti = (si, vi, i)
and tj = (sj , vj , j) the maximum is determined by a lexicographical ordering of the tuples. This ordering
ensures that MIS nodes have a priority over candidate nodes and that candidate nodes have priority over
non-MIS nodes. In the third phase, the candidate node states are updated based on the results of the second
phase. Candidate nodes that are the local maximum — i.e., Imax = i — are added to the independent set
while those with an existing MIS neighbor — i.e., Imax 6= i and smax = 1 — are removed from candidacy.
Since it is impossible for two neighboring nodes to be local maxima, the selected nodes are independent
by construction. Furthermore, the correctness of the algorithm does not depend on the random values.
Indeed, if all the random values are 0, the algorithm degenerates into the serial algorithm since the third
component of the tuple, the node index, establishes precedence among neighbors with the same random
value. In each iteration of the algorithm at least one candidate node’s state is changed, so termination is
assured. Figure 2.10 illustrates the classification of nodes during six iterations of the parallel algorithm.
30
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
Figure 2.10: Parallel MIS construction for k = 1.
Note that the parallelism in Algorithm 5 fine-grained, exposing one thread of independent execution per
node within each parallel for loop. The first and third phases are simple elementwise vector operations which
are carried out with the transform algorithm in our implementation. The second stage, which propagates
the maximal tuples, is implemented with a generalization of sparse matrix-vector multiplication. In either
case there is an implicit global synchronization among threads upon exiting the parallel for loop over nodes.
Generalizing the parallel distance-1 maximal independent set algorithm to compute MIS(k) for arbitrary
k can be accomplished in several ways. One solution is to compute the k-th power of A as an explicit matrix
AK = A ∗ A ∗ . . . A using sparse matrix-matrix multiplication, and then to apply the standard distance-1
maximal independent set algorithm. However, computing Ak explicitly is computationally expensive and
the storage for Ak grows rapidly with k. An alternative approach, which is generally superior, uses k > 1 in
Algorithm 5 so that maximum tuples are propagated k times using A, which has the same effect as one step
of Ak. Line 1 of Algorithm 5 illustrates this scheme, with an additional optimization in the form of a second
state-update pass. The second update pass on Line 2 exploits the fact that many nodes can be immediately
classified based on the results of the first pass without another iteration. Specifically, nodes whose maximum
neighbor was classified as an independent set node — i.e., snmax = 0 — can be safely removed from candidacy.
This optimization generally reduces the number of outer iterations by anticipating and efficiently applying
the effect of the next iteration. An example of distance-2 MIS is depicted in Figure 2.11.
In our implementation of Algorithm 5 the random values are produced by a hash function, vi = hash(i),
31
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
9 21 35 22 26 27
18 25 33 8 0 23
3 14 13 17 10 6
19 34 7 4 11 29
2 28 12 31 16 1
30 5 20 15 32 24
Figure 2.11: Parallel MIS construction for k = 2.
where hash is a simple integer hash function. Although not a source of high quality random numbers, the
resulting values are adequate for our purpose. More sophisticated hash-based random number generators
are discussed in [81, 88].
Figure 2.12 shows the results of an empirical test of our MIS(2)-based aggregation algorithm. We compare
the convergence rate of a solver constructed with the standard serial aggregation algorithm against the
convergence rate of a solver constructed with our parallel aggregation algorithm when applied to a small
2D Poisson problem. In each trial the rows and columns of the matrix are permuted randomly using a
high-quality pseudo-random number generator and a two-level hierarchy is constructed from using the serial
and parallel aggregation schemes on the permuted matrix. Randomly permuting the matrix has the effect
of randomizing the order in which the sequential aggregation algorithm visits nodes.
0.72 0.74 0.76 0.78 0.80
convergence factors
0
200
400
600
800
1000
1200
1400
1600
fr
eq
u
en
cy
Serial
Parallel
Figure 2.12: Distribution of convergence factors for serial and parallel aggregation methods.
32
The close agreement of the two distributions offers empirical evidence that our hash-based randomiza-
tion method is adequate for the purpose of computing aggregates in parallel. Indeed, the average (mean)
convergence rates of the serial and parallel methods are 0.7582 and 0.7580 respectively.
2.5.3 Model Problem
Table 2.3 describes the sparse matrices considered in our performance evaluation. We present results on both
structured and unstructured grids derived from a tessellation of the unit square, unit cube, and horseshoe.
The unstructured meshes are constructed with no inherent structure and the nature of the tessellation is
reflected in the sparse matrix, where block and banded patterns are not easily deduced, as they are in the
structured case. We consider this case here, since many preconditioners rely on the structured nature of the
problem, whereas one advantage of AMG is its relative indifference to matrix structure. Additionally, we
also study anisotropic rotated diffusion with an angle of rotation of pi/6 and a diffusion coefficient of 10−3
for two meshes, 5a. and 5bin Table 2.3, using θ = 0.08 for the symmetric strength-of-connection measure.
The problem we consider is a 2D and 3D Poisson problem with Dirichlet boundary conditions. Since
AMG is known to perform well on such problems, this choice allows us to focus directly on the efficiency of
the parallel implementation, rather than on the merits of AMG as a preconditioner.
Matrix Unknowns Nonzeros
1a. 2D FD, 5-point 1,048,576 5,238,784
1b. 2D FE, 9-point 1,048,576 9,424,900
2a. 3D FD, 7-point 1,030,301 7,150,901
2b. 3D FE, 27-point 1,030,301 27,270,901
3a. 2D FE, h ≈ 0.03 550,387 3,847,375
3b. 2D FE, h ≈ 0.02 1,182,309 8,268,165
3c. 2D FE, h ≈ 0.015 2,185,401 15,287,137
4. 3D FE, h ≈ 0.15 1,088,958 17,095,986
5a. 2D FE, horseshoe 853,761 5,969,153
5b. 2D FE, square 832,081 5,817,905
Table 2.3: Details of the model problem. Here h represents an average mesh diameter for the tessel-
lation.
2.5.4 Component Performance
Table 2.5 reports timings for sparse matrix-vector multiplication, sparse matrix transposition, and sparse
matrix-matrix multiplication using MKL and our Thrust-based implementation for the CPU and GPU
results, respectively. All computations use double precision (64-bit) floating point arithmetic. The timings
33
in Tables 2.4 and 2.5 are reported in milliseconds, averaged over 100 operations.
Table 2.4 demonstrates the performance of the level 1 BLAS functions DDOT and DAXPY, which are applied
frequently during the cycling phase of the solver. These algorithms execute very few arithmetic operations
per memory access and are therefore limited by the available memory bandwidth. For the largest input
size the GPU achieves a maximum speedup of 7.8× in DDOT. On the smallest input size, the GPU results
in a more modest speedup of 3.3× in DDOT, which is attributed to fixed costs in our implementation. In
contrast, the DAXPY speedup is relatively constant across input sizes and simply reflects the relative memory
bandwidth of the two processors.
DDOT DAXPY
Size CPU GPU Speedup CPU GPU Speedup
1M 17.00 55.42 3.3 17.69 120.55 6.8
2M 17.59 78.15 4.4 18.05 121.32 6.7
4M 17.70 97.33 5.5 18.29 121.50 6.6
8M 17.96 110.44 6.2 18.27 121.28 6.6
16M 17.61 118.76 6.7 18.11 120.79 6.7
32M 15.81 123.48 7.8 17.78 120.25 6.8
Table 2.4: DDOT and DAXPY bandwidth (GB/s).
Sparse matrix-vector multiplication operations are used heavily in the cycling phase of the solver (cf.
Algorithm 3) and to a lesser extent in the setup phase (cf. Algorithm 1) as well. Table 2.5 reports SpMV
timings using the Hybrid (HYB) format on the GPU and Compressed Sparse Row (CSR) format on the CPU.
As with level 1 BLAS operations, the SpMV operation performs few arithmetic operations per memory access
and is memory bound. However, sparse matrix-vector multiplication performance is also sensitive to the
sparsity structure of the matrix, as it impacts the locality of memory accesses and creates potential variation
in the amount of work per thread. Therefore, SpMV does not always saturate memory bandwidth to the
same degree as algorithms with simpler memory access patterns such as DAXPY (see [10] for a more detailed
study of GPU SpMV performance). Across the cases considered, the GPU achieves an average speedup of
5.9× in comparison to the CPU. Figure 2.13 demonstrates that although the CPU SpMV performance does
benefit from multi-threading; the marginal speedup diminishes rapidly as the available memory bandwidth
is saturated.
A sparse matrix transpose operation is used at each level of in the setup phase (cf. Algorithm 1) to obtain
the restriction operator Rk = P
T
k from the prolongation operator Pk. Our parallel algorithm for transpose is
3.0× faster than MKL on average. We note that the MKL transpose does not benefit from multithreading,
suggesting the that underlying implementation is sequential. Indeed, our own sequential implementation of
sparse matrix transpose offers equivalent performance.
34
SpMV Transpose RAP
Matrix CPU GPU CPU GPU CPU GPU
1a. 2D FD 6.5 0.8 8.2 27.3 9.7 2.8 245.2 186.8 1.3
1b. 2D FE 9.5 1.3 7.2 25.8 10.1 2.6 345.3 286.4 1.2
2a. 3D FD 8.0 1.0 8.1 39.5 14.5 2.7 664.9 432.9 1.5
2b. 3D FE 23.0 3.4 6.7 47.6 16.7 2.9 1314.7 1289.7 1.0
3a. 2D FE 4.2 0.9 4.8 15.0 5.8 2.6 255.4 125.3 2.0
3b. 2D FE 8.9 1.7 5.3 33.7 11.2 3.0 548.3 270.5 2.0
3c. 2D FE 16.8 3.2 5.3 68.7 20.1 3.4 1037.0 497.1 2.1
4. 3D FE 16.3 6.5 2.5 52.6 17.7 3.0 1704.8 1042.1 1.6
5a. 2D FE 8.6 1.4 6.3 29.0 10.4 2.8 493.5 285.4 1.7
5b. 2D FE 10.1 1.9 5.2 38.8 10.0 3.9 588.2 242.6 2.4
Table 2.5: SpMV time / Transpose time / Galerkin Product time and speedups on the GPU in bold.
1 2 3 4
Number of threads
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
G
F
L
O
P
/s
1a
1b
2a
2b
3a
3b
3c
4
5a
5b
Figure 2.13: MKL SpMV GFlops per matrix for a varying number of threads.
The Galerkin Product results represent the cost of constructing the coarse-grid matrix by performing
two sparse matrix-matrix multiplication operations. Whereas the cost of other components, such as BLAS
algorithms, is directly proportional to the input size, the cost of sparse matrix-matrix multiplication is de-
pendent on the specific sparsity patterns of the two matrices and can differ between inputs of the same size.
Our sparse matrix-matrix multiplication algorithm achieves better performance than MKL in all cases con-
sidered, with an average speedup of 1.7×. As with the transpose, MKL’s sparse matrix-matrix multiplication
functionality extracts no observable benefit from multithreading and offers inferior performance to our own
sequential implementation based on the SMMP method [6]. As a result, the CPU performance results in the
remainder of this section only use MKL for sparse matrix-vector multiplication.
35
2.5.5 Setup Phase Performance
Section 2.5.4 demonstrates that GPU implementations of the essential sparse matrix operations achieve
tangible speedups over a competent CPU implementations. In this section we analyze the performance of
the whole multigrid setup phase (cf. Algorithm 1).
Figure 2.14 shows a breakdown of the parallel setup phase into the individual components. The figure
identifies several intensive calculations in the algorithm, namely aggregation and the Galerkin product.
As anticipated, the ability to express the strength-of-connection, tentative prolongator construction, and
prolongator smoothing functions in terms of parallel primitives leads to a efficient execution on the GPU.
Moreover, the figure also shows that matrix conversion is a relatively low cost, enabling the use of SpMV-
optimized matrix formats in the subsequent cycling phase. Moreover, the conversion to HYB format is
independent of the other operations and may be performed at any time prior to the solve phase. Since
matrix conversions are not performed in-place there are two versions of A in memory and all operations
utilize the COO format as input.
0 10 20 30 40 50 60 70 80 90 100
Percent Time
1a
1b
2a
2b
3a
3b
3c
4
5a
5b
M
at
ri
x
L
ab
el
Strength
Aggregation
Tentative
Conversion
Spectral Radius
Smooth
Transpose
Galerkin Product
323.55
495.14
616.47
2103.16
231.55
480.79
879.60
1605.43
493.08
476.98
T
ot
al
T
im
e
(m
s)
Figure 2.14: Relative time on the GPU of each setup phase component on the finest grid level.
The Galerkin product, which is implemented with two sparse matrix-matrix multiplications R ∗ (AP ), is
the most expensive step in all cases considered. The relative cost of the Galerkin product is slightly higher
in the 3D matrix examples (2a, 2b, and 4) compared to the 2D cases.
In Table 2.6, as a baseline, we also include timings from the ML package of Trilinos on the same hardware;
this validates that our CPU approach is in-fact a competitive implementation. ML is configured using the
default parameters in the setup phase and a Jacobi smoother in the solve phase, as opposed to the default
symmetric Gauss-Seidel smoother. We use a damping factor of 57 in the 3D problems as recommended in [34]
for the Jacobi smoother. The GPU implementation is faster in all ten examples, with an average speedup
36
of 1.8× over the CPU reference.
Matrix CPU GPU Speedup Trilinos-ML
1a. 2D FD 868 490 1.8 2040
1b. 2D FE 1062 611 1.7 2298
2a. 3D FD 1624 914 1.8 2906
2b. 3D FE 2849 2212 1.3 4420
3a. 2D FE 638 323 2.0 1324
3b. 2D FE 1436 619 2.3 2785
3c. 2D FE 2811 1047 2.7 5236
4. 3D FE 3092 1811 1.7 4967
5a. 2D FE 1653 1464 1.1 –
5b. 2D FE 1583 904 1.8 –
Table 2.6: AMG setup time for all components (ms).
2.5.6 Solve Phase Performance
In this section we present the results of our AMG preconditioned CG solver for each matrix in our test
suite. The AMG preconditioner uses a single iteration of weighted Jacobi at the of pre- and post-smoothing
steps of the V-cycle. The Trilinos data was collected using the default uncoupled aggregation method with
a tolerance of 10−12 and a maximum of 500 iterations.
Figure 2.15 shows that the main cost of the cycling is in restriction, prolongation, post-smoothing, and
the computation of the residual, which are all sparse matrix-vector multiplication operations. Moreover,
pre-smoothing is less costly than post-smoothing because we exploit the fact that x is initially zero and
replace the smoother x← x+D−1r, with the equivalent expression x← D−1b, with a savings of one sparse
matrix-vector multiply. Furthermore, Figure 2.16 also shows that the entire AMG cycling is on the order of
several matrix vector products, Ap.
Table 2.7 presents the results of the solve phase performance of our CPU and GPU implementations in
addition to results from Trilinos ML. The performance of ML is predictably poor because the solve phase is
executed on a single core, therefore in each iteration the SpMV operation, optimized for distributed execution,
incurs considerable overhead. The increase in the number of iterations required by the GPU reflects the fact
that, on structured problems, sequential aggregation method fortuitously selects square-shaped aggregates,
while the parallel aggregation method based on MIS(2) creates irregularly shaped aggregates. Despite the
increase in the number of iterations performed on the device, the increased SpMV performance using the
HYB format allows the GPU to outperform the CPU in all cases.
The larger number of iterations required to solve the anisotropic problems — i.e., 5a. 2D FE and 5b.
2D FE — on both the CPU and GPU reflects a weakness of the simple strength-of-connection measure
37
0 10 20 30 40 50 60 70 80 90 100
Percent Time
1a
1b
2a
2b
3a
3b
3c
4
5a
5b
M
at
ri
x
L
ab
el
Presmooth
Residual
AXPBY
Restrict
Prolongate
AXPY
Postsmooth
3.71
4.66
5.22
12.42
3.50
7.51
14.45
21.07
6.38
8.13
T
ot
al
T
im
e
P
er
It
er
at
io
n
(m
s)
Figure 2.15: GPU V-cycle time breakdown on the finest grid level.
0 10 20 30 40 50 60 70 80 90 100
Percent Time
1a
1b
2a
2b
3a
3b
3c
4
5a
5b
M
at
ri
x
L
ab
el
A*p
DOTC
AXPY
AXPY
AMG
DOTC
AXPBY
7.57
8.56
10.90
18.69
6.69
13.29
24.74
33.22
14.45
15.33
T
ot
al
T
im
e
(m
s)
Figure 2.16: Total preconditioned solver time breakdown time on the GPU.
considered in this paper. Since our focus is on exposing fine-grained parallelism in a well-known algebraic
multigrid method, we remark that more robust strength-of-connection schemes should be employed to im-
prove convergence for anisotropic problems [68].
In contrast with the results reported in Table 2.7, where two independent hierarchies generated on the
CPU and GPU are used in the cycling phase of the solver, in Figure 2.17 the same multigrid hierarchy is
used on both the CPU and GPU to isolate the performance of the cycling phase per level. The results
demonstrate that the GPU is noticeably faster than the CPU on the first two or three grid levels, which
represents the vast majority of the computation in a multigrid V-cycle. The coarsest grid levels are processed
more rapidly by the CPU, owing to communication latency and the fact that the amount of parallelism on
the smaller grids is insufficient to saturate the GPU. However, since the absolute time spent on such grids
38
Matrix CPU GPU Speedup ML
time it. time it. (per it.) time it.
1a. 2D FD 1225 20 427 51 7.6 14190 33
1b. 2D FE 1082 16 450 46 7.6 10590 22
2a. 3D FD 1755 23 294 27 6.7 14800 31
2b. 3D FE 1679 14 483 24 5.9 13840 20
3a. 2D FE 1535 42 286 49 5.3 14020 53
3b. 2D FE 3726 47 634 54 5.8 34410 68
3c. 2D FE 7787 53 1426 65 5.8 44530 65
4. 3D FE 4343 43 1499 50 3.0 28380 47
5a. 2D FE 27697 397 4092 333 4.8 – –
5b. 2D FE 25716 369 4556 309 4.5 – –
Table 2.7: AMG-PCG GPU solve time (ms), number of solver iterations and per-iteration speedup.
is small, the GPU cycle is faster overall. However, the same is not necessarily true of other multigrid cycles,
such as the F- or W-cycle, which visit coarse grids more frequently than fine grids.
0 1 2 3 4 5
10−2
10−1
100
101 1a
0 1 2 3 4 5
10−2
10−1
100
101
1b
0 1 2 3
10−2
10−1
100
101
2a
0 1 2 3
10−1
100
101
2b
0 1 2 3 4
10−2
10−1
100
101 3a
0 1 2 3 4
Level
10−2
10−1
100
101
3b
0 1 2 3 4
Level
10−2
10−1
100
101
3c
0 1 2 3
Level
10−2
10−1
100
101
4
0 1 2 3 4 5 6
Level
10−2
10−1
100
101
5a
0 1 2 3 4 5
Level
10−2
10−1
100
101
5b
Figure 2.17: CPU(blue) vs GPU(green) solve phase time(ms) per level.
2.5.7 Scalability
One important feature of a successful multigrid method is the ability to scale linearly with matrix size n.
Consequently, we expect our algorithm to also exhibit this scaling in observed setup and solve times. Here
we consider a Example 1a above — i.e., the 2D structured Poisson problem — where algebraic multigrid is
39
known to scale linearly. The problem size is scaled equally in each coordinate direction, and we measure the
wall-clock time of both the setup and solve phase.
The results are highlighted in Figures 2.18 and 2.19. As shown in Figures 2.18a and 2.19a, we observe
sub-linear scaling before the GPU is saturated, and the expected linear scaling after the problem size is
sufficiently large. The fortuitous sublinear growth is consistent with other algorithms that utilize primitives
on the GPU — e.g., see [61]. Indeed, as depicted in Figures 2.18b and 2.19b, if we measure the growth rate
r of time dependence O(nr) over a window of five samples, the scaling settles at a linear relationship.
104 105 106
Matrix Dimension n
100
101
102
Ti
m
e
t
(m
s)
O(√n)
O(n)
O(n2)
measured
(a) Measured setup time in comparison to O(√n),
O(n), O(n2).
104 105 106
Matrix Dimension n
0.0
0.5
1.0
1.5
2.0
G
ro
w
th
r
(b) Approximate rate of growth r in O(nr).
Figure 2.18: 2D structured setup scalability test.
104 105 106 107
Matrix Dimension n
100
101
102
103
Ti
m
e
t
(m
s)
O(√n)
O(n)
O(n2)
measured
(a) Measured solve time in comparison to O(√n),
O(n), O(n2).
104 105 106 107
Matrix Dimension n
0.0
0.5
1.0
1.5
2.0
G
ro
w
th
r
(b) Approximate rate of growth r in O(nr).
Figure 2.19: 2D structured solve scalability test.
40
2.6 Conclusions
We have demonstrated the first implementation of AMG that exposes fine-grained parallelism at all stages
of the setup and cycling phases. In particular, we have described highly-parallel methods for sparse matrix-
matrix multiplication and aggregation. Furthermore, we have shown that parallel primitives are a viable
substrate for describing complex sparse matrix operations and targeting the GPU.
Building upon earlier work in sparse matrix-vector multiplication, we have demonstrated meaningful
speedup in both the setup and cycling phases of the AMG solver on structured and unstructured problems.
Whereas CPU implementations of major components such as sparse matrix-vector multiplication, sparse
matrix-matrix multiplication, and sparse matrix transposition, derive little or no benefit from multithreading,
our implementations leverage scalable parallel primitives.
41
Chapter 3
Sparse Matrix-Matrix Multiplication
Operations on sparse data structures abound in all areas of information and physical science. In par-
ticular, sparse matrix-matrix multiplication (SpGEMM) is a fundamental operation that arises in many
practical contexts, including graph contractions [35], multi-source breadth-first search [18], matching [70],
and algebraic multigrid (AMG) methods [11]. In this chapter we discuss computing sparse matrix-matrix
products efficiently for data parallel environments.
While algorithms operating on sparse matrix and graph structures are numerous, a small set of operations,
such as SpGEMM and sparse matrix-vector multiplication (SpMV), form the foundation on which many
complex operations are built. An analysis of sparse matrix-vector multiplication (SpMV) reveals that the
operation has very low computational intensity — i.e., the ratio of the floating point operations (FLOPs)
to memory accesses — which severely limits the potential throughput of the operation on contemporary
architectures [86, 87]. A common strategy for improving SpMV performance is to exploit a priori knowledge
of the sparsity pattern of the matrix in order to minimize expensive off-chip memory operations. Since the
cost of reformatting the data is non-trivial, generally on the order of 10–20 SpMV operations, this approach
is profitable when the number of subsequent SpMV operations is relatively large.
Although SpMV is a useful starting point for understanding SpGEMM, we emphasize that the latter
is a qualitatively different problem with unique complexities and trade-offs in performance. In particular,
whereas the computational structure of SpMV is fully described by the matrix sparsity pattern, SpGEMM
adds another level of complexity through its dependency on the detailed interaction of two sparse matrices.
Indeed, simply computing the number of floating point operations required by the SpGEMM, or even the
size of the output matrix, is not substantially simpler than computing the SpGEMM itself.
The recent demand for high performance SpGEMM operations is driven by the increasing size of sparse
linear systems [16, 17]. AMG is an important example because the setup phase of the method relies on a
sparse triple-matrix product (ie., the Galerkin product). AMG solvers are generally divided into two phases:
setup and solve [11]. The relative cost of each phase varies, but the setup phase often represents a significant
42
portion (e.g. 25–50%) of the total solve time. Within the AMG setup phase, SpGEMM is the central
performance bottleneck, often accounting for more than 50% of the total setup cost as shown in Figure 3.1.
In contrast, the AMG solve phase is comprised of SpMV and level 1 BLAS operations and therefore readily
accelerated by employing existing highly-optimized GPU implementations [11].
0 10 20 30 40 50 60 70 80 90 100
Percent Time
1a
1b
2a
2b
3a
3b
3c
4
5a
5b
M
at
ri
x
L
ab
el
Other Operations Galerkin Product (SpMM × 2)
321.67
490.52
609.39
2072.12
230.90
477.25
869.67
1579.75
489.86
474.71
T
ot
al
T
im
e
(m
s)
Figure 3.1: Relative cost of the SpGEMM during the AMG setup phase for a series of matrices (see
Table 3.1).
The approach to SpGEMM in [11] is based on a decomposition of SpGEMM into 3 phases: expansion,
sorting, and contraction (ESC). The ESC formulation of the SpGEMM operation is naturally implemented
with data parallel primitives, such as those provided by the Thrust parallel algorithms library [45].
In this chapter we implement and analyze a set of optimizations to enhance SpGEMM performance by
exploiting on-chip memory, whenever possible, and reducing the cost associated with the sorting phase. Our
decomposition of SpGEMM exposes abundant fine-grained parallelism and is amenable to execution on the
GPU architecture. In particular, we develop parallelism at the level of individual matrix rows by studying
the interaction of the sparsity patterns of the input matrices and avoid arbitrarily poor decisions based upon
either of the individual input matrices. The algorithm utilizes an adaptive row processing approach that
improves performance by improving the utilization of multiple levels in the memory hierarchy.
3.1 Background
As described in Chapter 2 GPUs sacrifice serial performance of single thread tasks to increase the overall
throughput of parallel workloads. Effective use of the GPU depends on four key features: an abundance of
fine-grained parallelism, uniform work distribution, high arithmetic intensity [87], and regularly-structured
memory access patterns. Workloads that do not have these characteristics often do not fully utilize the
43
available computational resources and represent an opportunity for further optimization. In this chapter
we seek to characterize the nature of SpGEMM and to decompose the computational work to suit the
GPU architecture. In particular, by concentrating on the intersection of the input matrices and slightly
coarsening the degree of parallelism we greatly reduce the number of off-chip memory references to improve
the arithmetic intensity of the bandwidth-limited SpGEMM operation.
Given two sparse matrices A ∈ Rm×k and B ∈ Rk×n, for k,m, n ∈ N, SpGEMM multiplication computes
C = AB, (3.1)
where C ∈ Rm×n. We denote nnz(A) as the number of nonzeros in sparse matrix A. The sparsity of
A and B implies that both input matrices are represented in a space-efficient format that avoids storing
explicit zero values. Although SpGEMM is related to both the SpMV operation and to dense matrix-matrix
multiplication — e.g., GEMM in BLAS — the formulations and optimizations are fundamentally different.
Both SpMV and GEMM have achieved near optimal implementations on GPUs through regularization of
the data access patterns and algorithmic reformulations [19, 53], approaching the (theoretical) peak limits
of memory bandwidth and arithmetic throughput, respectively.
In contrast to GEMM, SpGEMM operations are highly irregular and may exhibit considerably lower
arithmetic intensity. Techniques to improve performance through sparsity pattern analysis, such as those
for SpMV, are less effective because SpGEMM is in general a fleeting operation, meaning that they are
called at most once for a given set of matrices in most applications. Indeed, whereas the same sparse matrix
participates in hundreds of SpMV operations in the context of a single iterative solver, SpGEMM operations
are generally outside the innermost solver loop.
Sequential SpGEMM algorithms [6, 40], generally operate on sparse matrices stored in the Compressed
Sparse Row (CSR) format, which provides O(1) indexing of the matrix rows, but O(nnz(A)) access to
columns. These methods construct each output row by iterating over the rows of A and for each column entry
performing a scale and reduction operation on the values in the corresponding row from B. To accomplish
this sequential algorithms rely on a large amount, O(N), of temporary storage to efficiently store and reduce
unique entries on any row of C. Therefore, sequential methods focus on constructing the output matrix and
accessing both input operands on a per row basis. Parallel SpGEMM algorithms generally decompose the
matrix into relatively large submatrices and distribute the submatrices across multiple processors for parallel
computation, a strategy used in many computational software packages which use MPI such as Trilinos and
PETSc [4, 44].
The reliance in sequential methods on O(N) storage renders these methods untenable on GPUs, which
44
thrive on workloads in which the per thread state is considerably smaller — i.e., on the order of tens of
values. Furthermore, a straight-forward parallel approach to SpGEMM on the GPU would decompose the
matrices on a per thread or CTA basis which may be advantageous but requires complex decompositions to
avoid unnecessarily high imbalances in the work distribution. The focus of the work here is on developing
fine-grained parallelism to avoid these bottlenecks on the GPU.
3.2 ESC Algorithm
A direct implementation of the ESC algorithm using parallel primitives is “work-efficient” and insensitive
to the sparsity pattern of the matrices [11]. Although SpGEMM is highly unstructured and gives rise to
complex and unpredictable data access patterns, the ESC algorithm distills the computation into a small
set of data-parallel primitives such as gather, scatter, scan, and stable sort by key, whose performance
characteristics are readily understood [63]. As a result, the ESC algorithm with parallel primitives is robust,
reliable and efficient. The high-level structure of the ESC algorithm is summarized in Algorithm 6. To
facilitate the memory constrained nature of the GPU large problems are decomposed into smaller more
manageable slices. The ESC algorithm achieves this decomposition by partitioning A into M submatrices,
where each row of A is in one and only one submatrix. The operation slice(A) in Algorithm 6 generates
subsets of contiguous rows by considering the row-wise memory requirements and selecting a set of rows
to process in parallel based on the amount of available global memory. Next, expand(Ak, B) generates
the products associated with slice Ak and B in coordinate (COO) format, described in further detail in
Section 3.4. Then the sort(Cˆk) operation orders these expanded products by row and column indices. The
sorted products are subsequently processed by the contract(Ck) operation to compute the sum of duplicate
entries and store the reduced set of nonzero entries produced by Ak. Finally, the construct(C) operation
allocates memory sufficiently large to store the number of entries in the output matrix C and concatenates
the slices, Ck, together.
As an example, consider the matrices
A =

10 0 0 0
0 20 30 40
0 0 0 50
0 60 0 0

, B =

1 0 0 0
0 2 0 3
4 5 0 0
0 6 0 7

, C = AB =

10 0 0 0
120 430 0 340
0 300 0 350
0 120 0 180

, (3.2)
45
Algorithm 6: SpGEMM: Reference
parameters: A, B
return: C
1 M ← slice(A) {decompose rows into slices}
for k = 0, . . . ,M
2 Cˆk ← expand(Ak, B) {expand intermediate matrix}
3 Cˆk ← sort(Cˆk) {radix sort Cˆk by row and column}
4 Cˆk ← contract(Cˆk) {contract duplicate Cˆk(row, col) entries}
5 C ← construct(Cˆ) {concatenate slices to form final matrix}
where the COO representation is given by the tuples
A =

(0, 0, 10)
(1, 1, 20)
(1, 2, 30)
(1, 3, 40)
(2, 3, 50)
(3, 1, 60)

B =

(0, 0, 1)
(1, 1, 2)
(1, 3, 3)
(2, 0, 4)
(2, 1, 5)
(3, 1, 6)
(3, 3, 7)

C = AB =

(0, 0, 10)
(1, 0, 120)
(1, 3, 340)
(1, 1, 430)
(2, 3, 350)
(2, 1, 300)
(3, 3, 180)
(3, 1, 120)

. (3.3)
Then the expansion, sorting, and contraction phases yield
Cˆ =

(0, 0, 10)
(1, 3, 60)
(1, 1, 40)
(1, 1, 150)
(1, 0, 120)
(1, 3, 280)
(1, 1, 240)
(2, 3, 350)
(2, 1, 300)
(3, 3, 180)
(3, 1, 120)

−−−−−−−−→
sort

(0, 0, 10)
(1, 0, 120)
(1, 1, 40)
(1, 1, 150)
(1, 1, 240)
(1, 3, 60)
(1, 3, 280)
(2, 1, 300)
(2, 3, 350)
(3, 1, 120)
(3, 3, 180)

−−−−−−−−→
contract

(0, 0, 10)
(1, 0, 120)
(1, 1, 430)
(1, 3, 340)
(2, 1, 300)
(2, 3, 350)
(3, 1, 120)
(3, 3, 180)

= C. (3.4)
Here we see that general sparsity patterns lead to a variety of row lengths in Cˆ. To further illustrate this point
46
consider a sparse random matrix of size n = 200 with an average of 20 nonzeros-per-row (see Figure 3.2a),
yielding a minimum and maximum sort length of 156 and 624, respectively, as shown in Figure 3.2b. Here
nnz(A) = nnz(B) = 3812, and nnz(C) = 33678, while Cˆ contains 75786 entries.
0 50 100 150
0
50
100
150
(a) Random sparse matrix pattern.
100 200 300 400 500 600 700
row length
0.000
0.001
0.002
0.003
0.004
0.005
0.006
0.007
fr
e
q
u
e
n
cy
(b) Distribution of row lengths.
Figure 3.2: Sparse matrix with n = 200 yielding a range of row lengths in Cˆ.
In the following, for a sparse matrix A we denote by Arowi the i
th row of A (and similar for columns),
while NNZ(Arowi) denotes the set of nonzero column indices. The ESC process for (3.1) follows from the inner
product view of multiplication:
Ci,j = Arowi ·Bcolj =
∑
k
Ai,kBk,j . (3.5)
From this we see that simultaneous access to Arowi and Bcolj is necessary to construct entry Ci,j . Yet, there
are two issues to address when considering the inner product formulation of SpGEMM: intersecting sparsity
patterns and sparse storage formats. Intersecting sparsity patterns requires the categorization of all of the
entries in C into zero and nonzero values. The work performed in the SpGEMM should avoid operations
on zero values of Ci,j , which implies row i of A and column j of B have non-intersecting sparsity patterns.
However, to identify non-intersecting sparsity patterns the na¨ıve inner product formulation requires explicit
checking of all possible mn entries in C thereby generating excessive data movement when C is also sparse.
Another potential problem with the inner product formulation is the access pattern of entries in A and
B. As noted earlier, if B is stored in CSR format then accessing entries on a per column basis results in
significant overhead and should therefore be avoided whenever possible. We focus on CSR based approaches
but note that our algorithms are equally applicable to CSC based data structures since the transpose of any
matrix stored in CSR is equivalent to the original matrix stored in CSC format. Therefore if the underlying
data-structure for both matrices is CSC, then multiplying the matrices in reverse order yields the transpose
47
of our CSR base approach.
In the following sections, we consider the basic form of our ESC algorithm [11] in Algorithm 6 as the
reference implementation. However, there are several limitations with this approach. First, implementing
the operation using parallel primitives forces many data movement operations in global memory between
primitives. Moving through global memory between operations ignores more efficient use of registers and
shared memory to seamlessly process data from successive phases locally. Second, by staging values in global
memory and relying on radix sorting, which is not in-place, to order the intermediate matrix the amount of
temporary global memory required by the method is significant. Lastly, although radix sorting on GPUs is
fast and efficient it is a O(kN) algorithm, with k representing the number of passes, and requires random
accesses to reorder data in global memory. In addition, the costs are compounded by requiring two sorting
operations, first by column index and then by row index, to ensure the intermediate format is in proper
format for contraction.
To motivate where in the ESC algorithm (Algorithm 6) we focus our optimizations, we consider a set of
matrices for A that result from a discretization of a Poisson problem, −∇ ·∇u = 0, with Dirichlet boundary
conditions and an average mesh diameter h in the case of unstructured tessellations. The matrices considered
are outlined in Table 3.1, along with several additional test problems from the University of Florida Sparse
Matrix Collection and previously found in GPU SpMV data sets [10, 11, 28]. Here, cases 1 and 2 are
structured, while 3–5 are unstructured tessellations. For matrix B, we generate an interpolation matrix
through smoothed aggregation-based AMG [11].
Figure 3.3 shows the per-phase-cost associated with the reference ESC implementation. Note the negli-
gible overhead in the analysis (setup) phase, where the GPU memory constraints are used to decompose the
formation of Cˆ row-wise, in contrast to the substantial overhead associated with the sorting phase. In the
following sections we detail a method of work decomposition to increase the use of shared memory through all
phases of the operation and improving the sorting performance by reducing N in the radix sorting algorithm.
3.3 Setup
A straight-forward approach to processing SpGEMM operations in parallel would process the input
matrices using the natural ordering of the operands and assign a fixed number of threads and memory
per row of the output matrix. However, if C is constructed row-wise then assigning a fixed number of
computational units per row of the output matrix may result in significant load imbalance. To illustrate,
48
Row-wise stats
Matrix n nnz min max mean
1a. 2D FD 1 048 576 5 238 784 3 5 4.99
1b. 2D FE 1 048 576 9 424 900 4 9 8.99
2a. 3D FD 1 030 301 7 150 901 4 7 6.94
2b. 3D FE 1 030 301 27 270 901 8 27 26.47
3a. 2D FE 550 387 3 847 375 4 11 6.99
3b. 2D FE 1 182 309 8 268 165 4 11 6.99
3c. 2D FE 2 185 401 15 287 137 4 11 6.99
4. 3D FE 1 088 958 17 095 986 5 38 15.69
5a. 2D FE 853 761 5 969 153 3 7 6.99
5b. 2D FE 832 081 5 817 905 4 10 6.99
Cantilever cant 62 451 4 007 383 1 78 64.17
Spheres consph 83 334 6 010 480 1 81 72.13
Accelerator cop20k A 121 192 2 624 331 0 81 21.65
Economics mac econ fwd500 206 500 1 273 389 1 44 6.17
Epidemiology mc2depi 525 825 2 100 225 2 4 3.99
Protein pdb1HYS 36 417 4 344 765 18 204 119.31
Wind Tunnel pwtk 217 918 11 634 424 2 180 53.39
QCD qcd 5 4 49 152 1 916 928 39 39 39
Webbase webbase-1M 1 000 005 3 105 536 1 4700 3.11
Table 3.1: Test problems of square matrices (n×n) with nnz nonzeros. The average number of entries
for the companion B matrices are 2.93 and 3.79 for the AMG and SpMV matrices respectively.
0 10 20 30 40 50 60 70 80 90 100
Percent Time
1a
1b
2a
2b
3a
3b
3c
4
5a
5b
M
at
ri
x
L
ab
el
Setup Expansion Sort Contraction Finalize
96.82
197.72
205.93
908.45
74.65
174.23
327.30
574.06
94.82
97.85
T
ot
al
T
im
e
(m
s)
Figure 3.3: Component-wise performance of the reference ESC SpGEMM operation.
the minimum number of FLOPs associated with forming Crowi is proportional to
∑
j∈NNZ(Arowi )
nnz(Browj ).
This quantity represents the total number of products required to scale each row of B referenced by each
column entry within the row.
In Figure 3.4 the variation in the intermediate matrix, with respect to the number of products, is depicted
49
for several test instances described in Section 3.7. The histograms are formed by grouping each row of Cˆ
according to total number of products. We use a coloring scheme to draw attention to regions of interest in
which processing rows at various granularities of parallelism on the GPU may be advantageous. For example
the regions in blue represent rows of length less than 128 which could be processed by a single warp, while
regions in green range from 128 to 1024 benefit from a CTA oriented processing scheme. Finally, rows larger
than 1024 are highlighted in red to indicate processing with multiple CTAs necessitating the use of global
memory to communicate intermediate results. Figure 3.4d highlights the particularly challenging nature
of some SpGEMM instances, note that the row lengths of Cˆ vary so dramatically in size that we plot the
logarithm of the row sizes and therefore this SpGEMM instance generates substantially diverse worloads per
row.
Based on Figure 3.4 we conclude that any static assignment of computational units to process a fixed
number of rows of the input matrix, A, could lead to arbitrarily poor load balance and possible degradation
in performance. One strategy for avoiding this load imbalance is to implement the entire algorithm in
terms of parallel primitives [11]. While this approach thoroughly eliminates load imbalances it does so at
the cost of significant data movement between stages. An alternative method relies on dynamically scaling
computational units to address the data-dependent workloads. In a GPU architecture, the allocation of
computational units should account for processing small workloads completely in shared memory as opposed
to global memory, and scaling should take advantage of the parallel execution across arbitrary groups of
threads within a CTA and multiple CTAs. While ideal, this dynamic scaling is difficult to implement
effectively in software. Therefore we use a different approach based on the work distribution model that
groups rows into several categories that are processed using the most appropriate method.
We observe that C may be assembled in any order, thus we may permute the input matrices to achieve a
grouping of the output rows which yield a favorable use of the computational units. Our scheme is based on
reordering the output matrix rows by the amount of computational work in the model. This sorting yields
a permutation matrix P for C and implies that PC = PAB which translates into processing the rows of
A in permuted order. The permuted order of A groups rows of similar total work and places the rows in
non-decreasing order of the work-per-row, Algorithm 7. Identification of rows to be processed by individual
threads, warps, or CTAs may be carried out using a model parameterized by the size and number of rows
fulfilling predefined conditions. Our strategy is to use a splitting to decompose the rows into units that are
processed within a targeted group of threads using parallel primitives. One drawback of this approach is
that C is generated in a permuted order and must be sorted by row before the final output is generated.
However, in practice we find that this reassembly cost is relatively low.
50
4 6 8 10 12 14
Cˆ Row Size
10−1
100
101
102
103
104
105
106
N
u
m
b
er
of
R
ow
s
(a) 1a. 2D FD, 5-point
0 100 200 300 400 500
Cˆ Row Size
101
102
103
104
N
u
m
b
er
of
R
ow
s
(b) Cantilever
0 50 100 150 200 250 300 350 400 450
Cˆ Row Size
100
101
102
103
104
105
106
N
u
m
b
er
of
R
ow
s
(c) Wind Tunnel
0 2 4 6 8 10 12 14 16
Cˆ Row Size
100
101
102
103
104
105
106
N
u
m
b
er
of
R
ow
s
(d) Webbase
Figure 3.4: Distribution of Cˆ row lengths for SpGEMM operations in Tables 3.4,3.5. Rows are grouped
by color: 0–127 (Blue), 128–1024 (Green), and ≥ 1024 (Red).
Algorithm 7: SpGEMM: reorder
parameters: A, B {A ∈ Rm×k and B ∈ Rk×n}
return: P {reordering vector}
Fi = 0 for i = 1 to m
foreach row i in C do
for j ∈ NNZ(Arowi)
Fi ← Fi + nnz(Browj ) {gather B row lengths based on A column indices}
P ← sort(F ) {set P to permutation of F in non-decreasing order}
3.4 Expansion
As illustrated in Figure 3.5, the expansion phase expands scaled rows of B into an intermediate buffer.
Expanding B row-wise ensures efficient access when the underlying sparse storage format is CSR and all
expanded entries contribute to the nonzero entries in C. The expanded memory buffer consists of row
51
indices Iˆ, column indices Jˆ , and values Vˆ , which we collectively denote as Cˆ = (Iˆ , Jˆ , Vˆ ). The formation of
Cˆ requires gathering possibly disparate rows from B dictated by the column indices of A. Loading rows of
B in a random manner limits the benefit of coalescing and therefore precludes fully utilizing the memory
bandwidth of the GPU. In particular, if a fixed unit of threads is assigned to load rows from B, and the
average row length is significantly smaller than the number of assigned threads, then many threads are either
idle or loading unrelated entries from adjacent rows. In contrast if the average row length of B is significantly
larger than the number of assigned threads, then multiple sequential loading phases are required to process
the row.
A	   B	  
1 2 5 1 2 1 5column	  
scale	  
store	  
value	  
1 1 1 2 2 5 5row	  
1	   2	   3	   4	   5	   6	  
1	  
2	  
3	  
4	  
5	  
6	  
Figure 3.5: Scaling rows of B by column indices corresponding to nonzero entries of A and storing in
intermediate buffer.
To address the deficiencies in the expansion phase we adopt a formulation of SpGEMM as a layered graph
model [24]. Each input matrix is represented as a bipartite graph with vertices defined by the individual
rows and columns in the matrix. For each nonzero entry,(i, j, v), in the matrix, a directed edge is added
between the row i and column j vertices in the bipartite graph with weight v. The bipartite graphs are then
concatenated — i.e., layered — by joining the graphs along the inner dimension vertex sets. The equality of
the cardinality of the joined vertex sets is assured by assuming the proposed multiplication is well-posed —
i.e., the inner matrix dimensions agree. As an example, we illustrate the layered model diagram in Figure 3.6
52
using matrices A and B defined as
A =

x 0 x 0
0 x 0 x
0 x x 0
x 0 0 x

and B =

x x 0 0
x x x 0
0 x x x
0 0 x x

. (3.6)
ROWS
A
COLS
ROWS
B
COLS
(a) Layered graph model of AB.
ROWS
A
COLS
ROWS
B
COLS
i
k2 k4
j
(b) Paths contributing to an entry.
Figure 3.6: Schematic of graph-based sparse matrix multiplication.
In the layered graph model any nonzero, Ci,j , in the output matrix corresponds to a path to vertex j
in the column set of B from vertex i in the row set of A. A weight is attributed to all paths according to
a binary operation — e.g., multiplication in the case of SpGEMM — on the weight of the individual edges
traversed by the path. Based on this formulation the expansion phase is an operation on graphs rather
than algebraic structures and enumerating the entries which contribute to all output nonzeros is recast as
computing all-pairs-all-paths between the column set of B and the row set of A.
By viewing the expansion phase from a graph perspective we see that expansion is a candidate for a
breadth-first-search (BFS) of the levels in the layered model. BFS traversals are effectively mapped to
GPUs using efficient expansion methods designed to dynamically scale the number of threads expanding the
frontier of a single vertex within a CTA [60]. Starting from the source vertices in the layered model — i.e.,
vertices in the bipartite graph with an in-degree of zero — is unnecessary because the first expansion is
implicitly defined as all the column indices in A. Therefore the column indices of A identify the vertices in
the frontier which corresponds to rows of B which must be expanded. However, in contrast to previous BFS
implementations [60], the edges in the layered model are weighted. Moreover, although duplicate vertices
appear in the frontier, the distinct weights associated with the edges prevent the removal of duplicates, to
reduce redundant expansion operations.
The work in the expansion phase is decomposed at the granularity of one thread per nonzero entry in
A. Each thread within the warp or CTA computes the length of the row referenced from B and expansion
53
Algorithm 8: SpGEMM: Expansion
parameters: A,B
return: Cˆ = (Iˆ , Jˆ , Vˆ )
foreach row i in C do
for k ∈ NNZ(Arowi) {Note Ai,k is stored in shared memory for reuse}
for j ∈ NNZ(Browk)
1 Iˆ ← [Iˆ , i] {implicit row index}
2 Jˆ ← [Jˆ , j] {append column index}
3 Vˆ ← [Vˆ , Ai,k ∗Bk,j ] {append value}
proceeds using either fine-grained scan-based or cooperative warp or CTA expansion routines [60]. The
expansion phase is therefore efficient and the imbalance between CTAs is negligible. To reduce the costs
of repeated loading of values from A, each thread stores their entry from A to shared memory. Once the
row corresponding to the given thread is expanded the column indices are stored in either local registers
in preparation for the impending sorting operation outlined in the following section or streamed to global
memory along with the floating point values if global memory processing is necessary. Prior to streaming
the floating-point values from shared memory the per-thread values from A are broadcast to shared memory
and the entries from B are scaled appropriately. A high-level description of the expansion phase is outlined
in Algorithm 8 where the loop over entries in each row of A on lines 2 and 3 is decomposed at the granularity
of the thread group, which may be a thread, warp, or CTA.
3.5 Sorting
The expansion phase generates a partially sorted matrix, Cˆ, in coordinate format (cf. (3.4)) with duplicate
entries. Since there are an undetermined number of duplicates for any column entry j within the extent of
row i, the partial ordering creates a bottleneck in reducing the duplication. To do this efficiently, Cˆ is first
sorted by row and column, transforming Cˆ into a sorted format with duplicate entries in adjacent locations.
Figure 3.3 underscores the expense of the sorting performance, which shows it is the dominant cost in the
reference ESC algorithm.
Since sorting is the dominant expense we focus on improved SpGEMM performance by either employing
a faster sorting algorithm or exploiting our knowledge about the range of input values. Figure 3.7 illustrates
the potential speedup in the sorting performance yielded by two such improvements. By default the Thrust
sorting algorithms allocate and free large amounts of temporary memory each time they are invoked, which
54
represents a non-trivial cost. Using the preallocated memory interface1 we improve the sorting performance
of our previous SpGEMM implementation by minimizing the number of allocations. As a comparison,
Figure 3.7 also captures the performance of the back40computing (B40C) implementation [59] from which
the Thrust sorting implementation was derived. The B40C radix sort allows specializations in the number
and location of the sorting bits. We exploit this feature to achieve optimal sorting by noting that the total
number of bits in the row and column indices of Cˆ are dlog2(m)e and dlog2(n)e.
103 104 105 106 107
Number of array entries
0.00
0.05
0.10
0.15
0.20
0.25
0.30
G
ig
a-
en
tr
ie
s
so
rt
ed
/S
ec
on
d
Thrust
Thrust(cache)
B40C
B40C(opt)
Figure 3.7: Sorting performance comparison of Thrust and B40C routines.
Figure 3.7 highlights the limited gains achieved by focusing on improving global sorting methods, therefore
we concentrate on sorting within the GPU’s higher-bandwidth shared memory for increased efficiency. We
observe that Cˆ consists of a collection of rows of various lengths which may be processed in parallel since
there are no dependencies between matrix rows. There are two advantages of operating on a per row basis:
1) two global sorting operations over millions of entries are replaced by numerous operations over possibly
tens to thousands of entries and 2) sorting the intermediate entries using shared memory reduces global
memory operations. However, row-wise sorting using shared memory places tight bounds on the number of
intermediate entries which may be processed per thread group precluding the use of such a method for rows
which exceed the maximum shared memory space. Consequently, we identify the rows that violate this space
constraint during the analysis phase and process them using the global memory ESC algorithm outlined in
Chapter 2.
The localized shared memory sorting routine is implemented using the highly efficient CTA oriented
radix sorting implementation exposed by the CUB programming library, a collection of CTA primitives [59];
we implement thread and warp variants. To address the possible inefficiency of attempting to sort a highly
1Introduced in version 1.6 of Thrust
55
varying workload using a static number of threads we scale the number of threads per row of Cˆ proportionally
with the maximum number of entries produced during the expansion. Therefore if nnz(Cˆrowi) ≤ αthread we
sort each row within a single thread using a sorting network in an “embarrassingly parallel” manner. This
optimization dramatically reduces the overall costs of the sorting phase by completely decoupling the threads
and preferring the execution of the sorting phase in registers over shared memory. Similarly if nnz(Cˆrowi)
∈ (αthread, αwarp] then each row is assigned to 1 (32-thread) warp and ordered using radix sort. The remaining
rows in the range of (αwarp, αcta] are processed using an entire CTA.By scaling the number of registers and
threads on a per row basis our approach reduces the number of the wasted memory operations caused by rows
whose size does not perfectly match any of our targeted sorting boundaries and allows the cost of the sorting
pass to scale proportionally with the size of the row. The values of [αthread, αwarp, αcta] are parameters that
may be set — e.g., we use [32, 736, 6144] in our tests. We note that by sorting rows independently in registers
or shared memory that additional optimization techniques, such as packing permutation into column indices,
are applicable though highly dependent on the specific SpGEMM instance.
Algorithm 9: SpGEMM: Sorting
parameters: n: number of columns in B
Cˆ = (Iˆ , Jˆ , Vˆ ) : column indices, Jˆ , reside in shared memory
return: Cˆ, P {Jˆ sorted row-wise and permutation vector P}
foreach row i in C do
1 J, V ← extracti Jˆ , Vˆ {extract entries where Iˆ ≡ i}
2 m← nnz(J) {m is the number of expanded entries}
3 J, P ← key value sort(J, [0,m]) {keys-value sort}
The extracti function in Algorithm 9 refers to the subset of entries on row i of Cˆ. Note that because
short rows are generated in shared memory using cooperative warp or CTA methods during the expansion
phase then references to the subset of entries on any particular row of Cˆ simply refers to the shared memory
region with no additional processing required. In the case of global memory processing any specific row
of Cˆ may be extracted by computing the prefix-sum of the number of expanded products per row. This
information is readily available as a byproduct of the analysis phase and therefore extracti represents a
CSR style referencing of row i by the offset of the first entry in that row.
3.6 Contraction
The next computationally intensive phase contracts the values associated with duplicate column indices
56
in Cˆ using pairwise addition. In contrast to the predictable nature of the total work required to construct
Cˆ in the expansion phase, the number of duplicates, and therefore the number of FLOPs to form any row
of C is not easily known a priori and often varies significantly between rows in Cˆ. As noted previously
in Section 3.4 the structure of row Cˆrowi is dependent on the set of rows from B referenced by the column
indices in Arowi .
The irregularity of the work required to reduce duplicate entries per row in Cˆ may cause imbalance in
the contraction phase if rows are contracted using a fixed number of threads. The reduce by key function
in Thrust avoids imbalance [11] by reducing duplicates in adjacent locations at the granularity of a fixed
number of entries per CTA irrespective of the number of duplicates per row. While reduce by key is general
and avoids excessive imbalance, it relies on constructing keys and values in global memory.
Algorithm 10: SpGEMM: Contraction
parameters: Cˆ = (Iˆ , Jˆ , Vˆ ), P {P permutation which sorts Cˆ row-wise}
return: C
foreach row i in C do
1 v ← 0 {initialize output value}
2 J, V ← extracti Jˆ , Vˆ {extract entries where Iˆ ≡ i}
for j = 0, . . . , nnz(Cˆrowi) {Segmented scan by keys, J, over values in V }
3 v ← v + V [Pj ] {reduce consecutive values}
if J [j] 6= J [j + 1] {J[j + 1] marks beginning of new nonzero entry}
4 Ci,J[j] ← v {Store accumulated value to output row}
5 v ← 0 {re-initialize output value}
Storing the keys and values in global memory for relatively long rows allows multiple processing units
to work cooperatively to reduce values but ignores possible optimizations associated with utilizing shared
memory storage. Following the sorting phase outlined in Section 3.5 the column indices, in nondecreasing
order, and the permutation which achieves the sorted ordering are stored in shared memory. Algorithm 10
describes the construction of the reduced set of entries on any row of C by performing both reduction and
store operations. This contraction phase may therefore be decomposed into two phases, the reduction of the
scaled values corresponding to the same column index and the storage of entries into C. First, the scaled
values, which are computed and stored in temporary global memory during the expansion phase, are streamed
into registers in sorted order using the precomputed permutation indices. Then, the summation of values
associated with identical column indices are computed using a specialized segmented scan operation which
immediately stores the column index and value corresponding to the last duplicate entry in each segment.
The most inefficient portion of this value contraction algorithm is the loading of values from global memory
57
in permuted ordering, however this penalty is mitigated by the implicit spacial locality of the referenced
values.
3.7 Evaluation
In this section, we examine the performance of a GPU implementation of the proposed ESC method. All
of the operations are performed using double precision with error-correcting code (ECC) memory support
disabled and the times reported are the average for 10 iterations. Though ECC is an important feature to
mitigate random errors in DRAM memory it also decreases the peak achievable bandwidth on the device.
We refer to our proposed approach as “Optimized”2 and compare against the reference ESC described in
Chapter 2 as well as the Cusparse SpGEMM implementation [64]. Our system is configured with CUDA
v5.0 [67] and Thrust v1.7 [45], and all tests are performed using an Nvidia Tesla C2075 [65].
3.7.1 SpGEMM
Intermediate Factors
We characterize the SpGEMM multiplication pairs by the expansion and contraction factors associated with
the intermediate matrices. The expansion factor, nnz(Cˆ)/nnz(A), describes the ratio of the number of
memory references from A to the number data movement operations from B. A relatively large expansion
factor indicates that the number of load operations per memory reference is high. The contraction factor,
nnz(Cˆ)/nnz(C), describes the ratio of the number of unique entries in C to the number of duplicates in
the expanded format. A relatively large contraction factors indicate a contraction phase with relatively few
FLOPs.
Table 3.2 illustrates the variation in the expansion and contraction factors possible for computing the
inner, ATA, and outer, AAT , products of a random, sparse matrix with dimensions 10242 × 1024 and a
density of 10−3. For the inner product the expansion phase consists of a large collection of sparse rows
resulting in a contraction phase with a large number of duplicates. In contrast the outer product expansion
phase consists of a small collection of rows with many entries which do not contain duplicates and the
contraction phase therefore has relatively little work. The large variation in the intermediate factors of this
example illustrates the need for an adaptive method.
2Test matrices and code are avaiable at http://lukeo.cs.illinois.edu/spgemmdata/index.html
58
Expand: nnz(Cˆ)/nnz(A) Contract: nnz(Cˆ)/nnz(C)
Min Max Mean Std Min Max Mean Std
Inner 1.00 1.14 1.05 0.02 7.50 108.00 20.19 9.49
Outer 73.0 140.0 105.99 10.80 1.00 1.01 1.00 0.00
Table 3.2: Expansion and contraction factors for a 10242 × 1024 matrix.
Performance
Table 3.3 outlines the performance for each phase of the ESC algorithm for a few representative matrices.
The companion operator, B, used in all operations is generated using a smoothed aggregation interpolation
matrix found in algebraic multigrid methods [11]. The interpolation operators typically have the form of
a scattering operation. In particular, in this example interpolation uses a distance-2 maximal independent
set (MIS-2) to generate disjoint vertex subsets. This pattern would result in one entry per row of P with
the column indices indicating the MIS-2 set each vertex is assigned. In smoothed aggregation these subsets
are extended to overlap to improve numerical properties of the solver at the expense of increasing the work
during multiplication. The cost of the analysis phase varies with the properties of the input matrices but
remains a relatively small overhead compared to the overall cost of SpGEMM.Although for some matrices,
such as the anisotropic horseshoe and square matrices, the analysis consumes more than 20% of the total
execution time the total improvement in the performance compared to the reference version is evident from
Table 3.4. The SpGEMM portion of the processing time completely encompasses the time required to process
the rows of C in a batch oriented manner based on the entries of the Cˆ. As a consequence of processing
the rows of C in ascending order according to the number of entries in Cˆ there is an additional overhead
in the form of reordering the final matrix. Though reordering increases the total time per operation it is
negligible compared to both the analysis and SpGEMM times. We note that in the special case where all
Cˆ row lengths are less than 32, processing of rows uses the natural ordering which avoids the overhead of
reordering C.
Table 3.4 presents the speedup of the optimized SpGEMM over the dataset outlined in Table 3.1. The
average speedup of our proposed method over the global processing approach utilized in the reference version
of the ESC algorithm is 3.1 for AB.The properties of the B operator, in this case, allow the product AB to be
favorable for the ordered approach for several reasons. As illustrated in Figure 3.4 many of the intermediate
row lengths are relatively small and may be processed completely within a single thread, warp, or CTA,
thus avoiding the cost of resorting to global memory operations. In addition, the small row lengths coupled
with the fact that B ∈ Rn×k, where k is typically a constant factor smaller then n, allows the intermediate
sorting routines to exploit the faster keys-only sorting variant by implanting permutation indices in the upper
59
Matrix Analysis
Expand
Sort
Contract
Reorder
1a. 2D FD, 5-point 4.6 14 28.7 86 0.0 0
1b. 2D FE, 9-point 7.5 16 39.8 84 0.0 0
2a. 3D FD, 7-point 5.8 10 52.0 90 0.0 0
2b. 3D FE, 27-point 17.9 11 146.7 87 3.4 2
3a. 2D FE, h ≈ 0.03 4.5 18 20.2 82 0.0 0
3b. 2D FE, h ≈ 0.02 8.0 7 110.6 92 1.7 1
3c. 2D FE, h ≈ 0.015 14.5 7 198.0 91 4.1 2
4. 3D FE, h ≈ 0.15 13.1 8 142.7 89 4.2 3
5a. 2D FE, horseshoe 4.9 17 24.3 83 0.0 0
5b. 2D FE, square 5.3 17 26.1 83 0.0 0
Table 3.3: Time (ms) and percentage in each phase of the optimized algorithm.
bitfield of the column indices.
Total Time
Matrix Ref Opt Cusparse
1a. 2D FD, 5-point 98.2 33.3 3.0 135.5 0.78
1b. 2D FE, 9-point 186.6 47.3 4.0 206.2 0.90
2a. 3D FD, 7-point 196.6 57.8 3.4 428.9 0.46
2b. 3D FE, 27-point 820.1 168.0 4.9 1633.0 0.50
3a. 2D FE, h ≈ 0.03 75.7 24.7 3.1 168.2 0.45
3b. 2D FE, h ≈ 0.02 166.6 120.3 1.4 368.8 0.45
3c. 2D FE, h ≈ 0.015 323.9 216.6 1.5 682.4 0.47
4. 3D FE, h ≈ 0.15 567.2 160.0 3.5 1269.2 0.45
5a. 2D FE, horseshoe 94.5 29.2 3.2 162.4 0.58
5b. 2D FE, square 95.0 31.4 3.0 412.9 0.23
Table 3.4: C = AB run time (ms) and speedups (h is an average diameter).
In Figure 3.8 the intermediate expansion and contraction factors for each of the matrices in Table 3.1
are presented as well as the corresponding standard deviation. It is clear that the maximum and minimum
intermediate factors may vary substantially between rows of Cˆ and therefore to achieve high efficiency the
SpGEMM method must adapt at runtime to accommodate these features.
The matrices outlined in Table 3.1 exhibit negligible variations in the number of entries per row in the
intermediate matrix. As shown in Figure 3.8 the standard deviation of the expansion phase is moderate for
many of the matrices and the mean expansion factor is less than 5.0 in all cases. These two factors impact
the sorting phase because together they imply that many of the intermediate rows are roughly of equal
length with the total number of entries in each row a small constant factor larger than the corresponding
row from A. As such these matrices may not fully capture the imbalances which may be present in more
general sparsity patterns giving rise to intermediate matrices with highly varying expansion, sorting, and
60
1a 1b 2a 2b 3a 3b 3c 4 5a
Matrix
0
5
10
15
20
25
30
35
E
x
p
an
si
on
F
ac
to
r
Min
Max
Mean
(a) Expansion
1a 1b 2a 2b 3a 3b 3c 4 5a
Matrix
0
2
4
6
8
10
12
14
16
C
on
tr
ac
ti
on
F
ac
to
r
Min
Max
Mean
(b) Contraction
Figure 3.8: SpGEMM expansion and contraction factors for test matrices.
contraction components.
To address this we conduct a similar set of tests using a small subset of the matrices outlined in the
GPU SpMV dataset [10]. The B operator was generated in a similar manner outlined in the previous
dataset — i.e., through an AMG interpolation matrix. This class of SpGEMM operations are susceptible to
extreme variations in the all phases which places an increased concern on the applicability of our proposed
optimizations to improve the performance.
In Table 3.5 we note that our optimized ESC achieves performance at least on par with the reference
version. A particularly challenging test case involves the Webbase matrix which originates from a scale free
graph and therefore generates a intermediate matrix with a rich diversity of row lengths. However, since
many of the rows are small (due to a power law) the total number of intermediate entries in Cˆ is not expected
to be large. This allows for the reference ESC method to process the entire operation in a single pass and
removing any sensitivity to the jagged nature of the workload.
Total Time
Matrix Ref Opt Cusparse
Cantilever 115.9 33.9 3.4 145.5 0.80
Spheres 170.2 41.8 4.1 247.3 0.69
Accelerator 67.9 30.1 2.2 216.2 0.31
Economics 45.0 37.5 1.2 67.5 0.67
Epidemiology 53.3 16.6 3.2 71.7 0.74
Protein 113.7 36.0 3.2 181.7 0.63
Wind Tunnel 205.7 50.2 4.1 446.2 0.46
QCD 83.2 28.5 2.9 96.7 0.86
Webbase 152.9 149.1 1.0 3091.3 0.05
Table 3.5: AB run time (ms) and speedups.
61
Finally in Table 3.6 we present data for C = A2 (cf. Table 3.5) to illustrate the effectiveness of our
method outside of the context of computing C = AB. Notably our method consistently outperforms the
reference ESC implementation on this dataset by utilizing shared memory more efficiently and achieving up
to almost seven times performance improvement. Compared to Cusparse our new method is comparable
in many cases and substantially outperforms Cusparse for matrices such as the Accelerator. Though we
cannot state definitively the reason for this considerable improvement we speculate it is connected to the
analysis phase of our optimized approach. During analysis it is discovered that 80% of the Cˆ rows generated
by A2 are less than 1024 elements for the Accelerator matrix. Adapting to this knowledge the optimized
ESC algorithm is capable processes this set of short rows in approximately 60 milliseconds. Conversely we
find that for the Protein matrix our approach still outperforms the reference version but is considerably
slower than Cusparse. During the analysis phase for this matrix we find that only half of the input rows
are capable of being processed in shared memory. The remaining rows must be processed using the global
memory variant which mimics the reference performance, yielding only modest performance.
Total Time
Matrix Ref Opt Cusparse
Cantilever 1979.6 390.5 5.1 469.1 4.2
Spheres 3410.6 706.5 4.8 1127.3 3.0
Accelerator 629.5 113.4 5.6 1052.9 0.6
Economics 63.2 40.9 1.5 136.5 0.5
Epidemiology 65.3 19.8 3.3 54.9 1.2
Protein 4111.9 3442.5 1.2 1227.7 3.4
Wind Tunnel 4641.0 628.5 7.4 1580.9 2.9
QCD 528.4 78.4 6.7 333.3 1.6
Webbase 571.9 308.5 1.9 1558.2 0.4
Table 3.6: A2 (ms) and speedups
3.8 Conclusion
In conclusion we have presented a new formulation of our global sort based SpGEMM operation that
exhibits notable speedup by exploiting the row-wise processing of the intermediate matrix. In order to study
and process the intermediate matrix more effectively we presented a reordering scheme to identify the number
of total entries per row of the intermediate matrix and adaptively tune the sorting implementation to reduce
the costs of global sorting in favor to localized schemes. While our method does not provide speedup in all
cases, we have shown that by performing a lightweight analysis phase it is possible to mitigate the overhead
of global memory in favor of shared memory operations. We note that na¨ıve strategies to selectively process
62
the SpGEMM computations based on isolated analysis of the input matrices does not adequately capture
the complexity of the SpGEMM and a combined approach considering contributions from both yields more
optimization opportunities. Though our experiments consisted of SpGEMM instance originating from AMG
we stress that this feature is immaterial with respect to our analysis and processing scheme because of our
explicit focus on the intermediate characteristics of the histogram model.
Though effective at utilizing faster shared on-chip registers and shared memory our method does not
achieve substantial improvement over the global memory approach when the total number of entries per row
in the intermediate format exceeds the maximum shared memory size. Specifically, with respect to AMG
two SpGEMM orderings are possible (PTA)P and PT (AP ). We perform our computations using the later
AP formulation as it naturally exposes more parallelism, in terms of rows, then the former. In future work
we plan to selectively process large rows in global memory more effectively by modeling the performance
trade-offs of more sophisticated schemes versus any additional analysis or processing overhead.
63
Chapter 4
Stable Sparse Matrix Operations
Operations that are naturally decomposed into a fixed number of equally sized components are inherently
amenable to parallel processing. However, processing operations in which the workload is highly irregular and
data-dependent is substantially more challenging. For fine-grained parallel processing environments, such as
GPUs, the issues of effectively decomposing and mapping irregular work to computational units is amplified
due to the hierarchical structure of the architecture and relatively small working set size of individual
threads. In this chapter we study the performance of three operations on sparse matrices, sparse matrix-
vector multiplication (SpMV), sparse matrix-sparse matrix addition (SpAdd), and sparse matrix-sparse
matrix multiplication (SpGEMM) and propose several strategies to achieve stable performance, meaning the
processing time is highly correlated with the total work and independent of the underlying structure of the
input matrices.
These computations typically form, or are isomorphic to, the building blocks to many applications involv-
ing graphs and sparse solvers therefore achieving performance on this set of operations is a key ingredient
to a large body of important problems. The primary challenge of computations involving sparse matrices
originates from workload imbalances induced by data segmentation. As an example, a binary operation on
sparse matrices, such as addition, could elicit a substantial amount of imbalance between computational
units if processed using a na¨ıve decomposition over rows or columns. More complex operations, such as
multiplication exacerbate the problem due to the data-dependent expansion of the computational work.
To address irregularity many methods have been introduced that structure memory accesses to improve
the performance of sparse matrix operations [10, 84, 86]. Arguably the most studied and optimized, in
terms of data-structures and algorithms, is SpMV. Peak SpMV performance is achieved by coupling sparsity
pattern analysis with specialized, and in some cases exotic, storage schemes tuned for a particular class of
matrices. The drawbacks of this approach are the need to perform analysis of the input matrix and the use
of storage formats that are not amenable for use in other operations. With respect to SpGEMM in Chapter 2
we introduce the ESC algorithm to distribute computational work for SpGEMM evenly between processors
64
using global memory as an intermediate buffer. This processing model is directly related to another growing
trend for expressing large scale computational workloads, MapReduce [29].
Map	   Shuffle	   Reduce	  
(a) MapReduce.
(0,0,10)	  
(1,1,20)	  
(1,2,30)	  
(1,3,40)	  
(2,3,50)	  
(3,1,60)	  
(0,0,1)	  
(1,1,2)	  
(1,3,3)	  
(2,0,4)	  
(2,1,5)	  
(3,1,6)	  
(3,3,7)	  
(0,0,10)	  
(1,3,60)	  
(1,1,40)	  
(1,1,150)	  
(1,0,120)	  
(1,3,280)	  
(1,1,240)	  
(2,3,350)	  
(2,1,300)	  
(3,3,180)	  
(3,1,120)	  
(0,0,10)	  
(1,0,120)	  
(1,1,40)	  
(1,1,150)	  
(1,1,240)	  
(1,3,60)	  
(1,3,280)	  
(2,1,300)	  
(2,3,350)	  
(3,1,120)	  
(3,3,180)	  
(0,0,10)	  
(1,0,120)	  
(1,1,430)	  
(1,3,340)	  
(2,1,300)	  
(2,3,350)	  
(3,1,120)	  
(3,3,180)	  
A	  
B	  
Compress	  Sort	  Expand	  
C	  Join	  &	  	  
Scale	  
(b) ESC SpGEMM.
Figure 4.1: Comparison of workflow between MapReduce and ESC SpGEMM algorithms.
The crucial component in the MapReduce framework is the reordering of the intermediate dataset to
group tuples by key following the map function in order to invoke the reduce function over a complete set of
duplicate tuples. Known as the shuffle or sorting phase this process is typically a limiting factor in many
MapReduce applications which may produce an intermediate set that is considerably larger than either or
all of the input instances individually. We illustrate this process in Figure 4.1a and simultaneously draw
parallels with the SpGEMM ESC algorithm in Figure 4.1b. In Chapter 3 we reduced the sorting time by
specializing the SpGEMM execution adaptively based on grouping intermediate rows into coarser subsets
according to the total number of entries on any row of the intermediate matrix and processing these subsets
hierarchically (without global memory communication) as depicted in Figure 4.2.
For a particular set of matrices a specialized algorithm-storage pair may yield asymptotically optimal
performance but the performance of a highly tuned pair may be invalid or exhibit unpredictable behavior
when applied to general matrices. We consider these approaches as segmentation aware since they incorpo-
rate matrix structure into the implementation. In this chapter we present optimizations for sparse matrix
computations that strive to maintain a balanced decomposition of work, as in MapReduce routines, while
improving the performance using segmented processing, as in adaptive histogram approaches. The work
presented in this chapter was completed in close collaboration with the developments on optimized routines
implemented in the Moderngpu library [8].
65
0 10 20 30 40 50 60 70
Row Size
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
R
ow
(a) Natural.
0 10 20 30 40 50 60 70
Row Size
1
8
12
24
26
6
29
21
18
14
13
22
7
4
30
10
3
11
5
27
23
28
32
31
25
17
9
2
20
19
15
16
R
ow
(b) A reordered.
0 10 20 30 40 50 60 70
Row Size
3
29
5
6
24
23
21
18
14
17
30
1
12
20
19
16
15
13
7
28
32
22
8
26
27
11
10
9
25
4
2
31
R
ow
(c) C reordered.
0 10 20 30 40 50 60 70
Row Size
1
12
14
6
29
21
24
8
18
30
3
5
26
7
22
23
11
13
4
27
10
17
20
2
32
31
28
19
15
9
25
16
R
ow
(d) Cˆ reordered.
Figure 4.2: Row-wise work in example matrix under the application of various permutations based on
the number of entries on each row in A, C, and Cˆ.
4.1 Sparse Operations
Predicting the performance of segmentation oriented sparse matrix operations is a difficult task because
of the variations in the input operands A and B that may elicit an unknown amount of irregular data
movement during exeuction. Recently, optimal lower-bounds were established [5] on the communication
required to perform SpGEMM on distributed systems predicated on the assumptions that the nonzeros are
distributed according to a random Erdo¨s-Re´nyi graph. A more basic model of representing SpGEMM as a
set of bipartite graphs from which the structure of the output operand, C, could be estimated based upon
the reachability of vertices [24].
To achieve stable performance in our implementation we avoid the use of segmentation aware algorithms
or specialized data structures. Our approach places load-balancing first and performs segmented reductions
66
progressively by introducing intermediate work. To achieve this we study balanced decompositions of each
sparse matrix workload and perform processing at the granularity of nonzero entries during all phases. In the
case of SpAdd and SpGEMM our processing scheme builds on highly regular merge-based sorting routines
to partition work into smaller units of roughly equal size for parallel processing. In contrast with traditional
metrics of interest, such as giga-floating point operations per second (GFLOPs/s), our method is not designed
to achieve the best absolute performance compared to other segmentation aware algorithms; indeed we show
the performance of our approach may be notably lower than other methods in some cases. However, we show
that our method is easily predictable for a variety of diverse inputs. This is important since the simplest
operation we consider, SpMV, involves a single sparse matrix yet is quite challenging to analyze in general,
therefore higher order binary operations involving sparse matrix pairs is significantly more complex.
Given two matrices A ∈ Rm×k, B ∈ Rk×n, and a vector x ∈ Rk×1 then SpMV, SpAdd, and SpGEMM
compute Ax, A+B, AB, respectively. In contrast with general (dense) matrix-matrix operations, A and B are
assumed to be sparse. Typical storage formats include coordinate (COO), where each nonzero is represented
as a (row, column, value) tuple, and compressed sparse formats that sort and compress the COO format
along either the row (CSR) or column (CSC) axis and provides offsets to the first entry in each segment.
We consider the following sparse matrices in our algorithmic descriptions for Sections 4.2,‘4.3 and 4.4,
A =

10 0 0 0
0 20 30 40
0 0 0 50
0 60 0 0

and B =

1 0 0 0
0 2 0 3
4 5 0 0
0 6 0 7

,
with tuple form given by
A =

(0, 0, 10)
(1, 1, 20)
(1, 2, 30)
(1, 3, 40)
(2, 3, 50)
(3, 1, 60)

and B =

(0, 0, 1)
(1, 1, 2)
(1, 3, 3)
(2, 0, 4)
(2, 1, 5)
(3, 1, 6)
(3, 3, 7)

.
4.2 SpMV
67
For SpMV we consider Ax = y, where A is in CSR format, and the total work is two times the number of
nonzeros in A, denoted as nnz(A). The obvious parallelization, known as scalar CSR, is to assign one thread
to each row of A and process each row independently. Unfortunately this work decomposition may exhibit
a large amount of imbalance between threads if rows of A vary significantly. One solution is to vectorize the
row-wise processing and enlist a fixed number of threads within a warp or CTA to process each row of the A.
This scheme is advantageous when: the average number of entries per row does not vary significantly, each
row contains a relatively large number of entries relative to the number of threads, and there are enough
rows to fully saturate the device. However, it may be impossible to globally select a static number of threads
appropriate for every row of A.
We avoid the irregularity of row-wise processing by assigning a fixed number of nonzeros, Nprods, per
CTA. To implement this strategy we decompose the SpMV kernel into three phases: partition, reduction,
and update. Although a flat nonzero decomposition is perfectly balanced row-wise segmentation is required
to enact the reduction correctly. Processing the matrices in COO format is one alternative but requires the
additional storage and movement of one row entry per nonzero. We address this issue by first introducing a
partitioning phase that computes the row-wise limits of the entries processed per CTA. For nnz(A) entries
we process the products using m = dnnz(A)/Nprodse CTAs. During the partitioning phase we enlist m
threads to cooperatively perform one binary search per CTA segment to locate the last offset in the row
offsets array processed by each CTA and storing these offsets in an auxillary global memory buffer, S.
After partitioning the rows the reduction phase launches m CTAs to reduce products within each segment
of A. A given CTA, i, processes entries in the range [i ∗Nprods,min(nnz(A), (i+ 1) ∗Nprods)] and row offsets
corresponding to S[i, i+1] are loaded into shared memory to generate the expanded row indices corresponding
to each nonzero. To compute the reduction we first load the column indices and values of A, in strided order to
reduce the penalty of global memory operations, into register. Then we dereference entries from x according
to the column indices and compute the scaled products. The products are transposed from strided to blocked
order using shared memory and a CTA wide segmented scan is performed. On encountering discontinuities
in the row indices partial sums are stored to y and the CTA remainder, corresponding to the last row, is
stored to a global memory buffer, r.
Lastly, during the update phase we accumulate the inter-CTA updates by performing a segmented scan
on r and augmenting the first reduction for each CTA with the carry out value from the previous CTA. This
approach exposes parallelism without regard for segment geometry allowing it to scale over a wide range of
sparsity patterns. We statically tune the number of entries per thread empirically to maximize throughput.
68
4.3 Addition
For SpAdd we consider the total work to form C as the sum of the number of nonzero entries in A and
B, nnz(A) + nnz(B). Two approaches for computing SpAdd are row-wise segmented reductions and global
sorting. In the row-oriented scheme each row of C is processed by a thread group. Though the decomposition
of work is simple and processing any output row i requires O(nnz(Arowi)+nnz(Browi)) work, it is challenging
to chose the number of cooperative threads per row. Conversely, if we utilize a global sorting approach the
total work to generate C is considered collectively as a single monolithic unit by combining nonzero entries
from both A and B into a intermediate matrix, Tˆ . By lexicographically sorting Tˆ by (row, column) tuples
all duplicate entries are adjacent and performing a simple tuple-wise reduction generates C. Utilizing the
stability and performance of radix sort primitives to first sort by column then row ensures no threads
are idle during each phase of processing and the impact of imbalance due to the irregularity is negligible.
Though not susceptible to penalties related to irregularity of A or B the complexity of globally sorting Tˆ is
O(k(nnz(A)+nnz(B)) which is k times more expensive than the accumulated cost of the row-wise approach,
O(nnz(A) + nnz(B)).
Ideally a method to construct C should mitigate the impact of load imbalance while attaining near optimal
work complexity. To achieve this requires a decomposition of work into distinct sets of non-overlapping
ranges from A and B such that each thread processes a unique set of entries in C, thus corresponding to
O(nnz(A) + nnz(B)) work complexity. Individual threads must be mapped onto a unique range of entries
in C and tasked with independently determining the set of entries constituting each row. We observe that
when both input matrices are sorted by row and row-wise internally sorted by column then sparse matrix
addition may be formulated as a more general operation requiring the union of sets. Considering the natural
comparison of tuples outlined in Algorithm 11 and the ordering of the input matrices then merging entries
appears to be a natural fit for SpAdd since it exposes both maximal and uniform parallelism independent
of the non-uniform nature of individual rows in A or B.
Algorithm 11: Tuple Ordering
parameters: T1=(row1,col1), T2=(row2,col2)
return: MIN(T1,T2)
if row1 < row2
return T1
if row2 < row1
return T2
if col1 <= col2
return T1
return T2
69
Recently merge path has been introduced as a efficient means of improving the performance of merge
based sorting algorithms, however, we forgo a detailed description of the merge path algorithm and provide
only a short introduction (e.g., see [39]). Merge path decomposes the work of merging two sorted list
by performing a binary search along the diagonals formed by orienting the lists on the x and y axes as
shown in Figure 4.3a. The ordering of entries along diagonals prescribe a partition of both lists into non-
overlapping regions of uniform size that can be processed in parallel by individual threads. Though SpAdd
appears merge-like, parallel merge path partitioning is inadequate. Each consecutive range of matching
(row, column) tuples must be available to the same thread and therefore always appear on the same side of
any diagonal. We therefore utilize balanced path partitioning, illustrated in Figure 4.3, which extends the
merge path decomposition to partition the inputs according to this additional constraint [8].
We describe the balanced path methodology by considering two sorted lists, A and B, consisting of
duplicate keys in place of sparse matrix tuples. The normal merge path implementation consumes all
duplicate keys in A before matching entries in B therefore constructing a merged result with duplicate entries.
In contrast balanced path partitioning assigns a rank to duplicate keys within each range of contiguous
duplicate occurrences of the input sequences. This enables the assignment of keys with matching ranks to
the same partition for serial merging within individual threads therefore ensuring the key-rank pairs emitted
by each thread during construction of the result is a unique non-duplicated member of the output list. In
Figure 4.3b the stair-step pattern of balanced path snaking its way through keys for two sorted lists highlights
the unique differences between the decompositions where we see a starred diagonal indicating a translation
of thread t1’s partition to include a matching key from B.
0 1 2 3 4 5 6
0
1
2
3
4
5
6
c
c
c
c
d
f
B
a b c c c e A
a b c c c c
c
c
c
d
e
f
t 0
t 1
t 2 t 3
(a) Merge path
0 1 2 3 4 5 6
0
1
2
3
4
5
6
c0
c1
c2
c3
d0
f0
B
a0 b0 c0 c1 c2 e0 A
a0 b0 c0 c0
c1 c1
c2 c2
c3
d0
e0 f0
t 0
t 1
t 2 t 3
(b) Balanced path
Figure 4.3: Comparison of decision paths for merge path and balanced path given two sorted sets A
and B having six elements each. The paths, and thus the outputs, are evenly partitioned among four
threads. For balanced path, the partition boundary between thread t0 and t1 is starred so that zipped
pairs are never split across threads.
70
As in merge path the balanced path diagonals begin at fixed intervals. However, the intervals may shift
to consume one additional or fewer elements than the prescribed number processed per thread to ensure
both elements of a matched key-rank pair are found on the same side of a diagonal partition. Diagonals,
mapping to entries (i, diag − i) in A and B, that need to be extended to include an additional element from
B, corresponding to a matched, pair are starred. The augmented starred balanced path diagonal now maps
to matched entries (i, diag − i+ diag?) in A and B therefore stealing one entry from the right partition and
depositing it in the left partition.
Note that performing this key-rank decomposition allows the implementation of not only unions but
many other set operations, such as intersection, difference and symmetric difference [8]. For SpAdd we focus
on the union of sets therefore the output list is comprised of only zeroth ranked entries from A and B. In
Figure 4.4 we illustrate the performance of our set oriented union operation on various array sizes and data
types. Using our balanced path approach we perfectly decompose the work required to identify and reduce
duplicates that enables throughput-oriented parallel processing at the granularity of individual keys from
the input sequences.
With respect to SpAdd we reduce the global memory overhead by computing the union of the matrices
in two phases. During the first phase the unique tuples in the union of A and B are counted and space
for C is allocated. In the second balanced path invocation the entries from each matrix are loaded into
shared memory and duplicate values are reduced within the CTA. Note that although for well-formed sparse
matrices A and B there are at most two duplicates that must be reduced to form each unique entry of C
our implementation is general enough to process a much wider range of input sets consisting of an arbitrary
number of duplicates.
4.4 Multiplication
SpGEMM is substantially more challenging than SpMV and SpAdd because of the added irregularity of
the workload to dynamically expand, during formation of the products, and contract, during reduction of
duplicates. In the context of GPUs it is imperative that any proposed SpGEMM method attempt to mitigate
the adverse affects of this induced load imbalance. To illustrate the possible degree of load imbalance
encountered during the SpGEMM operation for the matrix shown in Figure 4.5a we plot the number of
products-per-nonzero in Figure 4.5b for computing A2.
We consider the total work as a function of the number of products, Nprods, since it is a measure
of the total number of global memory operations necessary to fetch data all the required data from the
71
104 105 106 107
Number of inputs
1000
2000
3000
4000
5000
6000
7000
In
p
u
ts
p
ro
ce
ss
ed
/s
(1
06
)
keys-32
keys-64
pairs-32
pairs-64
Figure 4.4: Performance of union operation on sorted sets. Entries per input array are divided evenly.
(a) Structure of example matrix, A.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
(b) Number of products per nonzero of A2.
Figure 4.5: Example matrix, A, and resulting structure of A2.
input matrices and it also describes the number of insertions/reductions required to form C. By far the
identification and reduction of duplicates during the contraction phase is the most challenging phase of
the computation and forms the primary performance limiting routine in many SpGEMM implementations.
Intuitively this reduction may be expressed in a segmented manner by processing each row individually.
Unfortunately this approach is susceptible to performance variations associated with the number of products
and duplicates per row of the intermediate matrix.
To combat the irregularity of contracting the intermediate matrix we propose a balanced approach based
on merge path partitioning of the total number products and split the processing into separate but stable
72

(0, 0, χ)
(1, 3, χ)
(1, 1, χ)
(1, 1, χ)
(1, 0, χ)
(1, 3, χ)
(1, 1, χ)
(2, 3, χ)
(2, 1, χ)
(3, 3, χ)
(3, 1, χ)

(a)
=

(0, 0, χ)
(1, 3, χ)
(1, 1, χ)
(1, 1, χ)
(1, 0, χ)
(1, 3, χ)


(1, 1, χ)
(2, 3, χ)
(2, 1, χ)
(3, 3, χ)
(3, 1, χ)

(b)
=

(0, 0, χ)
(1, 0, χ)
(1, 1, χ)
(1, 1, χ)
(1, 3, χ)
(1, 3, χ)


(1, 1, χ)
(2, 1, χ)
(3, 1, χ)
(2, 3, χ)
(3, 3, χ)

(c)
=

(0, 0, χ)
(1, 0, χ)
(1, 1, χ)
(1, 3, χ)


(1, 1, χ)
(2, 1, χ)
(3, 1, χ)
(2, 3, χ)
(3, 3, χ)

(d)
=

(0, 0, χ)
(1, 0, χ)
(1, 1, χ)
(1, 3, χ)
(1, 1, χ)
(2, 1, χ)
(3, 1, χ)
(2, 3, χ)
(3, 3, χ)

(e)
=

(0, 0, 10)
(1, 0, 120)
(1, 1, 190)
(1, 1, 240)
(1, 3, 340)
(2, 1, 300)
(2, 3, 350)
(3, 1, 120)
(3, 3, 180)

(f)
=

(0, 0, 10)
(1, 0, 120)
(1, 1, 430)
(2, 1, 300)
(3, 1, 120)
(1, 3, 340)
(2, 3, 350)
(3, 3, 180)

(g)
Figure 4.6: All of the intermediate indices are formed, (a), and partitioned into subsets of approximately
equal size, (b). χ represents unformed products corresponding to each intermediate entry. Each subset
is sorted by column index independently, (c), and duplicate entries are reduced, (d). All the subsets are
aggregated into a intermediate matrix that contains duplicates but is not sorted by row or column, (e).
The reduced matrix is sorted, (f), the products of the reduced entries are computed and the remaining
duplicates are reduced to form C, (g).
sorting phases, the first phase performed in shared memory and a second phase performed in global memory.
To partition the products we first specify the number of products that may be processed per CTA, NCTA,
then we compute the prefix-sum, S, of the size of each row of B, nnz(Browi), referenced by the column
indices of A. Following this preprocessing phase
⌈
Nprods
NCTA
⌉
CTAs are launched to independently expand and
sort all Nprods products. Each CTA locates the range of products it is assigned by performing a binary
search on S to locate the first corresponding nonzero entry from A greater than or equal to its CTA index
times NCTA. By decomposing A at the granularity of products our approach balances the work perfectly
across CTAs irrespective of the row-wise reductions dictated by the input matrices.
The first sorting phase reduces duplicates within each CTA using radix sort. Our key observation is
that because we expand entries in order according to column entries of A then contiguous entries within a
CTA remain ordered by row and duplicates are in adjacent locations following a single radix sort on the
column indices as illustrated in Figure 4.6, this is in contrast to the two-phase sorting approach described
in Chapter 2. Figure 4.7 illustrates the performance improvement achieved by reducing the total number
of radix sort operations within a CTA. We benchmark the performance using the radix sort routine in
the open-source CUB programming framework [57] for 32-bit data types with 128 threads per CTA and
11 entries per thread. For key-value pairs we observe that performing a single radix sort pass reduces the
number clock cycles by approximately a factor of two yielding notable performance improvement.
In Figure 4.7 we also compare the performance of radix sort when the number of sorted bits are reduced
from 28 to 12 bits. Based on this observation we aggressively optimize our sorting implementation to
minimize the processing time by exploiting the limited range of the column indices and sorting the minimum
73
2P
-P
ai
rs
1P
-P
ai
rs
1P
-K
ey
s
1P
(2
8-
bi
ts
)
1P
(2
4-
bi
ts
)
1P
(2
0-
bi
ts
)
1P
(1
6-
bi
ts
)
1P
(1
2-
bi
ts
)
Sorting method
1
2
3
4
5
C
lo
ck
cy
cl
es
(1
04
)
Figure 4.7: Clock cycles per CTA radix sorting operations for two-pass (2P) key-value pairs, one-pass
(1P) key-value pairs, and one-pass keys-only routines as the number of sorting bits are decreased.
number of bits required by computing dlog2(n)e, the number of columns in B. We further optimize our
performance by embedding the permutation bits in the unused upper range of the column indices when
possible. Using this approach we reduce the number of shared memory operations and perform a keys-only
CTA radix sort. The permutation computed by each CTA is then stored to global memory using 16-bit
integers to minimize the memory traffic and storage overhead. The first phase concludes with each CTA
scanning the sorted (row, column) pairs to identify unique entries and asynchronously storing the reduced
set to global memory. Note that following the first phase no products have been formed and the reduced
intermediate entries in global memory consist of possible duplicates that are completely unordered. To
compute the unique duplicates in C we perform a two pass global radix sort routine similar to the method
outline by the ESC algorithm. The notable difference is that the global memory sorting routine computes
only the permutation that sorts the pairs and does not perform any reordering of the, currently unformed,
intermediate products.
In the third phase the reduced intermediate products are formed by performing the expansion phase
a second time. In this version each CTA loads the permutation computed during the first expansion and
forms the intermediate products by loading the referenced values from the input matrices and permuting the
values through shared memory into sorted order. A shared memory segmented scan is performed to reduce
to ordered products according to duplicate indicator flags embedded in the permutation array. To reduce
the number of global memory load and store operations the reduced values are not stored randomly back
to global memory. The output location that orders the reduced products according to global duplicates,
74
computed during the second phase, is used to store entries in sorted order. The final step consists of forming
the values in C by performing a global memory reduce-by-key operation on the ordered products.
A key feature of this two-level decomposition is that both the CTA-wide and global memory sorting
operations are stable with respect to the irregularity of the underlying input matrices. By preprocessing
many of the duplicates in shared memory prior to the global memory pass the total work required to perform
the expensive two-pass radix sort operation is significantly reduced in cases where each block yields only a
small number of unique pairs, i.e., a large number of duplicate entries per CTA.
4.5 Numerical Results
In this section we evaluate the performance of our parallel processing strategy and compare our results
with two implementations for processing sparse matrix operations on GPUs, a Thrust-based version, in the
case of SpAdd and SpGEMM, and the hand tuned implementation a hybrid format [10] in the case of SpMV
along with equivalent routines expose by Cusparse, a closed source package for processing sparse matrix
operations on GPUs provided by Nvidia [64]. Our evaluation is performed on a set of SpMV matrices taken
from the University of Florida (UFL) sparse matrix collection outlined in Table 4.2 and the configuration of
our testing environment is described in Table 4.1. All of the performance data was collected using double
precision arithmetic with the input matrices resident in GPU memory.
CPU i7-3820 CPU 3.60GHz
GPU GTX Titan 0.88GHz
CUDA 6.5
GCC 4.8 -O3
Table 4.1: System configuration settings. On the GPU error checking and correction (ECC) is disabled.
4.5.1 SpMV
In Figure 4.8 we compare the SpMV performance of three different implementations operating on an input
matrix stored in CSR format. Although the performance for some matrices is substantially higher using
specialized storage formats our goal is to compare the achieved performance using a general and standard
storage format. For each matrix the performance reported is the average of 100 iterations. The reference
SpMV implementation uses the vectorized CSR implementation discussed in Section 4.2.
From Figure 4.8 we see that the relative performance between the three implementations varies with
respect to each matrix. We note that the performance of the our flat decomposition scheme remains com-
75
Matrix rows columns nnz avg/row
Dense 2000 2000 4 000 000 2000.0
Protein 36 417 36 417 4 344 765 119.31
Spheres 83 334 83 334 6 010 480 72.13
Cantilever 62 451 62 451 4 007 383 64.17
Wind Tunnel 217 918 217 918 11 634 424 53.39
Harbor 46 835 46 835 2 374 001 50.60
QCD 49 152 49 152 1 916 928 39.00
Ship 140 874 140 874 7 813 404 55.40
Economics 206 500 206 500 1 273 389 6.17
Epidemiology 525 825 525 825 2 100 225 3.99
Accelerator 121 192 121 192 2 624 331 21.65
Circuit 170 998 170 998 958 936 5.62
Webbase 1 000 005 1 000 005 3 105 536 3.11
LP 4284 1 092 610 11 279 748 2832.9
Table 4.2: Unstructured matrices taken from the UFL sparse matrix collection for performance testing.
D
en
se
Pr
ot
ein
Sp
he
re
s
Ca
nt
ile
ve
r
W
in
d
H
ar
bo
r
Q
CD Sh
ip
Ec
on
om
ics
Ep
id
em
io
lo
gy
A
cc
ele
ra
to
r
Ci
rc
ui
t
W
eb
ba
se LP
0
5
10
15
20
25
30
35
G
F
L
O
P
s/
s
Vec Cusparse Merge
Figure 4.8: Sparse matrix-vector performance, measured in GFLOPs/s, comparison for CSR format in
double precision for three implementation using vectorized CSR, Cusparse, and Merge.
petitive with the best performance of the other implementations in all tests except the Dense matrix. For
challenging matrices that exhibit a large degree of irregularity in the number entries per row, Webbase and
LP, we see that our flat decomposition achieves markedly better performance compared to the other imple-
mentations. To illustrate the connection between the required work, proportional to nnz(A) for SpMV, and
the performance we plot the processing time versus nnz(A) for each matrix in Figure 4.9. From Figure 4.9
we see that the performance of the Merge approach is highly correlated, with a correlation coefficient of
0.97, with number the number of nonzeros in each matrix and exhibiting only small perturbations from the
least-squares fit of the data points.
76
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
Nonzeros (106)
−0.5
0.0
0.5
1.0
1.5
T
im
e
(m
s)
ρMerge = 0.97
ρCusparse = 0.84
Merge Cusparse
Figure 4.9: Comparison of SpMV performance to nnz(A). We use the correlation coefficient , ρ, as a
measure of the performance predictability as a function of nnz(A). For our flat decomposition ρ is 0.97
indicating that predicting the SpMV performance of untested matrices should be relatively accurate.
4.5.2 SpAdd
In Figure 4.10 we compare SpAdd, A+A, performance for the previously mentioned packages. The perfor-
mance is computed as the average speedup over 100 iterations with respect to the sequential implementation
using CSR format on the CPU. We note that the storage format for these tests are different between the
packages. The Thrust and Merge approaches operate on the input matrices stored in expanded COO for-
mat to facilitate processing at the granularity of individual nonzero entries while Cusparse operates on the
matrices in CSR format. The Thrust implementation uses the two-pass global sorting approach described
in Section 4.3.
In contrast with SpMV we see that both Cusparse and Merge perform significantly better than the Thrust
version on all matrices in the test suite. Another notable difference is the significant performance improve-
ment of Cusparse over the Merge implementation for the Dense, Protein, and Wind matrices. However,
for the remaining matrices in the test suite the performance of the two methods are comparable with the
exception of the Webbase and LP matrices.
In Figure 4.11 we again plot the performance versus the total work to understand the correlation between
the required work and the total processing time. The merge based approach is perfectly correlated with
the number of nonzeros in the input matrices. This is expected since the approach is based on parallel
decompositions that yield perfect balance irrespective of the segmentation of the underlying data. In contrast
the Cusparse implementation seems to diverge in an unpredictable manner from the least-squares fit for
certain matrices. Note that for two of the largest SpAdd instances Cusparse achieves speedup for one and
77
D
en
se
Pr
ot
ein
Sp
he
re
s
Ca
nt
ile
ve
r
W
in
d
H
ar
bo
r
Q
CD Sh
ip
Ec
on
om
ics
Ep
id
em
io
lo
gy
A
cc
ele
ra
to
r
Ci
rc
ui
t
W
eb
ba
se LP
0
5
10
15
20
25
S
p
ee
d
u
p
Thrust Cusparse Merge
Figure 4.10: SpAdd performance, measured in speedup versus sequential CPU implementation, for
Thrust(COO), Cusparse(CSR), and Merge(COO).
dramatic slowdown for the other compared to the merge based implementation.
0.0 0.5 1.0 1.5 2.0 2.5
2 × Nonzeros (106)
0
5
10
15
20
25
30
35
T
im
e
(m
s)
ρMerge = 1.0
ρCusparse = 0.68
Merge Cusparse
Figure 4.11: Comparison of SpAdd performance to the number of nonzeros. For the merge implemen-
tation our correlation coefficient of 1 implies a nearly perfect match between the work and processing
time irrespective of the underlying structure of the input matrices.
4.5.3 SpGEMM
In Figure 4.12 we compare SpGEMM, A2, performance for all three implementations. In the special case
of the nonsquare LP matrix we transpose the matrix and perform AAT . The performance is tabulated in
terms of the average speedup for 10 iterations versus the sequential CPU implementation in CSR format.
78
The global ESC performance is inherently independent of the input matrices however the absolute execution
time is high because of global memory operations involving the total number of products, specifically radix
sort. For many of the matrices the Cusparse and Merge approaches outperform ESC by a significant margin.
However, the performance of Cusparse degrades for the Economics, Circuit, Webbase, and LP matrices. In
contrast the Merge approach sustains performance improvement compared to ESC in all instances.
D
en
se
Pr
ot
ein
Sp
he
re
s
Ca
nt
ile
ve
r
W
in
d
H
ar
bo
r
Q
CD Sh
ip
Ec
on
om
ics
Ep
id
em
io
lo
gy
A
cc
ele
ra
to
r
Ci
rc
ui
t
W
eb
ba
se LP
0
1
2
3
4
5
6
7
8
S
p
ee
d
u
p
ESC Cusparse Merge
Figure 4.12: SpGEMM performance, measured in speedup versus sequential CPU implementation, for
double precision for ESC(COO), Cusparse(CSR), and Merge(CSR).
The weakness of stable SpGEMM methods operating on intermediate products is also illustrated in
Figure 4.12 where both the ESC and Merge approaches required more physical memory than the resource
constrained GPU could support. For the Dense test case the number duplicates per CTA is close to zero
therefore the second sorting phase, corresponding to the global memory pass, requires a significant amount
of temporary storage to order the expanded entries. For matrices with a relatively dense number of entries
per row segmented processing presents a notable advantage. Although both the Webbase and LP matrices
also have rows containing a relatively large number of entries the power-law distribution of the row lengths
in the case of Webbase and the small number of rows in LP, since it is computing AAT , ensures that on
average many of the CTAs contract a substantially large number of intermediate entries during the first
sorting pass.
In Figure 4.13 we plot the performance of the Merge and Cusparse implementations versus the total
number of products required to form C. Note that this analysis ignores the FLOPs performed to reduce
duplicate entries to form C. From Figure 4.13 we see that the performance of the two-level sorting approach
based on merge path decompositions is highly correlated with the total number of products. This regularity
implies that the performance of the Merge approach is easily predictable based on simple analysis of the
79
0 10 20 30 40 50 60 70
Number of products (106)
0
100
200
300
400
500
600
T
im
e
(m
s)
ρMerge = 0.98
(a) Merge
0 10 20 30 40 50 60 70
Number of products (106)
0
1000
2000
3000
4000
5000
T
im
e
(m
s)
ρCusparse = −0.02
(b) Cusparse
Figure 4.13: Comparison of SpGEMM performance to the number of products. Flat two-level sorting
results in performance that is highly correlated, ρ = 0.98, with the work expressed as a function of
number of products in the intermediate matrix.
0 10 20 30 40 50 60 70 80 90 100
Percent Time
Protein
Spheres
Cantilever
Wind
Harbor
QCD
Ship
Economics
Epidemiology
Accelerator
Circuit
Webbase
LP
M
at
ri
x
L
ab
el
Setup
Block Sort
Product Compute
Global Sort
Product Reduce
Other
449.36
393.97
215.28
467.49
125.26
74.29
351.87
25.23
24.00
99.64
21.59
184.47
96.26
T
ot
al
T
im
e
(m
s)
Figure 4.14: SpGEMM performance breakdown.
input matrices prior to execution of the operation. The observed Merge regularity is in contrast with
Cusparse which diverges considerably from the anticipated performance in two cases. We do not claim the
Cusparse implementation is inherently unpredictable but that the performance appears to be unrelated to
our interpretation of the required work, the number of products.
In Figure 4.14 we decompose the performance of the Merge SpGEMM performance into several phases.
During the Setup phase the number of entries on each row of B referenced by each column of A is scanned to
compute the number of products required for each nonzero entry in A. During Block Sort the products are
decomposed into a fixed number of entries and processed using a single radix sort pass within each CTA and
80
the resulting permutation is stored to global memory. The Global Sort phase organizes the reduced entries
identified by each CTA using two-radix passes and the Product Compute phase combines the local CTA
permutation with the global duplicate permutation to form the ordered reduced products in global memory.
In the last phase the final products are formed by performing a reduce-by-key operation on duplicates
in global memory. Additional overheads in terms of allocation and miscellaneous memory operations are
aggregated into Other.
From Figure 4.14 it is clear that the two sorting passes and the computation of the output products
constitute the bulk of the processing time for each matrix. Although there are significant variations in the
total relative time of each operation based on the matrix it is important to consider these variations with
respect to the total processing time, shown on the right axis. In cases where the total processing time is
small the variations are much larger than the instances with longer execution times. For matrices requiring
hundreds is milliseconds the relative time per phase is more stable.
4.6 Conclusion
By processing the entries at the granularity of individual units of work we have shown that flat partitioning
schemes provide a effective means of expressing irregular computations. Although the specific techniques
employed to perform each task varied substantially the critical focus of our study was the flat decomposition of
work irrespective of the underlying segmentation dictated by the input operands. Note that our method does
not achieve the best performance in all cases for any of our tests. Indeed, we expect when comparing any two
algorithms processing such irregular computations that subtle data-dependent performance characteristics
may have a large impact on the overall processing time. Our primary focus in this chapter is achieving
stable and easily predictable performance across a vast landscape of input instances. Given the extremely
large and varied structure of sparse matrices it is impossible to fully describe in a finite dataset the intricate
performance considerations of these complex operations. Our proposed method reduces these considerations
to their most simple form with only small variations associated with specific problem instances. Our approach
therefore represents a balance between performance and robustness to express sparse matrix reductions in a
stable manner amenable to prediction using simple linear models.
In future work we plan to address the deficiencies of stable SpGEMM methods by adaptively introducing
segmented approaches when necessary. Detecting specific cases, such as the Dense matrix, is relatively
simple but would also require a more detailed model to accurately predict the trade-off between the number
of entries processed in the segmented and stable regions of the algorithm.
81
Chapter 5
Graph Partitioning
Graph partitioning plays a prominent role in many scientific and information simulations. Graphs emerge
in many areas of science and engineering and partitioning is a central component in applications such as
VLSI design [74] and parallel computations [14]. The goal is to categorize the vertices of a graph into a
discrete number of disjoint sets of approximately equal size while simultaneously minimizing a cut metric
related to the connectivity between sets. Graph partitioning is known to be NP-complete [15], which has
motivated the development of many heuristics to quickly approximate reasonably good partitions. As a
testament to the ubiquity and importance of graph partitioning there is a considerable and extensive body
of literature concerning the subject [31, 47, 50, 69, 73]. The methods may be broadly decomposed into three
regimes: combinatorial, spectral, and multicommodity flow. We focus primarily on spectral methods in this
work because of their effectiveness, parallelism, and relative simplicity.
Spectral methods pose the graph partitioning problem as a system of linear equations, L, that alge-
braically represent the underlying structure of a given graph G. The eigenvalues of L approximate the
algebraic connectivity of G and the corresponding eigenvectors decompose the vertices into distinct non-
overlapping subsets. Though these methods are effective, the expense of computing eigenvectors of large,
sparse linear systems limits the utility. The major performance bottleneck stems from computing the eigen-
vector(s) associated with the smallest eigenvalue(s) of L for which traditional methods, such as Lanczos and
power iteration, typically require a substantial number of iterations.
In this chapter we present a method for efficient spectral partitioning on emerging throughput-oriented
architectures — e.g. GPUs. We construct a highly efficient implementation designed to leverage the high
levels of parallelism available on GPUs through several, well-understood operations on sparse data-structures:
breadth-first search (BFS) and sparse matrix-vector multiplication (SpMV). In combination, the resulting
algorithm exhibits both high throughput and a improved cut quality across a range of matrix applications
in comparison to classical spectral partitioning.
82
5.1 Background
Given a graph G = (V,E) with |V | = n vertices and |E| = m edges, the graph partitioning problem
is the decomposition of V into k subsets, V1, V2, . . . , Vk, such that Vi ∩ Vj = ∅ for i 6= j, |Vi| = n/k,⋂k
i=1 Vi = V , and the number of edges in E whose incident vertices belong to different subsets is minimized.
Current approaches to parallel graph partitioning predominantly rely on effective coarsening schemes to
reduce global search costs coupled with localized iterative updating methods to improve quality; this results
in methods that are both fast and effective.
One of the earliest and most well-known graph partitioning heuristics are spectral methods [32, 69]. In
spectral partitioning the spectrum of, L, a matrix projection of G, is used to decompose V into disjoint sets.
In contrast with combinatorial approaches [50], basing the decomposition strategy on algebraic characteristics
of the underlying graph allows spectral methods to achieve provably bounded cuts on planar graphs and
well-formed meshes such as those typically used in finite-element methods [78]. However, eigensolvers require
a large volume of computation which can lead to relatively low performance compared with combinational
approaches [47].
The computational dependency of eigensolvers necessitates both algorithmically and architecturally effi-
cient linear algebra operations to achieve notable performance. With the continued increase in the number
floating point operations that can be performed per memory operation it is well-known that the key to
performance is the utilization of many more FLOPs per fetched unit of memory, measured in terms of band-
width. However operations, such as SpMV, that lack the necessary FLOPs to ameliorate the memory access
costs are limited by the peak memory bandwidth achievable by the underlying architecture.
GPUs are well-suited for memory bound kernels since they achieve significantly higher bandwidth than
traditional CPU architectures. In contrast to CPU architectures, modern GPUs target parallel workloads
that emphasize task throughput. Therefore, harnessing the computational resources of the such processors
requires programmers to decompose algorithms into thousands or tens of thousands of separate, fine-grained
threads of execution. Recent studies show that GPUs achieve high throughput in processing operations on
sparse data-dependent and graph data structures [10, 11, 27, 42, 60]. However, achieving peak performance
typically depends on both algorithmic and data reorganization strategies to reduce highly irregular and
unaligned access to data which fail to fully utilize the GPU memory subsystem.
Previous approaches to graph partitioning on a GPU [2] reduce the costly operation of globally searching
for an initial partition by using a multilevel approach coupled with an optimized GPU coarsening imple-
mentation. This is then followed by computation of an initial partition and iterative refinement during
83
64
5
1
2
3
2 −1 −1
−1 3 −1 −1
−1 2 −1
−1 3 −1 −1
−1 −1 −1 3
−1 1


−0.4
−0.3
−0.1
0.2
−0.2
0.8


6
4
5
1
2
3
R6×6
projection
xF
projection
Compute
LxF = λ2xF
Combinatorial
Figure 5.1: The projection of G, into matrix representation L is followed by the computation of the
eigenvector, xF , corresponding to the smallest positive eigenvalue, λ2. Vector xF is projected into the
space of feasible partitions to decompose graph G into 2 subsets.
uncoarsening. Our proposed method computes an initial spectral based partition by combining two global
search methods. First, a high-throughput graph growing method based on generic BFS is used to compute a
low-quality initial partition. Finally, the BFS initial partition is used as the seed of our Lanczos eigensolver
which is shown to positively impact the convergence rate.
5.2 Spectral Partitioning
Let G = (V,E) be an undirected, unweighted, simple and connected graph, i.e., vertices in G do not have
self-edges or multiple edges between any unique pair of vertices. The Laplacian matrix, L, of G is defined
as the n × n symmetric, positive semidefinite matrix containing one row and column for each vertex with
values
Lij =

dvi if i = j
−1 if i 6= j and (i, j) ∈ E
0 otherwise.
(5.1)
for i, j = 1, . . . , n, where dvi is the degree of vertex vi.
84
We denote the space of feasible partitions as B = {x ∈ Z |xi = 1 or −1}n. For an arbitrary x ∈ B, if
vertex vi is in partition P1 then xi = 1, while xj = −1 if vj is in the second partition. For a given partition,
a cut is defined as any edge with each vertex belonging to a different partitions. Therefore the cut size, C,
associated with x is expressed as (cf. (5.1))
C = xTLx =
1
4
∑
(i,j)∈E
(xi − xj)2. (5.2)
By relaxing the discrete constraints on x from B to R and using the result from (5.2) an approximation
to the minimum cut is
min
x∈B
C = min
x∈B
1
4
xTLx ≥ 1
4
min
x∈Rn
‖x‖2=1
xTLx =
1
4
xTFLxF =
λ2
4
(5.3)
where λF denotes the smallest positive eigenvalue of L, known as the Fiedler value, and xF is the associated
eigenvector. The complete spectral partitioning process is illustrated graphically in Figure 5.1.
By a large margin the most expensive spectral partitioning phase is the computation of xF . Many
approaches have been proposed in the literature to improve the efficiency of computing eigenvectors of L. A
common family of methods for exterior eigenpair computations are Lanczos methods [72]. Given an initial
vector x(0) ∈ Rn Lanczos methods, outlined in Algorithm 12, project L onto the Krylov subspace
Km(L, x(0)) = Span{x(0), Lx(0), L2x(0), . . . , Lm−1x(0)} (5.4)
and approximate eigenpairs of L by those of T , the projected matrix, using a Rayleigh-Ritz procedure.
Though generally effective, Lanczos convergence is strongly dependent on the distribution of eigenvalues in
the spectrum of L and often requires a large number of iterations to achieve a modest level of accuracy if
they are tightly clustered.
5.3 Initial Guess
Lanczos requires the selection of an initial vector, x(0), to initialize the Krylov subspace. With no intuitive
knowledge of the eigenvector of interest a standard choice is a random vector, xR ∈ Rn, with unit 2-norm.
Algebraic construction of a vector x(0) that reasonably approximates xF is not feasible in general; indeed,
it is a challenging task to provide sharp, a priori bounds on the eigenvalues of L [55, 62, 76]. We propose
85
Algorithm 12: Lanczos
parameters: L, sparse matrix
x(0), initial vector
return: T , tridiagonal Rm×m projection of L
β(1) ← 0
v(1) ← x(0)/‖x(0)‖2
for j ← 1, 2, . . . ,m
u(j+1) ← Lv(j) {O(nnz(L)) SpMV operation}
u(j+1) ← u(j+1) − β(j)v(j−1) {v(0) = 0}
α(j) ← v(j)Tu(j+1)
u(j+1) ← u(j+1) − α(j)v(j)
β(j+1) ← ‖u(j+1)‖2
If β(j+1) = 0, stop
v(j+1) ← u(j+1)/β(j+1)
T =

α(1) β(2)
β(2) α(2)
. . .
. . .
. . . β(m)
β(m) α(m)

improving the quality of x(0) by considering the structural properties of G. The importance of x(0) to
Lanczos convergence is captured by the Kaniel-Paige convergence theory [46]. Let φ be the angle between
xF and the initial guess, namely cos(φ) = |xF ·x(0)|, and let the relative gap in the eigenvalue be denoted by
γ = (λ3 − λF )/(λmax − λ3). The Ritz approximation θF to λF after j iterations — i.e., the second smallest
eigenvector of T — is given by
λF ≤ θF ≤ λF + (λmax − λF ) tan (φ)
2
cj−1(1 + 2γ)
2 , (5.5)
where cj−1(x) is the Chebyshev polynomial of degree j − 1. Therefore the bound on the accuracy of θF in
(5.5) becomes worse as cos(φ) approaches 0 as expected — i.e., as x(0) and xF become orthogonal.
In order to exploit the high-throughput nature of the GPU we employ a combinatorial approach to
initialize x(0). To motivate this approach, consider the probability of individual edges inducing a cut and
the expected probability for a small class of well-defined graphs. From (5.2) and (5.3), the eigenvalues of L
are correlated with the number of edge cuts induced by partitioning G using the corresponding eigenvector.
Moreover, for well-shaped d-dimensional meshes spectral methods yield bisectors of size O(n1/d) [78], which
implies that the probability of a randomly chosen edge inG spanning the optimal partition of V isO(n1/d/m).
For a random partition xR, however, a random edge on average has a 50% probability of generating a cut
since the partitions of both vertices are chosen independently. Based on this observation we discard xR
86
as a viable seed for our Lanczos method and attempt to generate improved approximations with easily
computable and bounded edge cuts based on the structure of G.
Exploiting graph structure has been the basis of many efficient, O(m), partitioning strategies known
collectively as graph growing (GG) methods [22]. These methods decompose V based on structures generated
by simple graph operations, such as level sets in BFS traversal. GG methods select a random vertex v, assign
the first n/2 vertices visited during BFS traversal from v to P1 and the remaining vertices to P2. The quality
of partitions produced by GG depends on several choices: start vertex, traversal ordering, and redundant
iterations. Though selecting a random starting vertex is simple it has been shown that the partition quality
may be improved by finding two pseudo-peripheral vertices and simultaneously growing partitions as shown
in Figure 5.2. Two vertices are considered pseudo-peripheral if the shortest path connecting them closely
approximates the diameter of G. Growing partitions from two well-separated seeds typically induces a
partition with two connected components and a relatively small number of edges cuts as the example in
Figure 5.2 suggests.
3
0
3
2
1
2
3
1
2
3
3
2
2
2
3
3
21
1
1
2
3
4
3
2
2
3 3
3
2
3
3
3
3
1
3
2
4
3
2
32
3
3
3
1 3
(a) Random
2
3
1
4
2
3
5
2
2
0
3
1
3
3
4
5
23
3
3
2
3
3
4
4
3
5 3
0
1
2
4
3
4
4
1
5
4
6
4
22
5
2
4
3 1
(b) Pseudo-peripheral
Figure 5.2: The level set structure and edge cuts of partitioning a graph based on a BFS starting from
a random vertex as compared to two pseudo-peripheral vertices.
We use a technique similar to GG to generate an improved x(0) by selecting two pseudo-peripheral vertices
and decomposing V into two sets based on distance to the nearest pseudo-peripheral neighbor. In contrast
to GG methods, where growth is alternated in each partition to ensure the number of vertices in each set
is approximately equal, we perform two BFS operations to avoid the fine-grained locking to process vertices
in a specific order, which is not well-suited for massive parallelism. A BFS is performed from each pseudo-
peripheral vertex and vertices are associated with the pseudo-peripheral vertex that yielding the lowest level.
The resulting operation is highly parallel [60], but may result in unbalanced partitions or higher edge cuts
in comparison to the traditional GG methods. However, the total number of cuts and the probability of a
random edge generating a cut are still bounded by the boundary between two connected partitions, which
is small in comparison to the entire graph.
87
The quality of this approach is summarized in Figure 5.3. Here we compare cos(φ) using a normalized
initial guess x(0) generated randomly, xR, the level set of a BFS starting from a random vertex, and the GG
inspired pseudo-peripheral procedure. As expected, the random initial partition xR has inferior properties
in comparison to the alternative approaches. Notably the low quality approximate xF constructed using
pseudo-peripheral vertices consistently displays superior characteristics to the random BFS method.
0 1 2 3 4 5 6
cos(φ2) ×10−3
0
5
10
15
20
25
30
F
re
qu
en
cy
(a) Random
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
cos(φ2)
0
5
10
15
20
25
30
F
re
qu
en
cy
RAND
PSEUDO
(b) BFS
Figure 5.3: Angle between the exact precomputed Fiedler vector, xF , and the initial vectors generated
randomly (blue), BFS using one random source (red), and BFS starting from a pseudo-peripheral vertex
(green).
5.4 SpMV Method
Though providing an initial guess as presented in Section 5.3 may reduce the number of Lanczos iterations
there are two fundamental issues that must be addressed, the robustness of the a single guess and the efficiency
of SpMV operations. Robustness refers to two related issues concerning, first, the ability of Lanczos to
compute xF given x
(0) and secondly, the number of iterations required to resolve xF to a predefined degree
of accuracy. In the worst case since L is symmetric the eigenvectors are orthogonal and if x(0) is chosen such
that x(0)TxF = 0 then Lanczos will not converge to xF . This is a well-known issue and conventional wisdom
suggests the use of xR specifically to reduce the probability is this event [52]. In practice however even if
x(0) is orthogonal to xF the inexactness of floating-point arithmetic naturally introduces variations during
computation that will decrease the likelihood of sustained orthogonality over several iterations. A more
challenging question regards the number of iterations required for Lanczos to satisfy a specified stopping
criteria since the inherently random process of selecting x(0), regardless of the method, could result in large
88
variations in execution time and the quality of the solution.
Though x(0) impacts the number of Lanczos iterations it does not improve algorithm efficiency per
iteration. For i ∈ [1, n] SpMV operations perform
yi = yi +
∑
j∈NNZ(Li,:)
Lij × xj ,
where Li,: represents the i
th row of L, NNZ(Li,:) the set of column indices corresponding to nonzero entries
on a given row. Therefore accesses to x are dictated by the location of column indices on each row of L. This
data-dependent access pattern may undermine the effectiveness of the memory subsystem to utilize caching
or coalescing, increased bandwidth for accesses to contiguous memory regions, to mitigate the expense of
transferring data. This issue is exacerbated by the relatively small reuse of entries from x since for each
nonzero entry in L one entry from x is fetched to perform only one multiplication and addition. This low
data reuse precludes hiding the communication time with computation.
One simple solution to address robustness is to apply the Lanczos solver several times with different x(0)
in a similar fashion as GG methods discussed in Section 5.3. However this solution does not address the
SpMV efficiency per iteration and increases the overall costs by requiring sequential application of Lanczos.
Fortunately both issues are addressed if we perform simultaneous or block Lanczos iterations that operate
collectively on a group of vectors. Block Lanczos is known to be superior to generic Lanczos in terms
robustness and efficiency [37]. By increasing the set of orthogonal vectors per iteration duplicate or tightly
clustered eigenpairs are resolvable and operating on multiple vectors typically increases the efficiency of
memory operations.
Constructing X
(0)
BFS ∈ Rn×k, where k is the block size, requires additional consideration to select vectors
that form a orthonormal basis and reasonably approximate xF . Although utilizing pseudo-peripheral vertices
to construct a single guess was shown to yield a favorable x(0) it is not apparent that constructing subsequent
vectors in a similar manner would guarantee mutual orthogonality or quality improvement. Therefore we use
a comparatively less expensive method to initialize subsequent vectors with BFS level sets generated using
a random source vertex. Once all of the columns of X
(0)
BFS are initialized we apply a modified Gram-Schmidt
procedure with column pivoting to select a maximally orthogonal set of vectors. Linearly dependent vectors
are discarded and k is reduced. In Figure 5.4 the result of using this procedure on the Ritz values produced
by Lanczos are compared using both BFS and randomly generated X(0) matrices with k = 4. Though the
distribution of Ritz values appears globally similar after 20 iterations by focusing on the region containing
only the smallest values close the zero, the only region of interest, X
(0)
BFS has generated four significantly
smaller Ritz values than the random basis much earlier in the Lanczos sequence. Ritz values from the
89
0 10 20 30 40
Ritz value
0
5
10
15
20
It
er
at
io
n
(a) Full Spectrum
0.00 0.05 0.10 0.15 0.20
Ritz value
0
5
10
15
20
It
er
at
io
n
X
(0)
RAND
X
(0)
BFS
(b) Fiedler region
Figure 5.4: The distribution the Ritz values throughout the spectrum of L are nearly identical for
the both randomly generated, X
(0)
RAND, and BFS, X
(0)
BFS, based initial guesses. However, X
(0)
BFS yields
multiple Ritz values in the region containing the Fiedler value much sooner than X
(0)
RAND, which does
not remotely approximate the region of interest until iteration 20.
X
(0)
RAND do not begin to approach the region of interest until iteration 20.
Blocking also has a large impact on the performance of each SpMV operation, however special considera-
tions in the algorithm and storage format of the matrix. Along with general formats, such as CSR and COO,
more specialized formats have been proposed in the literature attempting to reduce the cost of accesses to
x, DIA and ELL, and mixed schemes that seek generality and performance, such as HYB. Since our BFS
algorithm naturally processing matrices represented in CSR format we avoid transforming to more specialize
SpMV formats and instead focus on algorithmic enhancements to improve blocked performance.
Our blocked SpMV is a augmentation of the vectorized CSR algorithm [10]. In the original algorithm
thread blocks are decomposed into smaller cooperative groups to process individual rows of the matrix and
the number of threads per group is a runtime parameter determined by the maximum number of entries
on any row of the matrix. With this decomposition accesses to x are distributed across multiple threads at
the expense of requiring a reduction through shared memory to accumulate partial sums within the thread
group. In our implementation the number of threads per group is dictated by the number of columns in
X. Each row is assigned to a single thread group and within each thread group a single thread processes
all entries corresponding to a single column of X. Since we store X in row-major format row accesses by
consecutive threads within a group are assured to be coalesced. By processing entries using a single thread
per column our block algorithm, outlined in Figure 5.5, loads entries from L once, avoids shared memory
reductions, and does not require any synchronization points. Note that values of L are not explicitly loaded,
or stored, in memory because of the graph Laplacian simplicity.
90
__global__ void
BlockSpmvKernel(const int num_rows ,
const int* row_offsets ,
const int* column_indices ,
const float* X,
float* Y)
{
__shared__ volatile int ptrs[VECTORS_PER_BLOCK ][2];
__shared__ volatile int sdata[VECTORS_PER_BLOCK * THREADS_PER_VECTOR ];
const int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
const int vector_id = thread_id / THREADS_PER_VECTOR;
const int thread_lane = threadIdx.x & (THREADS_PER_VECTOR - 1);
const int vector_lane = threadIdx.x / THREADS_PER_VECTOR;
const int num_vectors = VECTORS_PER_BLOCK * gridDim.x;
for(int row = vector_id; row < num_rows; row += num_vectors){
if(thread_lane < 2)
ptrs[vector_lane ][ thread_lane] = row_offsets[row + thread_lane ];
const int row_start = ptrs[vector_lane ][0];
const int row_end = ptrs[vector_lane ][1];
// initialize sum to row length times x entry
float sum = X[row*THREADS_PER_VECTOR + thread_lane] * (row_end - row_start
);
for(int jj = row_start; jj < row_end; jj += THREADS_PER_VECTOR){
if(jj + thread_lane < row_end)
sdata[threadIdx.x] = column_indices[jj + thread_lane ];
for(int kk = 0; kk < min(THREADS_PER_VECTOR ,row_end -jj); kk++){
int col = sdata[vector_lane*THREADS_PER_VECTOR + kk];
// avoid loading L values assumed to be -1
sum += -1.0 * X[col*THREADS_PER_VECTOR + thread_lane ];
}
}
// each thread in group stores a single value
Y[row*THREADS_PER_VECTOR + thread_lane] = sum;
}
}
Figure 5.5: Block SpMV kernel for the vectorized CSR algorithm.
5.5 Experiments
In this section, we examine the performance of a GPU implementation of the proposed spectral partitioning
method. All of the operations are performed using single precision with error-correcting code (ECC) memory
support disabled and the times reported are the average for 10 iterations. Though ECC is an important
feature to mitigate random errors in DRAM memory it also decreases the peak achievable bandwidth on
the device. Our system is configured with CUDA v5.0 [67] and Thrust v1.7 [45], and all tests are performed
91
using an Nvidia Tesla C2075[65]. Table 5.1 outlines our suite of test matrices taken from the Walshaw
benchmark [77], a publicly available repository of graphs spanning various areas of various applications.
Notably for each matrix the best known number of cuts is presented that achieves 0% imbalance between
partitions.
Graph n nnz Best Graph n nnz Best
fe tooth 78 136 452 591 3816 fe rotor 99 617 662 431 2098
598a 110 971 741 934 2398 fe ocean 143 437 409 593 464
144 144 649 1 074 393 6486 wave 156 317 1 059 331 8677
m14b 214 765 1 679 018 3836 auto 448 695 3 314 611 10103
Table 5.1: Statistics for test matrices including the best known bipartitioning.
5.5.1 Convergence
The reliance of spectral partitioners on sparse eigensolvers complicates the selection of a appropriate stopping
criteria. Although xF partitions the vertices into disjoint sets it is not clear the level of accuracy necessary
to produce the optimal cut. As illustrated in Figure 5.6 the required accuracy of the xF approximation
to produce a near optimal number of edge cuts occurs much earlier than the measurable accuracy of the
eigenpairs may indicate. Since the objective of the spectral partitioner is to minimize the edge cuts we use
this to measure convergence and not eigenpair quantities. However as with traditional approaches there
is a cost associated with measuring convergence therefore checking too often decreases performance. We
therefore balance the trade-off between performance and accuracy by utilizing a fixed number of Lanczos
iterations, 10, in all computations.
The numerical quality of the produced solution is dependent on the initial vector as discussed in Sec-
tion 5.3. In Table 5.2 cos(φ2) is computed for each matrix in the test suite and for each of the proposed
methods of generation: BFS random vertex, BFS pseudo-peripheral vertex, and random. In many cases the
random initial guess is shown to be an order of magnitude worse than the alternative methods. Also it is
clear that using a pseudo-peripheral vertex is much effective than a random vertex though for some instance,
such as wave, the improvement is marginal.
5.5.2 SpMV vs. BFS
Though both operations are O(m) there is a trade-off in terms of both computation and storage overhead
between SpMV performance in the Lanczos iterations and required BFS iterations to construct the initial
set of vectors. In Figure 5.7 SpMV efficiency is shown to increase steadily with the number of vectors up to
32. In Table 5.3 we compare the performance of a single BFS operation indirectly with SpMV by tabulating
92
0 5 10 15 20 25 30 35 40 45
Iteration
10−15
10−13
10−11
10−9
10−7
10−5
10−3
10−1
101
E
rr
or
10
15
20
25
30
35
40
45
E
d
ge
cu
ts
|λ(i)2 − λ(i−1)2 |
‖Lx(i)2 − λ(i)2 x(i)2 ‖2
Edge cuts
Figure 5.6: Choosing an adequate stopping criteria is challenging because the metric of interest, edge
cuts, is not strongly correlated with the convergence properties of the eigenvalue problem. As illustrated
the edge cuts may reach the minimum well in advance of the eigenproblem accuracy quantities.
Matrix BFS Pseudo Random
fe tooth 0.1108± 0.0956 0.5690± 0.1498 0.0017± 0.0012
fe rotor 0.2609± 0.1819 0.7710± 0.0041 0.0015± 0.0007
598a 0.1700± 0.0520 0.8664± 0.0404 0.0017± 0.0014
fe ocean 0.2495± 0.1476 0.7953± 0.0173 0.0009± 0.0009
144 0.1157± 0.0745 0.7622± 0.0290 0.0010± 0.0008
wave 0.1010± 0.0560 0.3950± 0.0724 0.0009± 0.0007
m14b 0.1326± 0.0500 0.8350± 0.0295 0.0012± 0.0004
auto 0.1080± 0.0602 0.8134± 0.0336 0.0005± 0.0005
Table 5.2: A comparison of the angle, cos(φ2), each initial guess has with the Fiedler vector. Each
entry represents the average of 10 iterations along with the standard deviation.
the results for a single Lanczos iteration. Though each Lanczos iteration is dominated by SpMV costs we
choose to compare the BFS costs per total Lanczos time each iteration to highlight the overall number of
Lanczos iterations that are required to completely account for the BFS overhead. The reference Lanczos
performance is with respect to using single initial vector. As shown in Table 5.3 the time to compute a single
BFS operations in some instances, such as fe ocean, is significantly higher than a single Lanczos iteration
therefore using a large number of initial guess is relatively expensive in comparison with increasing the
number of Lanczos iterations.
Unlike SpMV the performance of BFS is dependent on the initial source vertex used to seed the operation.
93
fe
to
ot
h
fe
ro
to
r
59
8a
fe
oc
ea
n
14
4
wa
ve
m1
4b au
to
0
10
20
30
40
50
60
G
F
lo
p
/s
1 2 4 8 16 32
(a) GFlop/s
fe
to
ot
h
fe
ro
to
r
59
8a
fe
oc
ea
n
14
4
wa
ve
m1
4b au
to
0
20
40
60
80
100
120
140
G
B
yt
e/
s
1 2 4 8 16 32
(b) GByte/s
Figure 5.7: The efficiency of the SpMV operation for matrices in the test suite increases with the
number of columns, k, in the X ∈ Rn×k RHS matrix. Performance measured in terms of GFlop/s and
GByte/s is presented comparing performance for k = 1 to 32 columns in powers of two.
Matrix BFS Lanczos BFS/Lanczos
fe tooth 4.08 0.85 4.78
fe rotor 4.85 1.16 4.17
598a 4.54 1.61 2.82
fe ocean 7.02 0.89 7.93
144 4.80 1.89 2.54
wave 4.92 1.19 4.13
m14b 5.65 2.51 2.25
auto 8.71 5.33 1.64
Table 5.3: The cost of 1 BFS vs. 1 Lanczos iteration are compared over 100 iterations. The BFS/Lanc-
zos ratio represents the decrease in Lanczos iterations necessary to justify a single BFS generated initial
guess.
In Figure 5.8 we show that for the graph with the largest BFS/Lanczos ratio from Table 5.3, fe ocean, that
the BFS performance scales linearly with the number of computed levels during the traversal. Though this
variation in performance is unavoidable the relative difference in the BFS time versus the number of levels is
small with a difference of nearly 100 levels between the best and worst performance compared to difference
of less than 2 milliseconds in total time.
5.5.3 Performance
In Table 5.4 we compare the quality of our lower quality spectral bisection approach, using blocked Lanczos
and modified Gram-Schmidt (MGS) reorthogonalization on each iteration, with a high quality Fiedler vector
94
130 140 150 160 170 180 190 200 210 220
Number of Levels
6.0
6.5
7.0
7.5
8.0
8.5
T
im
e
(m
s)
Figure 5.8: BFS performance scales linearly with number of levels generated from the source vertex.
computation utilizing AMG preconditioned LOBPCG to compute the vector to a accuracy of 10−10 for
computing perfectly balanced partitions using the median cut criteria [51]. The LOBPCG edge cuts are
used as the reference and compared with 10 iterations of our approach using a set of 4 random initial guesses
and 4 BFS based initial guesses. We focus on 4 initial guesses since our SpMV implementation utilizes single
precision therefore each memory load and store operation utilizes all of the data fetched into the 128-bit
cache lines. To reduce the total time to form the BFS initial guesses we compute the level set starting
from a pseudo-peripheral vertex once, since computing pseudo-peripheral vertices requires at least three
BFS computations, than generate all subsequent level sets using a BFS level set starting from a randomly
chosen source vertex. We note that our SpMV implementation processes the blocked vectors in row-major
layout but our MGS orthogonalization scheme requires access to individual columns. We therefore incur a
moderate additional cost to transpose the blocked vectors from row to column major before and after each
SpMV operation. The difference in the number of edge cuts between the random set and the BFS based
iterations is shown to be roughly an order of magnitude in all test cases. Although the minimum number
of edge cuts computed using BFS initial guesses is in some cases quite close to the optimal number of cuts
computed using the reference the maximum number of cuts can be significantly larger. As shown in Table 5.4
the additional overhead of generating the BFS initial guesses on the total computational time may be roughly
twice that of the randomly seed process in some cases. This additional overhead however accounts for a
dramatic decrease in the number of edge cuts and represents a clear trade-off between performing additional
95
Lanczos iterations and BFS operations. Although applying the median cut method on the Fiedler vector
approximation from the randomly seeded Lanczos may result in disconnected partitions, since the vector has
not adequately converged, the BFS seeded Lanczos generates connected partitions in many cases with only
a small number of disconnected vertices appearing along the bisection boundary.
Matrix Ref Min Max Mean Time (ms)
fe tooth 4323.0
39587 42565 41101.8 23.71
4583 6062 5270.7 44.55
fe rotor 2419.3
79418 81017 80037.8 29.44
2550 2762 2615.9 59.76
598a 2502.0
52183 56316 54603.1 36.77
3279 4733 3786.0 70.15
fe ocean 647.0
31083 33602 32613.6 31.69
967 2323 1586.8 78.31
144 7279.8
73703 78891 76037.5 43.92
7699 9856 8568.3 81.43
wave 9856.6
96952 104028 100160.2 37.64
10706 17878 14307.2 64.65
m14b 4079.2
126762 138025 131459.3 60.34
4617 13740 9692.1 98.21
auto 13110.6
272765 283336 278362.2 125.54
17529 28344 23439.7 181.61
Table 5.4: The minimum, maximum and mean number of edge cuts obtained after 10 iterations of
Lanczos using 4 initial vectors generated using random (top) and BFS level sets (bottom) averaged over
10 runs. Results are compared with a reference number of edge cuts generated using AMG preconditioned
LOBPCG with a tolerance of 10−10.
5.6 Conclusion
By using simultaneous BFS initial guesses to seed the spectral partitioning method we have reduced the total
time to compute a bisection of the graph. Although the quality of the partitions computed using spectral
methods is heavily dependent on the structure of the graph our primary area of interest has been the use of
these methods on computational meshes, which have been theoretically proven to be amenable to spectral
decompositions. Using several initial guesses improves the robustness of our proposed scheme and allows
the use of high performance blocked SpMV methods to increase the performance of these bandwidth-limited
operations. The quality of our bisection was shown to be comparable with traditional schemes based on
using higher quality approximations to the Fiedler vector. In future work we plan to investigate adaptively
choosing the optimal number Lanczos iterations to balance cut quality and performance, using recursive
96
spectral partitioning to decompose a graph into multiple subsets by utilize higher order eigenvectors of the
graph Laplacian and integrating parallel refinement techniques based on maximum flow to reduce the number
of edge cuts along the boundary between neighboring partitions.
97
Chapter 6
Summary and Future Directions
In this dissertation we have focused on the fine-grained parallel decomposition of AMG solvers amenable
for execution on emerging throughput-oriented GPU architectures. We demonstrated that our GPU AMG
implementation exposes substantial fine-grained parallelism in each component during construction of the
hierarchy in the setup phase and iterative cycling of the hierarchy in the solve phase. Remarkably our
AMG implementation is fully expressible using a small set of generic primitives and required a single GPU
specific CUDA kernel, SpMV, for performance purposes during the solve phase. We demonstrated that
parallel primitives are a viable substrate for describing complex sparse matrix operations and represent a
productive, efficient and portable method for simultaneously targeting CPUs, GPUs and any architecture
supporting a small set of operations. In particular, we described highly-parallel methods for SpGEMM and
aggregation.
Though performance improvement was expected during the cycling phase we also achieved notable im-
provement during the AMG setup phase. The ability to perform the setup phase directly on the GPU allows
our AMG solver to simultaneously leverage the distinct architectural advantages of operating on the device
as well as avoiding relatively expensive operations to copy data across the PCI-express bus. Whereas CPU
implementations of major components such as sparse matrix-vector multiplication, sparse matrix-matrix
multiplication, and sparse matrix transposition, derive little or no benefit from multi-threading, our im-
plementations leverage scalable parallel primitives. In this dissertation we have focused on the necessary
components to perform smoothed aggregation based AMG on the GPU therefore future directions include the
exploration of the plethora of heuristics derived in AMG literature. Though many highly effective strength
of connection and coarse grid construction methods have been presented it is not readily apparent if they
may be efficiently mapped to data-parallel operations necessary for GPUs.
This dissertation also presented a global sort based SpGEMM operation based on three simple operations,
expansion, sorting, and contraction. It was shown that by distilling SpGEMM into optimized primitives that
many of the challenging aspects of these highly irregular data-dependent workloads are simplified since all
operations are performed over aggregated intermediate data. However, the process of constructing and
98
sorting the intermediate matrix in global memory was identified as the primary performance limiting factor
of our implementation. We therefore proposed and evaluated a selective segmented processing approach
that utilizes a lightweight analysis procedure to order processing of the intermediate matrix into uniform
subsets. This regularization proved to be most beneficial for workloads with a moderate number of products
per row allowing efficient processing completely within fast shared on-chip memory. Finally we considered
an alternative SpGEMM implementation that ignored segmentation an utilized a two phase sorting scheme
to reduce the total processing time of the ESC algorithm. We showed that this strategy maintained the
performance stability of the ESC algorithm while decreasing the overall processing time dramatically in some
cases. There are numerous lines of future work concerning SpGEMM and other sparse matrix operations on
GPUs. In this work we focused on analysis and accelerating SpGEMM using traditional storage formats,
such as COO and CSR, but many computational packages support specialized SpMV storage formats. It is
unclear how the use of these formats will impact the implementation and performance of SpGEMM on GPUs.
In the context of AMG we have considered performing the coarse matrix construction using two SpGEMM
operations. It is an open question if it possible perform both SpGEMM operations simultaneously to reduce
the memory overhead associated with forming the intermediate matrix and improve the performance.
Finally we studied spectral graph partitioning on GPUs and presented an accelerated processing scheme
based on simultaneous BFS initial guesses. Using this approach we reduced the total time to compute
a bisection of a given graph. The quality of our bisection was shown to be comparable with traditional
schemes utilizing higher quality Fiedler vector computations. In future work we plan to investigate computing
recursive spectral partitioning to decompose a graph into multiple subsets by utilize higher order eigenvectors
of the graph Laplacian and integrate parallel refinement techniques based on maximum flow to reduce the
number of edge cuts along the boundary between neighboring partitions.
99
Appendix A
Cusp : A C++ Templated Sparse
Matrix and Graph Library
A software-based achievement of this thesis is the joint development of Cusp [9], an open-source C++ iterative
solver package developed in collaboration with Nvidia. All of the computational experiments in this work
were performed inside the Cusp library and all algorithms developed as part of this thesis are publicly
available as open-source software. Cusp is the numerical algorithm developer’s analog to the generic STL-
style GPU programming interface exposed through Thrust. This programming framework provides high-
level abstractions to common data types and operations specifically targeting basic linear algebra (BLAS)
routines involving dense and sparse matrices. With the exception of specialized implementations to improve
the performance of critical operations, such as SpMV and SpGEMM, all operations in Cusp are expressed
without resorting to specialized GPU programming languages, such as CUDA.
Cusp provides facilities for performing a wide range of operations on both graph and sparse matrices.
As a result of this work Cusp was the first package to provide a fully GPU enabled smoothed aggregation
preconditioner that performed both the setup and solve phase directly on the device without relying on
support from the CPU. The efficiency of the package is ensured by expressing all computations using
abstract iterators exported by the Thrust [45] library. Cusp implements several Krylov methods for solving
both symmetric and non-symmetric linear systems using either diagonal, approximate inverse or smoothed
aggregation preconditioners. Figure A.1 provides a example program for reading a sparse matrix stored in
matrix market format from disk and solving the linear system using conjugate gradient. The example utilizes
the specialized hybrid format since one of the key performance limiting components of Krylov methods are
SpMV operations, therefore Cusp natively supports several storage formats.
SpGEMM is a challenging operation to express efficiently on parallel platforms. One novel aspect of the
ESC algorithm is that it can be completely distilled into a small set of Thrust routines for processing on the
CPU or GPU. This achievement signifies one of the critical advantages of utilizing a general framework for
writing applications targeting emerging parallel architectures since anyone could express extremely complex
operations for GPUs without knowledge of domain specific parallel programming languages or architecture
related idiosyncrasies.
100
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>
int main(void)
{
// create an empty sparse matrix structure (HYB format)
cusp:: hyb_matrix <int , float , cusp:: device_memory > A;
// load a matrix stored in MatrixMarket format
cusp::io:: read_matrix_market_file(A, "5pt_10x10.mtx");
// allocate storage for solution (x) and right hand side (b)
cusp::array1d <float , cusp:: device_memory > x(A.num_rows , 0);
cusp::array1d <float , cusp:: device_memory > b(A.num_rows , 1);
// solve the linear system A * x = b with the Conjugate Gradient method
cusp:: krylov ::cg(A, x, b);
return 0;
}
Figure A.1: Construct a sparse matrix in device memory and solve the linear system using conjugate
gradient.
A	   B	  
1 2 5 1 2 1 5column	  
scale	  
store	  
value	  
1 1 1 2 2 5 5row	  
1	   2	   3	   4	   5	   6	  
1	  
2	  
3	  
4	  
5	  
6	  
Figure A.2: Expansion and scaling of rows from B corresponding to nonzero entries of A into inter-
mediate buffer.
SpGEMM requires adaptively gathering rows from matrix B according to the column indices of matrix
A as described in Chapter 3 and depicted in Figure A.2. We express this process using Thrust by first
computing the exclusive scan of the B row sizes referenced by the column indices of A, segment lengths,
and performing a conditional scatter followed by a inclusive scan to compute the A locations corresponding
to each product constituting each intermediate entry. A similar process is used to compute the B location
for each product. We initialize A gather locations to 0 in order to repeatedly reference the base entry for
each product in the expanded set in contrast to B gather locations, which we initialize to 1, to ensure that
101
successive products corresponding to the same A index are gathered in order as illustrated in Figure A.3.
// compute A gather locations of intermediate format
thrust ::fill(A_gather_locations.begin (), A_gather_locations.end(), 0);
thrust :: scatter_if(
thrust :: counting_iterator <T>( begin_segment),
thrust :: counting_iterator <T>( end_segment),
output_ptr.begin () + begin_segment ,
segment_lengths.begin () + begin_segment ,
A_gather_locations.begin() - output_ptr[begin_segment ]);
thrust :: inclusive_scan(A_gather_locations.begin (), A_gather_locations.end(),
A_gather_locations.begin (), thrust ::maximum <T>());
// compute B gather locations of intermediate format
thrust ::fill(B_gather_locations.begin (), B_gather_locations.end(), 1);
thrust :: scatter_if(
thrust :: make_permutation_iterator(B_row_offsets.begin (), A.column_indices.
begin ()) + begin_segment ,
thrust :: make_permutation_iterator(B_row_offsets.begin (), A.column_indices.
begin ()) + end_segment ,
output_ptr.begin () + begin_segment ,
segment_lengths.begin () + begin_segment ,
B_gather_locations.begin() - output_ptr[begin_segment ]);
thrust :: inclusive_scan_by_key(
A_gather_locations.begin(),
A_gather_locations.end(),
B_gather_locations.begin(),
B_gather_locations.begin());
Figure A.3: Compute gather locations from A and B.
In Figure A.4 we see that following computation of the gather indices the referenced (row, column) pairs
are copied into the intermediate buffers, I and J, and the intermediate products are also formed by applying
a multiply transform on the permuted values of the A and B matrices according to the gather locations.
The intermediate indices are then sorted by both row, I, and column, J, indices using a two-pass radix sort
routine in Cusp’s sort by row and column function.
102
// Gather row and column indices of intermediate matrix into array I and J,
respectively.
thrust :: gather(A_gather_locations.begin (), A_gather_locations.end(),
A.row_indices.begin(),
I.begin());
thrust :: gather(B_gather_locations.begin (), B_gather_locations.end(),
B.column_indices.begin(),
J.begin());
thrust :: transform(
thrust :: make_permutation_iterator(A.values.begin (), A_gather_locations
.begin()),
thrust :: make_permutation_iterator(A.values.begin (), A_gather_locations
.end()),
thrust :: make_permutation_iterator(B.values.begin (), B_gather_locations
.begin()),
V.begin (),
thrust ::multiplies <V>());
// sort (I,J,V) tuples by (I,J)
cusp:: detail :: sort_by_row_and_column(I, J, V);
Figure A.4: Collection oriented SpGEMM sorting implementation.
Prior to allocating storage for the output we compute the number of unique entries in the output matrix
by counting the number of discontinuities in the expanded entries using Thrust’s inner product routine.
Finally, storage for the output matrix is allocated and products corresponding to duplicate entries in the
expanded format are reduced using reduce by key. By expressing all of the operations in terms of generic
primitives this implementation yields low variability in performance variability across a range of SpGEMM
instances.
103
// compute unique number of nonzeros in the output
T NNZ = thrust :: inner_product(
thrust :: make_zip_iterator(
thrust :: make_tuple(I.begin (), J.begin ())),
thrust :: make_zip_iterator(thrust :: make_tuple(I.end (), J.end()))
- 1,
thrust :: make_zip_iterator(thrust :: make_tuple(I.begin (), J.begin ()))
+ 1,
T(0),
thrust ::plus <T>(),
thrust :: not_equal_to < thrust ::tuple <T,T> >()) + 1;
// allocate space for output
C.resize(A.num_rows , B.num_cols , NNZ);
// sum values with the same (i,j)
thrust :: reduce_by_key
(thrust :: make_zip_iterator(thrust :: make_tuple(I.begin(), J.begin())),
thrust :: make_zip_iterator(thrust :: make_tuple(I.end(), J.end())),
V.begin(),
thrust :: make_zip_iterator(thrust :: make_tuple(C.row_indices.begin (), C.
column_indices.begin ())),
C.values.begin(),
thrust ::equal_to < thrust ::tuple <T,T> >(),
thrust ::plus <V>());
Figure A.5: Collection oriented SpGEMM reduction implementation.
104
References
[1] M. Adams, M. Brezina, J. Hu, and R. Tuminaro. Parallel multigrid smoothing: polynomial versus
Gauss-Seidel. J. Comput. Phys., 188(2):593–610, 2003. ISSN 0021-9991.
[2] B. F. Auer. Gpu acceleration of graph matching, clustering, and partitioning. 2013.
[3] A. H. Baker, T. Gamblin, M. Schulz, and U. M. Yang. Challenges of scaling algebraic multigrid across
modern multicore architectures. In Proceedings of the 25th IEEE International Parallel and Distributed
Processing Symposium (IPDPS 2011), 2011.
[4] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object
oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern
Software Tools in Scientific Computing, pages 163–202. Birkha¨user Press, 1997.
[5] G. Ballard, A. Buluc, J. Demmel, L. Grigori, B. Lipshitz, O. Schwartz, and S. Toledo. Communication
optimal parallel multiplication of sparse random matrices. Technical Report UCB/EECS-2013-13, EECS
Department, University of California, Berkeley, Feb 2013. URL http://www.eecs.berkeley.edu/
Pubs/TechRpts/2013/EECS-2013-13.html.
[6] R. E. Bank and C. C. Douglas. Sparse matrix multiplication package (SMMP). Advances in Computa-
tional Mathematics, 1(1):127–137, 1993. ISSN 1019-7168.
[7] M. M. Baskaran and R. Bordawekar. Optimizing sparse matrix-vector multiplication on GPUs. IBM
Research Report RC24704, IBM, Apr. 2009.
[8] S. Baxter. Mgpu : Design patterns for gpu computing. http://nvlabs.github.io/moderngpu/, 2014.
Version 1.1.
[9] N. Bell and M. Garland. CUSP : Generic parallel algorithms for sparse matrix and graph com-
putations. http://code.google.com/p/cusp-library/, 2009-. URL http://code.google.com/p/
cusp-library/.
[10] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented
processors. In SC ’09: Proceedings of the Conference on High Performance Computing Networking,
Storage and Analysis, pages 1–11, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-744-8. doi:
http://doi.acm.org/10.1145/1654059.1654078.
[11] N. Bell, S. Dalton, and L. Olson. Exposing fine-grained parallelism in algebraic multigrid methods.
SIAM Journal on Scientific Computing, 34(4):C123–C152, 2012. doi: 10.1137/110838844. URL http:
//epubs.siam.org/doi/abs/10.1137/110838844.
[12] G. E. Blelloch. Vector models for data-parallel computing. MIT Press Cambridge, Mass, 1990.
[13] J. Bolz, I. Farmer, E. Grinspun, and P. Schro¨oder. Sparse matrix solvers on the GPU: conjugate
gradients and multigrid. ACM Trans. Graph., 22(3):917–924, 2003. ISSN 0730-0301. doi: http://doi.
acm.org/10.1145/882262.882364.
[14] E. G. Boman and M. M. Wolf. A nested dissection partitioning method for parallel sparse matrix-vector
multiplication. In HPEC, pages 1–6, 2013.
105
[15] T. N. Bui and C. Jones. Finding good approximate vertex and edge partitions is np-hard. Inf. Process.
Lett., 42(3):153–159, May 1992. ISSN 0020-0190. doi: 10.1016/0020-0190(92)90140-Q. URL http:
//dx.doi.org/10.1016/0020-0190(92)90140-Q.
[16] A. Buluc and J. Gilbert. On the representation and multiplication of hypersparse matrices. In Parallel
and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–11, 2008.
doi: 10.1109/IPDPS.2008.4536313.
[17] A. Buluc¸ and J. R. Gilbert. Highly parallel sparse matrix-matrix multiplication. CoRR, abs/1006.2183,
2010.
[18] A. Buluc¸ and J. R. Gilbert. The combinatorial blas: design, implementation, and applications. IJHPCA,
25(4):496–509, 2011.
[19] J. W. Choi, A. Singh, and R. W. Vuduc. Model-driven autotuning of sparse matrix-vector multiply on
gpus. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, PPoPP ’10, pages 115–126, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-877-3.
doi: 10.1145/1693453.1693471. URL http://doi.acm.org/10.1145/1693453.1693471.
[20] E. Chow, R. D. Falgout, J. J. Hu, R. S. Tuminaro, and U. M. Yang. A survey of parallelization
techniques for multigrid solvers. Parallel processing for scientific computing, pages 179–202, 2006.
[21] M. Christen, O. Schenk, and H. Burkhar. General-purpose sparse matrix building blocks using the
NVIDIA CUDA technology platform. In First Workshop on General Purpose Processing on Graphics
Processing Units, Northeastern Univ., Boston, 2007.
[22] P. Ciarlet, Jr., and F. Lamour. On the validity of a front-oriented approach to partitioning large sparse
graphs with a connectivity constraint, 1994.
[23] A. J. Cleary, R. D. Falgout, V. E. Henson, J. E. Jones, T. A. Manteuffel, S. F. McCormick, G. N.
Miranda, and J. W. Ruge. Robustness and scalability of algebraic multigrid. SIAM Journal on Scientific
Computing, 21(5):1886–1908, 2000.
[24] E. Cohen. On optimizing multiplications of sparse matrices. In IPCO, pages 219–233, 1996.
[25] J. M. Cohen and M. J. Molemaker. A fast double precision CFD code using CUDA. In 21st International
Conference on Parallel Computational Fluid Dynamics (ParCFD2009), 2009.
[26] N. Corporation. Cublas library, May 2010. URL http://developer.nvidia.com/cublas. version 3.1.
[27] S. Dalton, N. Bell, and L. Olson. Optimizing sparse matrix-matrix multiplication for the gpu. Technical
report, Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign,
Illinois, February 2013.
[28] T. A. Davis and Y. Hu. The university of florida sparse matrix collection. ACM Trans. Math. Softw., 38
(1):1:1–1:25, Dec. 2011. ISSN 0098-3500. doi: 10.1145/2049662.2049663. URL http://doi.acm.org/
10.1145/2049662.2049663.
[29] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM,
51(1):107–113, Jan. 2008. ISSN 0001-0782. doi: 10.1145/1327452.1327492. URL http://doi.acm.org/
10.1145/1327452.1327492.
[30] E. F. DAzevedo, M. R. Fahey, and R. T. Mills. Vectorized sparse matrix multiply for compressed row
storage format. In International Conference on Computational Science (ICCS), pages 99–106. Springer,
2005.
[31] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. In
Proceedings of the 19th Design Automation Conference, DAC ’82, pages 175–181, Piscataway, NJ, USA,
1982. IEEE Press. ISBN 0-89791-020-6. URL http://dl.acm.org/citation.cfm?id=800263.809204.
106
[32] M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(98):298–305, 1973.
[33] M. Garland and D. B. Kirk. Understanding throughput-oriented architectures. Commun. ACM, 53:
58–66, November 2010. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/1839676.1839694. URL
http://doi.acm.org/10.1145/1839676.1839694.
[34] M. W. Gee, C. M. Siefert, J. J. Hu, R. S. Tuminaro, and M. G. Sala. ML 5.0 smoothed aggregation
user’s guide. Technical Report SAND2006-2649, Sandia National Laboratories, 2006.
[35] J. R. Gilbert, S. Reinhardt, and V. B. Shah. A unified framework for numerical and combinatorial
computing. Computing in Science and Engg., 10(2):20–25, Mar. 2008. ISSN 1521-9615. doi: 10.1109/
MCSE.2008.45. URL http://dx.doi.org/10.1109/MCSE.2008.45.
[36] D. Go¨ddeke, R. Strzodka, J. Mohd-Yusof, P. S. McCormick, H. Wobker, C. Becker, and S. Turek. Using
GPUs to improve multigrid solver performance on a cluster. International Journal of Computational
Science and Engineering, 4(1):36–55, Nov. 2008. doi: 10.1504/IJCSE.2008.021111.
[37] G. H. Golub and C. F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press,
Baltimore, MD, USA, 1996. ISBN 0-8018-5414-8.
[38] N. Goodnight, G. Lewin, D. Luebke, and K. Skadron. A multigrid solver for boundary value problems
using programmable graphics hardware. In HWWS ’03: Proceedings of the ACM SIGGRAPH/EURO-
GRAPHICS conference on Graphics hardware, pages 102–111, Aire-la-Ville, Switzerland, Switzerland,
2003. Eurographics Association. ISBN 1-58113-739-7.
[39] O. Green, R. McColl, and D. A. Bader. Gpu merge path: A gpu merging algorithm. In Proceedings of
the 26th ACM International Conference on Supercomputing, ICS ’12, pages 331–340, New York, NY,
USA, 2012. ACM. ISBN 978-1-4503-1316-2. doi: 10.1145/2304576.2304621. URL http://doi.acm.
org/10.1145/2304576.2304621.
[40] F. G. Gustavson. Two fast algorithms for sparse matrices: Multiplication and permuted transposition.
ACM Trans. Math. Softw., 4:250–269, September 1978. ISSN 0098-3500. doi: http://doi.acm.org/10.
1145/355791.355796. URL http://doi.acm.org/10.1145/355791.355796.
[41] G. Haase, M. Liebmann, G. Plank, and C. Douglas. Parallel algebraic multigrid on general purpose
gpus. In J. V. et al, editor, 3rd Austrian Grid Symposium, pages 28–37, 2010.
[42] Z. He and B. Hong. Dynamically tuned push-relabel algorithm for the maximum flow problem on
cpu-gpu-hybrid platforms. In IPDPS, pages 1–10. IEEE, 2010. URL http://dblp.uni-trier.de/db/
conf/ipps/ipdps2010.html#HeH10.
[43] V. E. Henson and U. M. Yang. Boomeramg: A parallel algebraic multigrid solver and preconditioner. Ap-
plied Numerical Mathematics, 41(1):155 – 177, 2002. ISSN 0168-9274. doi: DOI:10.1016/S0168-9274(01)
00115-5. URL http://www.sciencedirect.com/science/article/pii/S0168927401001155.
[44] M. Heroux, R. Bartlett, V. H. R. Hoekstra, J. Hu, T. Kolda, R. Lehoucq, K. Long, R. Pawlowski,
E. Phipps, A. Salinger, H. Thornquist, R. Tuminaro, J. Willenbring, and A. Williams. An Overview of
Trilinos. Technical Report SAND2003-2927, Sandia National Laboratories, 2003.
[45] J. Hoberock and N. Bell. Thrust: A parallel template library, 2011. URL http://www.meganewtons.
com/. Version 1.4.0.
[46] S. Kaniel. Estimates for some computational techniques in linear algebra. Math. Comp., 20:369–378,
1966.
[47] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs.
SIAM J. Sci. Comput., 20(1):359–392, Dec. 1998. ISSN 1064-8275. doi: 10.1137/S1064827595287997.
URL http://dx.doi.org/10.1137/S1064827595287997.
107
[48] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM
Review, 41(2):278–300, 1999.
[49] M. Kazhdan and H. Hoppe. Streaming multigrid for gradient-domain operations on large images. In
ACM SIGGRAPH 2008 papers, SIGGRAPH ’08, pages 21:1–21:10, New York, NY, USA, 2008. ACM.
ISBN 978-1-4503-0112-1. doi: http://doi.acm.org/10.1145/1399504.1360620. URL http://doi.acm.
org/10.1145/1399504.1360620.
[50] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell system
technical journal, 49(1):291–307, 1970.
[51] A. V. Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned
conjugate gradient method. SIAM J. Sci. Comput, 23:517–541, 2001.
[52] J. Kuczynski and H. Wozniakowski. Estimating the largest eigenvalue by the power and lanczos algo-
rithms with a random start, 1992.
[53] J. Kurzak, S. Tomov, and J. Dongarra. Autotuning gemms for fermi , 2011.
[54] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for
Fortran Usage. ACM Transactions on Mathematical Software (TOMS), 5(3):308–323, 1979.
[55] M. Lu, L. zhu Zhang, and F. Tian. Lower bounds of the laplacian spectrum of graphs based on
diameter. Linear Algebra and its Applications, 420(23):400 – 406, 2007. ISSN 0024-3795. doi: http://
dx.doi.org/10.1016/j.laa.2006.07.023. URL http://www.sciencedirect.com/science/article/pii/
S0024379506003478.
[56] M. Luby. A simple parallel algorithm for the maximal independent set problem. SIAM J. Comput., 15
(4):1036–1055, 1986. ISSN 0097-5397. doi: http://dx.doi.org/10.1137/0215074.
[57] D. Merrill. Cub : Reusable software for cuda programming. http://nvlabs.github.io/cub/, 2014.
Version 1.3.2.
[58] D. Merrill and A. Grimshaw. Parallel scan for stream architectures. Technical Report CS2009-14,
University of Virginia, Department of Computer Science, Charlottesville, VA, USA, 2009.
[59] D. Merrill and A. Grimshaw. High performance and scalable radix sorting: A case study of imple-
menting dynamic parallelism for GPU computing. Parallel Processing Letters, 21(02):245–272, 2011.
ISSN 0129-6264. doi: 10.1142/S0129626411000187. URL http://www.worldscinet.com/ppl/21/
2102/S0129626411000187.html.
[60] D. Merrill, M. Garland, and A. Grimshaw. Scalable gpu graph traversal. In 17th ACM SIGPLAN
symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pages 117–128, New York,
NY, USA, 2012. ACM.
[61] D. G. Merrill and A. S. Grimshaw. Revisiting sorting for gpgpu stream architectures. Technical Report
CS2010-03, University of Virginia, Department of Computer Science, Charlottesville, VA, USA, 2010.
[62] R. Merris. A note on laplacian graph eigenvalues. Linear Algebra and its Applications, 285(13):33
– 35, 1998. ISSN 0024-3795. doi: http://dx.doi.org/10.1016/S0024-3795(98)10148-9. URL http:
//www.sciencedirect.com/science/article/pii/S0024379598101489.
[63] H. Nguyen. Gpu gems 3. Addison-Wesley Professional, first edition, 2007. ISBN 9780321545428.
[64] Nvidia. CUSPARSE : Users guide. http://developer.nvidia.com/cusparse.
[65] TESLA C2050/C2070 GPU Computing Processor Supercomputing at 1/10th the Cost. NVIDIA Corpo-
ration, July 2010. www.nvidia.com/docs/IO/43395/NV DS Tesla C2050 C2070 jul10 lores.pdf.
[66] NVIDIA CUDA Programming Guide. NVIDIA Corporation, May 2011. URL http://www.nvidia.
com/CUDA. Version 4.0.
108
[67] NVIDIA CUDA Programming Guide. NVIDIA Corporation, Dec. 2012. URL http://www.nvidia.
com/CUDA. Version 5.0.
[68] L. N. Olson, J. Schroder, and R. S. Tuminaro. A new perspective on strength measures in algebraic
multigrid. Numerical Linear Algebra with Applications, 17(4):713–733, 2010. ISSN 1099-1506. doi:
10.1002/nla.669. URL http://dx.doi.org/10.1002/nla.669.
[69] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs.
SIAM J. Matrix Anal. Appl., 11(3):430–452, May 1990. ISSN 0895-4798. doi: 10.1137/0611030. URL
http://dx.doi.org/10.1137/0611030.
[70] M. O. Rabin and V. V. Vazirani. Maximum matchings in general graphs through randomization.
Journal of Algorithms, 10(4):557 – 567, 1989. ISSN 0196-6774. doi: 10.1016/0196-6774(89)90005-9.
URL http://www.sciencedirect.com/science/article/pii/0196677489900059.
[71] J. W. Ruge and K. Stu¨ben. Algebraic multigrid. In Multigrid methods, volume 3 of Frontiers Appl.
Math., pages 73–130. SIAM, Philadelphia, PA, 1987.
[72] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, 2nd edition, 2003. ISBN 0898715342.
[73] I. Safro, P. Sanders, and C. Schulz. Advanced coarsening schemes for graph partitioning. In SEA, pages
369–380, 2012.
[74] A. Sen, H. Deng, and S. Guha. On a graph partitioning problem with applications to vlsi layout. In
Circuits and Systems, 1991., IEEE International Sympoisum on, pages 2846–2849 vol.5, Jun 1991. doi:
10.1109/ISCAS.1991.176137.
[75] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Graphics
Hardware 2007, pages 97–106. ACM, Aug. 2007.
[76] L. Shi. Bounds on the (laplacian) spectral radius of graphs. Linear Algebra and its Applications,
422(23):755 – 770, 2007. ISSN 0024-3795. doi: http://dx.doi.org/10.1016/j.laa.2006.12.003. URL
http://www.sciencedirect.com/science/article/pii/S0024379506005349.
[77] A. J. Soper, C. Walshaw, and M. Cross. A combined evolutionary search and multilevel optimisation
approach to graph-partitioning. J. of Global Optimization, 29(2):225–241, June 2004. ISSN 0925-5001.
doi: 10.1023/B:JOGO.0000042115.44455.f3. URL http://dx.doi.org/10.1023/B:JOGO.0000042115.
44455.f3.
[78] D. A. Spielman and S.-H. Teng. Spectral partitioning works: Planar graphs and finite element meshes.
In In IEEE Symposium on Foundations of Computer Science, pages 96–105, 1996.
[79] M. Stu¨rmer, H. Ko¨stler, and U. Ru¨de. How to optimize geometric multigrid methods on gpus. In 15th
Copper Mountain Conference on Multigrid Methods, March 2011.
[80] R. S. Tuminaro and C. Tong. Parallel smoothed aggregation multigrid : Aggregation strategies on mas-
sively parallel machines. SC Conference, 0:5, 2000. ISSN 1063-9535. doi: http://doi.ieeecomputersociety.
org/10.1109/SC.2000.10008.
[81] S. Tzeng and L.-Y. Wei. Parallel white noise generation on a gpu via cryptographic hash. In Proceedings
of the 2008 symposium on Interactive 3D graphics and games, pages 79–87. ACM, 2008.
[82] P. Vaneˇk, J. Mandel, and M. Brezina. Algebraic Multigrid by Smoothed Aggregation for Second and
Fourth Order Elliptic Problems. Computing, 56(3):179–196, 1996.
[83] C. Vasconcelos and B. Rosenhahn. Bipartite graph matching computation on gpu. In D. Cremers,
Y. Boykov, A. Blake, and F. Schmidt, editors, Energy Minimization Methods in Computer Vision and
Pattern Recognition, volume 5681 of Lecture Notes in Computer Science, pages 42–55. Springer Berlin
/ Heidelberg, 2009. ISBN 978-3-642-03640-8.
109
[84] R. W. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, 2003. AAI3121741.
[85] R. W. Vuduc and H.-J. Moon. Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block
Structure. In High Performance Computing And Communications: First International Conference,
HPCC 2005, Sorrento, Italy, September 21-23, 2005: Proceedings. Springer, 2005.
[86] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-
vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178 – 194, 2009.
ISSN 0167-8191. doi: 10.1016/j.parco.2008.12.006. URL http://www.sciencedirect.com/science/
article/pii/S0167819108001403. Revolutionary Technologies for Acceleration of Emerging Petascale
Applications.
[87] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for
multicore architectures. Commun. ACM, 52(4):65–76, Apr. 2009. ISSN 0001-0782. doi: 10.1145/
1498765.1498785. URL http://doi.acm.org/10.1145/1498765.1498785.
[88] F. Zafar, M. Olano, and A. Curtis. Gpu random numbers via the tiny encryption algorithm. In
Proceedings of the Conference on High Performance Graphics, pages 133–141. Eurographics Association,
2010.
110
