A lightweight optimization selection method for Sparse Matrix-Vector
  Multiplication by Elafrou, Athena et al.
A lightweight optimization selection method
for Sparse Matrix-Vector Multiplication
Athena Elafrou Georgios Goumas Nectarios Koziris
National Technical University of Athens
{athena,goumas,nkoziris}@cslab.ece.ntua.gr
Abstract
In this paper, we propose an optimization selection method-
ology for the ubiquitous sparse matrix-vector multiplication
(SpMV) kernel. We propose two models that attempt to iden-
tify the major performance bottleneck of the kernel for every
instance of the problem and then select an appropriate opti-
mization to tackle it. Our first model requires online profil-
ing of the input matrix in order to detect its most prevail-
ing performance issue, while our second model only uses
comprehensive structural features of the sparse matrix. Our
method delivers high performance stability for SpMV across
different platforms and sparse matrices, due to its applica-
tion and architecture awareness. Our experimental results
demonstrate that a) our approach is able to distinguish and
appropriately optimize special matrices in multicore plat-
forms that fall out of the standard class of memory band-
width bound matrices, and b) lead to a significant perfor-
mance gain of 29% in a manycore platform compared to an
architecture-centric optimization, as a result of the success-
ful selection of the appropriate optimization for the great ma-
jority of the matrices. With a runtime overhead equivalent to
a couple dozen SpMV operations, our approach is practi-
cal for use in iterative numerical solvers of real-life applica-
tions.
Categories and Subject Descriptors J-2 [Computer Appli-
cations]: Physical Sciences and Engineering; G-1-3 [Nu-
merical Linear Algebra]
General Terms Algorithms, Performance
Keywords Sparse Matrices, Sparse Matrix-Vector Multi-
plication, Machine Learning, SpMV
[Copyright notice will appear here once ’preprint’ option is removed.]
1. Introduction
The ubiquitous sparse matrix-vector multiplication (SpMV)
kernel is a fundamental building block of popular iterative
methods for the solution of sparse linear systems (Ax =
b), and the approximation of eigenvalues and eigenvectors
of sparse matrices (Ax = λx). Such problems arise in
a diverse set of application domains, including large-scale
simulations of physical processes using multi-physics and
multi-disciplinary approaches, information retrieval, med-
ical imaging, economic modeling and others. Optimizing
SpMV has always been a challenging task due to a num-
ber of inherent performance limitations, as a result of the
algorithmic nature of the kernel, the employed sparse matrix
storage format and the sparsity pattern of the matrix. SpMV
is characterized by a very low flop:byte ratio, since the kernel
performs O(n2) operations on O(n2) amount of data, indi-
rect memory references as a result of storing the matrix in a
compressed format, irregular memory accesses to the source
vector due to sparsity, and loop overheads in the presence of
a large amount of very short rows in the matrix [7].
Most optimization efforts proposed in the literature over
the past few decades [1, 3, 9, 10, 14, 16, 20, 21, 23, 24]
have focused either on a subset of the aforementioned per-
formance issues or a single hardware platform and are, con-
sequently, neither portable nor provide stable performance
gains even within a single hardware platform. This is one
of the reasons why such optimizations are difficult to adopt
in real-life applications. Another reason concerns the non-
negligible runtime overhead that usually accompanies such
optimizations. This overhead may include format conver-
sion, format parameter tuning, reordering the matrix, etc.
If the numerical solver does not have enough iterations to
amortize this cost, the benefit of applying the optimization
may be outweighed. To make matters worse, one might in-
cur the overhead without any guarantee of a performance
improvement. Based on all the previous observations, we
have come to believe that the next solid step for improving
SpMV performance no longer involves proposing new and
expensive optimizations, but applying the plethora of avail-
able optimizations whenever they can be effective, i.e., trans-
forming the challenge of developing new optimizations to se-
1
ar
X
iv
:1
51
1.
02
49
4v
3 
 [c
s.P
F]
  1
1 J
an
 20
16
lecting an appropriate optimization for the target matrix and
architecture. We claim that a “universal” optimization effort
for SpMV is feasible, and, towards this direction, we pro-
pose a methodology to automatically select an efficient opti-
mization for SpMV, that is both application and architecture
aware. Our approach seeks to incorporate the following key
characteristics:
• stable: Optimization selection should provide perfor-
mance improvements for all sparse matrices.
• cross-platform: Optimization selection should be benefi-
cial on every architecture.
• lightweight: The runtime overhead of optimization selec-
tion should be low in order for it to be applicable in real-
life scenarios.
In order to provide performance stability, our method-
ology attempts to detect the major performance bottleneck
of SpMV for the input matrix. We formulate the decision
making as a classification problem in Section 3.1, assum-
ing classes represent performance bottlenecks. We then de-
velop a profiling-based classifier in Section 3.2.1, that re-
lies on micro-benchmarks to classify the matrix. Architec-
tural characteristics are implicitly deduced through these
micro-benchmarks, making this classifier architecture inde-
pendent. As the online-profiling phase has a non-negligible
cost, which may outweigh the optimization benefit, we go
one step beyond, and propose in Section 3.2.2 a classi-
fier that relies only on structural features of the input ma-
trix, avoiding any online, profiling-based information. This
feature-based classifier is pre-trained during an offline stage
with the use of machine learning techniques, and only per-
forms feature extraction on-the-fly. The runtime overhead
of this classifier is very low, equivalent to only a couple of
dozen SpMV operations, making it extremely lightweight.
Once the prevailing performance bottleneck of a matrix has
been detected, we accordingly apply an optimization that
could tackle it. Although we employ a simple and easy-to-
implement set of optimizations, our approach is compatible
to any optimization able to tackle the specific performance
bottlenecks. We have tested our method on one multi-core
(Intel Sandy Bridge) and one many-core (Intel Xeon Phi) ar-
chitecture for a wide set of candidate matrices. Our experi-
mental results demonstrate that a) on the multicore platform,
our approach is able to distinguish and appropriately opti-
mize special matrices that fall out of the standard class of
memory bandwidth bound matrices, and b) on the manycore
platform, our approach leads to a significant performance
gain of 29% compared to an architecture-centric optimiza-
tion, achieved by the successful selection of the appropriate
optimization for the great majority of the matrices.
2. Background
In order to avoid the extra computation and storage over-
heads imposed by the large majority of zero elements con-
.
.7:5 .2:9 .2:8 .2:7 .0 .0
.6:8 .5:7 .3:8 .0 .0 .0
.2:4 .6:2 .3:2 .0 .0 .0
.9:7 .0 .0 .2:3 .0 .0
.0 .0 .0 .0 .5:8 .5:0
.0 .0 .0 .0 .6:6 .8:1
0BBBBBBB@
1CCCCCCCAA =
.rowptr: . . . . .( .0 .4 .7 .10 .12 .14 .16 .) . . . . .
.colind: .( .0 .1 .2 .3 .0 .1 .2 .0 .1 .2 .0 .3 .4 .5 .4 .5 .)
.val: .( .7:5 .2:9 .2:8 .2:7 .6:8 .5:7 .3:8 .2:4 .6:2 .3:2 .9:7 .2:3 .5:8 .5:0 .6:6 .8:1 .)
Figure 1: The Compressed Sparse Row (CSR) sparse matrix
storage format.
tained in a sparse matrix, it has been the norm to store the
nonzero elements of the matrix contiguously in memory and
employ auxiliary data structures for the proper traversal of
the matrix and vector elements. The most widely used for-
mat, namely the Compressed Sparse Row (CSR) format [18],
uses a row pointer structure to index the start of each row
within the array of nonzero elements, and a column index
structure to store the column of each nonzero element. An
example of this format is given in Figure 1. The SpMV ker-
nel using this format is given in Algorithm 1.
Algorithm 1 SpMV computation using the CSR format.
1: procedure SPMV(A::in, x::in, y::out)
A: matrix in CSR format
x: input vector
y: output vector
2: for i← 1, N do
3: for j ← rowptr[i], rowptr[i+ 1] do
4: y[i]← val[j] ·x[colind[j]]
5: end for
6: end for
7: end procedure
Examining Algorithm 1, we notice three potential perfor-
mance issues for SpMV:
1. indirect memory references: This is the most apparent
implication of sparsity. Since we only want to store the
nonzero elements of the matrix, we need auxiliary in-
dexing structures to access them from memory. For the
CSR format we use the colind and rowptr data struc-
tures. Indexing, however, introduces additional load op-
erations, traffic for the memory subsystem, and cache in-
terference [16].
2. irregular memory accesses to vector x: Access to vec-
tor x in sparse matrices is irregular and depends on the
sparsity structure of the matrix. This fact complicates the
process of exploiting any spatial or temporal reuse in the
access to vector x [6, 15].
3. short row lengths: Many sparse matrices contain a large
number of rows with short lengths. This fact may degrade
2
performance due to the overhead incurred by the small
trip count of the inner loop [14].
3. Optimization Selection Methodology
Depending on the sparsity pattern of the matrix and the un-
derlying architecture, a suitable optimization for SpMV may
vary due to varying performance issues. Thus, SpMV could
benefit from an optimization selection process. The benefits
of such an approach are two-fold: firstly, performance can
be optimized for all problem instances, and, secondly, we
can leverage previous research on SpMV that has generally
focused on different instances of the problem. The solution
to the optimization selection problem will enable the devel-
opment of frameworks that will intelligently select the opti-
mal optimization for a particular input matrix, thus providing
performance stability for the SpMV kernel.
3.1 Formulation as a Classification Problem
The optimization selection problem can be solved in various
ways. One could simply take an empirical approach: mea-
sure how different optimizations work for a particular ma-
trix on the target machine and then apply the most efficient
optimization on future runs. However, this would require a
heavy profiling phase with a non-negligible runtime cost,
that could outweigh any benefit from optimizing SpMV. We
select a more elegant and lightweight approach to solve the
optimization selection problem. We formulate it as a clas-
sification problem, by assuming that every matrix belongs
to a single class, representing its major performance bottle-
neck. For every class, we assign a corresponding optimiza-
tion that attempts to tackle the specific bottleneck. Given an
input sparse matrix, its class is predicted and the correspond-
ing optimization is applied. In this context, we define the
following classes:
• CML: This class refers to matrices that suffer from ex-
cessive LLC misses and can therefore be bound by cache
miss latencies. The source of these misses can be deter-
mined by examining the SpMV kernel in Algorithm 1.
Accesses to the matrix elements (val) and indexing data
structures (rowptr, colind) show a regular streaming
behavior, which can be easily detected and prefetched
by hardware and, thus, are of limited concern to us.
However, accesses to the input vector are performed
through indirect indexing, which requires special hard-
ware prefetching mechanisms not available in current
architectures. Thus, if the sparsity pattern of the matrix
is very irregular, latency can become a bottleneck for
SpMV.
• MB: This class includes matrices that have saturated
the available memory bandwidth. This is the dominat-
ing class for SpMV on modern multicore architectures
due to its very low flop:byte ratio.
• IMB: This class appears mostly on many-core architec-
tures, where the large number of threads exposes highly
uneven row lengths in the matrix or regions with com-
pletely different sparsity patterns, resulting either in
workload imbalance or computational unevenness.
• CMP: This class includes matrices that are bound by
computation. Such matrices are mostly matrices that fit
in the system’s caches and are, therefore, not limited by
main memory bandwidth.
The above classes are quite generic, as they serve our
initial goal, which is to categorize a matrix based on its
prevailing performance bottleneck. Each class covers a wide
variety of matrices with different sparsity patterns. Thus,
one could further sub-categorize each class, based on more
distinguishing structural features. This would serve a more
elaborate optimization selection scheme and is left for future
work.
Instead of defining our own classes we could have used
cluster analysis to discover groups of matrices that are sim-
ilar in some manner. However, it seemed more intuitive to
leverage our own experience and the extensive research that
has been realized over the past few decades on SpMV op-
timization, and define our own classes. A “blind” machine
learning approach is not necessary in this case, since the ma-
jor performance bottlenecks of the kernel have already been
identified, granting us a basic understanding of the problem.
Another option would be to directly define optimizations as
classes. Every class would simply represent the best opti-
mization for its group of matrices. However, in this scenario,
we would have to associate features of a matrix to every op-
timization in our search space instead of every performance
bottleneck, which could be more error-prone. Also, this ap-
proach cannot lead to a cross-platform decision making, as
it is optimization dependent and optimizations can be com-
pletely different from one hardware platform to another.
3.2 Classifiers
We follow two approaches to solve our classification prob-
lem and guide the optimization selection for SpMV. The first
approach involves a hand-tuned classification algorithm that
relies on performance data collected during an online pro-
filing phase of the input matrix. We will henceforth refer to
this classifier as the profiling-based classifier. The second
approach leverages a machine learning technique to train a
classifier on a predefined set of matrices during an offline
stage and only requires a small number of structural features
to be extracted from the input matrix on-the-fly. This is the
feature-based classifier.
3.2.1 Profiling-Based Classifier
In this approach, the class of a matrix is determined through
a series of micro-benchmarks that are executed on the in-
put matrix during an online profiling phase. These bench-
marks attempt to implicitly extract the impact of the archi-
3
tectural characteristics of the underlying hardware platform
on SpMV performance. We define three benchmarks that run
a CSR-based SpMV kernel with some modification:
• noxmiss: This benchmark tries to eliminate irregular
memory accesses to vector x, by setting the column in-
dices of all nonzero elements to zero through the colind
array of CSR. Since irregularity results in cache misses,
it is indicative of matrices that belong to the CML class.
• inflate: This benchmark uses 64-bit indices for the rowptr
and colind arrays of CSR in order to increase the work-
ing set size of SpMV (the size of the indexing structures
is actually doubled). We expect this benchmark to indi-
cate which matrices are hindered by an increase in the
working set size, and, consequently, have a high proba-
bility of being memory bandwidth bound.
• balance: This benchmark measures the execution time of
all individual threads and reports the arithmetic mean.
This metric is indicative of the performance that could
be reclaimed by balancing the load. We must note here
that our default partitioning scheme for SpMV is static
one-dimensional row partitioning, with an equal (or as
close to equal as possible) number of nonzero elements
per partition.
Each benchmark reports SpMV execution time. For a
given input matrix, we run all three benchmarks along with
the baseline CSR-based implementation and compute the
speedup of each benchmark over the baseline CSR, with
the exception of inflate, for which we compute the inverse.
The classification is performed by evaluating the impact of
each benchmark on the baseline implementation. First, we
select the benchmark with the highest speedup and, if it ex-
ceeds some threshold, we classify the matrix accordingly.
If not, we examine the benchmark with the second highest
speedup and so on and so forth. If none of the three bench-
marks has a noteworthy impact on performance, then the ma-
trix is labeled as CMP. The thresholds have been empirically
set for the time being taking into account the ability of a
micro-benchmark to provide a naive or realistic bound on the
speedup that can be achieved for SpMV by tackling the cor-
responding problem. For example, the noxmiss benchmark,
which completely eliminates irregularity in the matrix, pro-
vides a naive performance bound and, thus, requires a strict
(high) threshold.
3.2.2 Feature-Based Classifier
We also use a machine learning approach to build a classifier.
The advantage of machine learning in this case is that, given
a set of classes, the classification rules can be automatically
deduced based on a training data set, and, subsequently, be
used for predictions, requiring no profiling executions.
We experiment with a Decision Tree classifier and a
Naive Bayes classifier and employ supervised learning al-
gorithms to generate them. The classifiers are built based
on features extracted from the matrix structure. Decision
Tree learning is performed using an optimized version of the
CART algorithm, which has a runtime cost of O(nfeatures
nsamples log nsamples) for the construction of the tree,
where nfeatures is the number of features used and nsamples
is the number of samples used to build the classifier. The
Naive Bayes classifier is created using Maximum A Poste-
riori (MAP) training with a Gaussian distribution assump-
tion, which has a time complexity of O(nfeaturesnsamples).
The query time for these classifiers is O(log nsamples) and
O(nclassesnfeatures) respectively. We generate both classi-
fiers using the scikit-learn machine learning toolkit.
Feature Extraction
This classifier uses real-valued features to perform the clas-
sification. Table 1 shows all the features we experimented
with, along with the time complexity of their extraction from
the input matrix. Most features are parameters that repre-
sent comprehensive characteristics of a sparse matrix and
are closely related to SpMV performance. For example, the
number of nonzero elements per row can indicate load im-
balance during SpMV execution. If a matrix contains very
uneven row lengths and a row-partitioning scheme is be-
ing employed for workload distribution, then threads that
are assigned long rows will take longer to execute result-
ing in thread imbalance. Uneven row lengths can easily be
detected through the corresponding metrics in Table 1 (see
nnz{min,max,avg,sd}). Similarly, irregularity in the accesses
to the input vector, which can lead to excessive cache misses,
can be estimated by examining how spread out the elements
in every row are (see dispersion{avg,sd}).
Training Data Selection
We train our classifiers with a data set consisting of sparse
matrices from the University of Florida Sparse Matrix Col-
lection [5]. We have selected a matrix suite consisting of
115 matrices from a wide variety of application domains,
to avoid being biased towards a specific sparsity pattern. We
must note here that we are examining all matrix sizes, not
only matrices exceeding the system’s LLC size. Also, we
have decided not to balance the training set in terms of class
representation, as that would require redefining the training
set on every hardware platform (as all classes are not equally
represented on each platform), which is not desirable for a
cross-platform approach. We did, nevertheless, experiment
with balanced training sets, but the improvement in the ac-
curacy of our classifiers was not significant enough to justify
imposing such a limitation.
Training Data Labeling
Labeling refers to the process of assigning a class to each
matrix that will be used in the training process of our clas-
sifier. Since the class of a matrix cannot be determined in
a straightforward manner, we use our profiling-based classi-
fier for this purpose. An issue that arises by this choice, and
4
Feature Definition Complexity
size 0:exceeds or 1:fits in LLC Θ(1)
density NNZN∗M Θ(1)
nnzmin min{nnz1, . . . , nnzN} Θ(N)
nnzmax max{nnz1, . . . , nnzN} Θ(N)
nnzavg
1
N
∑N
i=1 nnzi Θ(N)
nnzsd
√
1
N
∑N
i=1(nnzi − nnzavg)2 Θ(2N)
bwmin min{bw1, . . . , bwN} Θ(N)
bwmax max{bw1, . . . , bwN} Θ(N)
bwavg
1
N
∑N
i=1 bwi Θ(N)
bwsd
√
1
N
∑N
i=1(bwi − bwavg)2 Θ(2N)
dispersionavg
1
N
∑N
i=1 dispi Θ(N)
dispersionsd
√
1
N
∑N
i=1(dispi − dispersionavg)2 Θ(2N)
clustering 1N
∑N
i=1 clusti Θ(NNZ)
miss ratio 1N
∑N
i=1missesi Θ(NNZ)
Table 1: Sparse matrix features used for classification. N
denotes the number of rows in the matrix, M the number of
columns and NNZ the number of nonzero elements. nnzi
is the number of nonzero elements of row i, bwi the column
distance between the first and last nonzero element of row
i, dispi = nnzidistancei , clusti =
ngroupsi
nnzi
, where ngroupsi is
the number of groups formed by consecutive elements in row
i and, finally, missesi is the number of nonzero elements in
row i that can generate cache misses. We naively say that an
element will generate a cache miss when its distance from
the previous element in the same row exceeds the cache line
size of the system.
cannot be ignored, is the validity of the labelings, as it af-
fects the accuracy of the trained classifier. Since the decision
rules in the profiling-based classifier have been empirically
set, a matrix can be mislabeled if it falls on the boundaries
that separate classes.
3.3 Optimization Pool
A multitude of optimization techniques and specialized for-
mats have been proposed in the literature for improving the
performance of SpMV. Most of them require a preprocess-
ing phase, either to modify the sparsity pattern of the matrix
itself or to generate new data structures for the representa-
tion of the matrix, that result in higher compression ratios
of its memory footprint, better load balancing of the work-
load among threads, etc. This preprocessing overhead needs
to remain low, as it may offset the benefit of applying the
optimization, in particular when only tens of iterations are
required for a numerical solver to converge.
As most legacy code in scientific applications uses the
standard CSR format for sparse computations, we decided,
for the sake of simplicity and in order to minimize the pre-
processing costs, to incorporate only CSR-based optimiza-
tions. We currently experiment with a single optimization
per class, given in Table 2. However, our methodology is not
tied to specific optimizations and can be easily extended to
incorporate any optimization proposed in the literature, as
long as it is guaranteed to be effective for the corresponding
class.
Class Optimization
CML software prefetching on vector x
MB column index compression through delta coding [17]
IMB auto or dynamic scheduling (OpenMP)
CMP inner loop unrolling + vectorization
Table 2: Mapping of matrix classes to targeted optimiza-
tions.
For the CML class of matrices, we employ one of the
most important techniques for tolerating increasing cache
miss latencies, e.g., prefetching. Since the source of exces-
sive cache misses in SpMV is a result of indirect memory
addressing on the input vector, which cannot be efficiently
tackled by current hardware prefetching mechanisms, soft-
ware prefetching was applied. A single prefetch instruction
was inserted in the inner loop of SpMV, with a fixed prefetch
distance (although it can be tuned on a matrix basis for
a higher performance gain). The prefetch distance was set
equal to the number of elements that fit in a single cache
line of the hardware platform, assuming double-precision
floating-point values. The intuition behind this choice is that
for very irregular matrices, there will be limited to no spa-
tial locality for accesses to x in a single row, thus requir-
ing a prefetch instruction per element in order to hide the
latency. The data is prefetched in the L1 cache. For matri-
ces that are memory-bandwidth bound, we employ a rather
simple compression scheme, just for illustrative purposes.
We use delta indexing on the column indices of the nonzero
elements of the matrix, a technique which was originally
applied on SpMV by Pooch and Nieder [17]. We use 8-
or 16-bit deltas wherever possible, but never both, in order
to limit the branching overhead during SpMV computation.
For matrices that suffer from load imbalance we employ ei-
ther the dynamic or auto scheduling policies available in
the OpenMP runtime system [4]. When the auto schedule
is specified, the decision regarding scheduling is delegated
to the compiler, which has the freedom to choose any possi-
ble mapping of iterations (rows in this case) to threads. Fi-
nally, for compute-bound matrices we combine inner loop
unrolling with vectorization.
4. Experimental Evaluation
Our experimental evaluation focuses on two aspects: matrix
classification accuracy and the overall benefit of applying
our methodology for optimizing SpMV.
4.1 Experimental Setup and Methodology
Our experiments are performed on two hardware platforms,
including an Intel Xeon Phi coprocessor (codename Xeon
Phi) and a cache-coherent non-uniform memory access (cc-
NUMA) multiprocessor system. The NUMA system is a
four-way eight-core Intel Xeon E5-4620 configuration (co-
dename Sandy Bridge-EP). In the following, we will refer to
5
Codename Xeon Phi Sandy Bridge-EP
Model Intel Xeon Phi 3120P Intel Xeon E5-4620
Microarchitecture Intel Many Integrated Core Intel Sandy Bridge
Clock frequency 1.10 GHz 2.26 GHz
L1 cache (D/I) 32 KiB/32 KiB 32 KiB/32 KiB
L2 cache 512 KiB (per core) 256 KiB (per core)
L3 cache - 16 MiB
Cores/Threads 57/228 8/16
Multiprocessor configurations
Processors 1 4
Cores/Threads 57/228 32/64
Sustained memory b/w 119 GiB/s 4×13.5 GiB/s
Table 3: Technical characteristics of the hardware plat-
forms used for the experimental evaluations. The sustained
memory bandwidth figures are obtained with the STREAM
benchmark [13] utilizing the full system.
each platform using its codename. Table 3 lists the technical
specifications of our test platforms in more detail.
Both systems were running a 64-bit version of the Linux
OS. For the compilation of the software involved in our
performance evaluation, we used ICC-15.0.0, while we used
the OpenMP parallel programming API. To simulate the
typical sparse matrix storage case, we used 64-bit, double
precision floating point values for the nonzero elements.
4.2 Classifier Accuracy
First, we evaluate our feature-based classifier in terms of ac-
curacy, assuming the labels generated by the profiling-based
classifier are correct. The profiling-based classifier is not el-
igible to such an analysis, as we currently have no way of
accurately deciding upon the major performance bottleneck
of a matrix and, thus, correctly evaluating this classifier. Us-
ing optimizations to determine the label of a matrix would
require more than one optimization per class in order to be
safe, so we do not currently follow this approach. However,
we do provide later on some insight into how the classi-
fiers perform compared to the best among the employed op-
timizations for every matrix.
We estimate how accurately our models will perform in
practice using Leave-One-Out cross validation. According
to this methodology, for a training set of k matrices (115
in our case), k experiments are performed. For each experi-
ment k−1 matrices are used for training and one for testing.
The final accuracy reported by the whole process is the frac-
tion of k matrices for which the classifier correctly predicts
their class, i.e. the prediction matches the original labeling
of the matrix. This metric is important, as it determines the
suitability of the selected optimization for the target matrix.
However, it is not decisive, contrary to standard classifica-
tion problems, in the sense that it depends on the correct la-
beling of each matrix, which is not straightforward in some
cases. There exist matrices that have competing performance
bottlenecks, and, thus, their labeling is ambiguous. This fact,
actually, allows us to tolerate a less accurate labeling algo-
Classifier Feature sets Complexity Accuracy(%)
Decision Tree size, bw{avg,sd}
nnz{min,max,avg,sd}
miss ratio
dispersionsd
O(NNZ) 77
Naive Bayes nnz{min,max,sd}
bwavg
dispersion{avg,sd}
O(N) 75
Table 4: Feature-based classifiers for Xeon Phi.
Classifier Feature sets Complexity Accuracy(%)
Decision Tree size, bw{avg,sd}
nnz{min,max,avg,sd}
dispersionsd
miss ratio
O(NNZ) 84
Naive Bayes size, nnz{min,max} O(N) 83
Table 5: Feature-based classifiers for Sandy Bridge-EP.
rithm, such as our profiling-based classifier. It also means
that our classifier might improve the performance of SpMV
even in the case of a “misprediction”, as long as this mis-
prediction corresponds to another performance bottleneck.
Thus, achieving the highest possible accuracy is desirable,
but not of the highest priority.
Tables 4 and 5 give an overview of the final feature-based
classifiers for each hardware platform under consideration
along with the time complexity of the feature extraction and
the achieved accuracy. The selection of features for these
classifiers has been a result of exhaustive search. Decision
Tree classifiers seem to work slightly better for our prob-
lem, with a maximum accuracy of 77% on Xeon Phi and
84% on Sandy Bridge-EP. However, the Naive Bayes clas-
sifiers manage to attain competitive accuracies using fewer
features.
4.3 Optimization Selection Performance
Contrary to standard classification problems, our key mea-
sure of success is not classifier accuracy—rather, we aim
to maximize the average performance improvement of op-
timization selection over pre-selecting the most straightfor-
ward optimization on each architecture. In this way, we des-
ignate the value of optimization selection for SpMV. For
our multi-core architecture (Sandy Bridge-EP) we consider
column index compression (see Table 2) to be the straight-
forward optimization technique, while on our many-core ar-
chitecture (Xeon Phi), which is characterized by very wide
SIMD units, we consider vectorization to be the straightfor-
ward optimization.
Figure 2 presents the raw SpMV performance achieved
by the architecture-centric optimization, as defined previ-
ously, the best optimization among our pool of optimiza-
tions, and the optimization selected by each of our classi-
fiers. The labeling of each matrix according to the profiling-
based classifier is also provided. We note that predictions
from the feature-based classifier are obtained once again
through Leave-One-Out cross-validation, as described in the
6
previous subsection. On each platform we use the feature-
based classifier with the highest accuracy reported in Ta-
bles 4 and 5. Our initial observation concerns the diversity
of performance bottlenecks on our experimental platforms,
demonstrated by the labeling of the matrices. On Xeon Phi,
all classes are equally represented, unlike Sandy Bridge-EP,
which is dominated by matrices belonging to the MB class,
as expected. Our classifiers manage to successfully capture
this trend due to their architecture awareness.
Taking a closer look at Figure 2, we notice that the
profiling-based classifier does not always match the best
optimization, indicating that it might be misclassifying ma-
trices. However, this is not always the case, as some matrices
may be correctly classified, but the corresponding optimiza-
tion is not as effective. This is the case with Ga3As3H12 and
torso1 on Xeon Phi. On this platform, mislabelings happen
mostly with matrices that are competing between belonging
to the CML or IMB class. Nonetheless, in some cases, e.g.,
Hamrle3, Rucci1, torso1, Ga41As41H72, rajat31, inline 1,
our feature-based classifier manages to select a better opti-
mization, even though it has been trained with labels gener-
ated by the profiling-based classifier. On Sandy Bridge-EP,
where the majority of the matrices belong to the MB class,
both of our classifiers manage to optimize most of the corner
cases, e.g., cfd2, gupta2, rma10, offshore, etc. On this plat-
form, most matrices not belonging to the MB class fit in the
aggregate cache of the system, thus exposing weaknesses in
the computational part of SpMV.
To get an overall evaluation of our approach, for each
matrix we define the following metrics:
• ideal classifier speedup: the speedup provided by apply-
ing the best optimization amongst those under consider-
ation
• profiling-based classifier speedup: the speedup provided
by applying the optimization selected by the profiling-
based classifier
• feature-based classifier speedup: the speedup provided
by applying the optimization selected by the feature-
based classifier
Using the unoptimized CSR-based SpMV implementa-
tion as the baseline, for each of the aforementioned metrics
we plot the average, min and max values over the entire ma-
trix set, as well as the interquartile range. i.e., the range be-
tween the first and third quartiles, which represents the 25%-
75% range of the values. Figure 2 presents the corresponding
results on each platform in the form of a box plot. The lower
and upper whiskers correspond to the min and max values,
while the bottom and top of the box are the first and third
quartiles respectively. The circle corresponds to the average.
On Xeon Phi, the importance of optimization selec-
tion is more prominent, providing an average speedup of
1.31× with the profiling-based classifier and 1.30× with the
feature-based classifier, reaching 94.2% and 93.5% respec-
tively of the best possible gain with the applied optimiza-
tions. This is mainly due to the architectural characteristics
of the Xeon Phi coprocessor, which result in a quite bal-
anced representation of each class. The coprocessor has a
large amount of cores, which favor workload imbalance, and
a very expensive (an order of magnitude bigger compared to
multi-cores) cache miss latency, which further exposes ir-
regularity in the accesses to the input vector. Thus, there are
many matrices that exceed the LLC size and are not mem-
ory bandwidth bound. On the other hand, Sandy Bridge-EP
is dominated by memory bandwidth bound matrices and
can, therefore, only benefit from the correct classification of
matrices belonging to minority classes. This makes higher
accuracy rates more important on this architecture. Overall,
our classifiers manage to optimize most of those matrices,
leading to a 1.19× and 1.20× average speedup respectively
and adding a 3% and 4% improvement over the architecture-
centric optimization. Even though there are some misclassi-
fications, they can be partially tolerated by the fact that the
architecture-centric optimization can also be “harmful” for
some matrices.
In total, the potential of an optimization selection ap-
proach for SpMV can be better estimated if we assume we
have a “best” classifier, i.e., a classifier that always selects
the best optimization presented in Figure 2, and compare it
to an “architecture-centric” optimizer. In the box plots pro-
vided, we see that the “best” classifier improves SpMV per-
formance over the “architecture-centric” by 37% on Xeon
Phi and 9% on Sandy Bridge-EP on average.
It is important to point out that the achieved performance
improvements presented in this work are limited by the opti-
mizations we have selected. As this is not an actual optimiza-
tion framework, we selected optimizations that can be easily
implemented. Higher performance gains can be attained by
a more elaborate examination of the wide variety of opti-
mizations found in the literature and their association to our
matrix classes. Selecting an even more suitable optimization
after the matrix has been classified, based on its distinguish-
ing structural features, is left for future work.
4.4 Runtime overhead
Our profiling-based optimization selection approach com-
prises two steps: running a number of micro-benchmarks
on the input matrix and then applying our empirically-tuned
classification algorithm to select the appropriate optimiza-
tion. On the other hand, the online stage of the feature-based
optimization selection methodology comes down to extract-
ing the required features from the input matrix and then pre-
dicting its class using the pre-trained classifier.
The runtime overhead of both classification approaches
was measured in multithreaded CSR SpMV operations and
is presented in Table 6, i.e., we report the ratio TclassificationTSpMV ,
where Tclassification is the execution time of the classi-
fication and TSpMV the execution time of a single mul-
7
fid
ap
01
1
fl2
01
0
la
ng
ua
ge
cf
d2
ca
20
10
ca
ge
12
gu
pt
a2
rm
a
10
po
iss
on
3D
b
tx
20
10
AS
IC
_6
80
ks
he
lm
2d
03
th
er
m
om
ec
h_
dK
G
a3
As
3H
12
st
om
ac
h
sm
e
3D
c
e
co
lo
gy
2
w
e
bb
as
e-
1M
FE
M
_3
D_
th
er
m
al
2
pa
ra
bo
lic
_f
em
la
m
in
ar
_d
uc
t3
D
xe
n
o
n
2
ba
rri
er
2-
9
CO
AS
IC
_6
80
k
iC
he
m
_J
ac
ob
ia
n
G
e9
9H
10
0
o
ffs
ho
re
th
re
ad
to
rs
o3
G
a1
9A
s1
9H
42
sh
ip
_0
01
ge
ar
bo
x
s3
dk
q4
m
2
a
pa
ch
e2
m
o
n
o
_
50
0H
z
cr
a
n
ks
eg
_1
la
rg
eb
as
is
Si
O
2
H
am
rle
3
co
n
sp
h
pr
e2
TS
O
PF
_F
S_
b3
00
_c
3
Be
nE
le
ch
i1
bm
w7
st
_1
ca
ge
13
Si
41
G
e4
1H
72
co
n
t1
_l
G
3_
cir
cu
it
de
gm
e
PR
02
R
R
uc
ci
1
to
rs
o1
TS
O
PF
_R
S_
b6
78
_c
2
th
er
m
al
2
hu
m
an
_g
en
e2
a
f_
sh
el
l10
4
8
12
16
20
24
28
32
G
flo
p/
s
CMP
CMLCML
CMP
CML
CML
CMP
CMP
CMLCML
CML
CMLCML
IMB
CMP
CML
CML
CML
MB
CML
IMB
CMP
IMB
CML
IMB
CMLIMB
CML
MB
CML
IMB
MB MB
MB
CML
CML
CMP
CMPIMB
IMB
CMP
CML
IMB
MB MB
CML
IMB
IMB
IMB
IMB
CMP
CMP
IMBIMB
CML
CMP
MB
architecture-centric optimization
best optimization
profiling-based classifier
feature-based classifier
a
tm
os
m
od
j
m
_
t1
m
sd
oo
r
bu
nd
le
_a
dj
bm
wc
ra
_1
Si
87
H7
6
a
tm
os
m
od
m
ho
od
o
hn
e2
ra
il4
28
4
bm
w3
_2
gs
m
_1
06
85
7
ki
m
2
pw
tk
hu
m
an
_g
en
e1 JP
cr
a
n
ks
eg
_2
n
d1
2k
n
d2
4k
m
o
u
se
_
ge
ne
n
lp
kk
t8
0
Cu
rlC
ur
l_
4
kk
t_
po
we
r
m
e
m
ch
ip
TS
O
PF
_R
S_
b2
38
3
a
f_
5_
k1
01
G
a4
1A
s4
1H
72
PF
lo
w
_7
42
Fr
ee
sc
al
e1
w
ik
ip
ed
ia
-2
00
51
10
5
ci
rc
ui
t5
M
_d
c
fe
m
_h
ifr
eq
_c
irc
ui
t
Em
ilia
_9
23
ra
jat
31
Co
up
Co
ns
3D
Tr
an
sp
or
t
Fr
ee
sc
al
e2 F1
a
f_
sh
el
l1
0
Fu
llC
hi
p
ca
ge
14
fd
if2
02
x2
02
x1
02
H
oo
k_
14
98
G
eo
_1
43
8
Se
re
na
in
lin
e_
1
R
M
07
R
G
L7
d1
9
a
u
di
kw
_1
w
ik
ip
ed
ia
-2
00
61
10
4
sp
al
_0
04
ld
oo
r
n
lp
kk
t1
20
bo
ne
S1
0
Bu
m
p_
29
11
ca
ge
15
n
lp
kk
t1
60
n
lp
kk
t2
00
matrix suite
0
4
8
12
16
20
24
28
32
G
flo
p/
s
CML
MB
IMB
MB
MB
CMLCML
IMB
CML
CML
MB
CML
CMP
MB
CML
CML
MB
MB
MB
CML
IMB
MB
CML
CML
IMB
CMP
CML
IMB
CML
CML
CML
CML
CMP
CMP
CMP
CML
CML
CML
MB
IMB
CML
CML
CMP
CMP
CMP
IMB
MB
CML
CMP
CML
CML
CMP
IMB
MB
IMB
CML
IMB
IMB
best
profiling-based
feature-based
architecture-centric
0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
R
el
at
iv
e 
pe
rfo
rm
an
ce
 o
ve
r 
u
n
o
pt
im
iz
ed
 C
SR
(a) Xeon Phi (224 threads)
fid
ap
01
1
fl2
01
0
la
ng
ua
ge
cf
d2
ca
20
10
ca
ge
12
gu
pt
a2
rm
a
10
po
iss
on
3D
b
tx
20
10
AS
IC
_6
80
ks
he
lm
2d
03
th
er
m
om
ec
h_
dK
G
a3
As
3H
12
st
om
ac
h
sm
e
3D
c
e
co
lo
gy
2
w
e
bb
as
e-
1M
FE
M
_3
D_
th
er
m
al
2
pa
ra
bo
lic
_f
em
la
m
in
ar
_d
uc
t3
D
xe
n
o
n
2
ba
rri
er
2-
9
CO
AS
IC
_6
80
k
iC
he
m
_J
ac
ob
ia
n
G
e9
9H
10
0
o
ffs
ho
re
th
re
ad
to
rs
o3
G
a1
9A
s1
9H
42
sh
ip
_0
01
ge
ar
bo
x
s3
dk
q4
m
2
a
pa
ch
e2
m
o
n
o
_
50
0H
z
cr
a
n
ks
eg
_1
la
rg
eb
as
is
Si
O
2
H
am
rle
3
co
n
sp
h
pr
e2
TS
O
PF
_F
S_
b3
00
_c
3
Be
nE
le
ch
i1
bm
w7
st
_1
ca
ge
13
Si
41
G
e4
1H
72
co
n
t1
_l
G
3_
cir
cu
it
de
gm
e
PR
02
R
R
uc
ci
1
to
rs
o1
TS
O
PF
_R
S_
b6
78
_c
2
th
er
m
al
2
hu
m
an
_g
en
e2
a
f_
sh
el
l10
6
12
18
24
30
36
42
48
54
G
flo
p/
s
CMP
CMP
CMP
CMP
IMB
MB
IMB
IMB
CML
MB
CMP
CMLMB
MB
CMP
MB
MB
MB
MB
CML
MB
CMP
CMPMB
IMB
MB
MB
CMP
MB
MB MB
MB
CMP
MB
MB MB
MB
MB MB
CML
MB
MB
IMB
MB
MB MB MB MB MB
CMLMB MB
MB MB
MB
MB MB
architecture-centric optimization
best optimization
profiling-based classifier
feature-based classifier
a
tm
os
m
od
j
m
_
t1
m
sd
oo
r
bu
nd
le
_a
dj
bm
wc
ra
_1
Si
87
H7
6
a
tm
os
m
od
m
ho
od
o
hn
e2
ra
il4
28
4
bm
w3
_2
gs
m
_1
06
85
7
ki
m
2
pw
tk
hu
m
an
_g
en
e1 JP
cr
a
n
ks
eg
_2
n
d1
2k
n
d2
4k
m
o
u
se
_
ge
ne
n
lp
kk
t8
0
Cu
rlC
ur
l_
4
kk
t_
po
we
r
m
e
m
ch
ip
TS
O
PF
_R
S_
b2
38
3
a
f_
5_
k1
01
G
a4
1A
s4
1H
72
PF
lo
w
_7
42
Fr
ee
sc
al
e1
w
ik
ip
ed
ia
-2
00
51
10
5
ci
rc
ui
t5
M
_d
c
fe
m
_h
ifr
eq
_c
irc
ui
t
Em
ilia
_9
23
ra
jat
31
Co
up
Co
ns
3D
Tr
an
sp
or
t
Fr
ee
sc
al
e2 F1
a
f_
sh
el
l1
0
Fu
llC
hi
p
ca
ge
14
fd
if2
02
x2
02
x1
02
H
oo
k_
14
98
G
eo
_1
43
8
Se
re
na
in
lin
e_
1
R
M
07
R
G
L7
d1
9
a
u
di
kw
_1
w
ik
ip
ed
ia
-2
00
61
10
4
sp
al
_0
04
ld
oo
r
n
lp
kk
t1
20
bo
ne
S1
0
Bu
m
p_
29
11
ca
ge
15
n
lp
kk
t1
60
n
lp
kk
t2
00
matrix suite
0
2
4
6
8
10
G
flo
p/
s
MB
MB
MB MB
MB MB
MB
MB MB
CML
MB MB
MB
MB
MB
MB MB MB MB
MB
MB MB
MB MB
MB
MB MB MB
CML
CMLCML
MB MB
MB
MB
MB
MB
MB
MB
IMB
MB
MB
MB MB MB MB MB
CML
MB
CML
MB
MB
MB
MB MB MB
MB MB
best
profiling-based
feature-based
architecture-centric
0
0.5
1.0
1.5
2.0
2.5
R
el
at
iv
e 
pe
rfo
rm
an
ce
 o
ve
r 
u
n
o
pt
im
iz
ed
 C
SR
(b) Sandy Bridge-EP (32 threads)
Figure 2: Performance results of optimization selection on Xeon Phi and Sandy Bridge-EP. architecture-centric optimization
refers to the optimization corresponding to the CMP class for Xeon Phi and MB class on Sandy Bridge-EP, while best
optimization represents the maximum performance achieved with any of the tested optimizations (see Table 2).
8
profiling-based feature-based
Xeon Phi 462 14
Sandy Bridge-EP 462 16
Table 6: Average runtime overhead of optimization selection
expressed in multi-threaded CSR-based SpMVs, using both
classifiers presented in Section 3.2.
tithreaded SpMV operation. Apparently, the feature-based
classifier has a significant advantage over its profiling-based
alternative, as it requires only a couple dozen operations,
which can be easily amortized in the context of an iterative
numerical solver. We must note here that this overhead does
not include any preprocessing required for applying an op-
timization. Among the optimizations applied in our evalua-
tion, only the one for MB matrices requires preprocessing.
5. Related Work
Different sparse matrices have different sparsity patterns,
and different architectures have different strengths and
weaknesses. In order to achieve the best SpMV performance
for the target sparse matrix on the target platform, an auto-
tuning approach has long been considered to be beneficial.
The first autotuning approaches attempted to tune parame-
ters of specific sparse matrix storage formats. Towards this
direction, the Optimized Sparse Kernel Interface (OSKI) li-
brary [22] was developed as a collection of high performance
sparse matrix operation primitives on single core processors.
It relies on the SPARSITY framework [10] to tune the SpMV
kernel, by applying multiple optimizations, including regis-
ter blocking and cache blocking. Autotuning has also been
used to find the best block and slice sizes of the input sparse
matrix on modern CMPs and GPUs [2].
There have been some research efforts closer to our work.
The clSpMV framework [19] is the first framework that an-
alyzes the input sparse matrix at runtime, and recommends
the best representation of the given sparse matrix, but it is re-
stricted to GPU platforms. Towards the same direction, the
authors in [8] present an analytical and profile-based perfor-
mance modeling to predict the execution time of SpMV on
GPUs using different sparse matrix storage formats, in or-
der the select the most efficient format for the target matrix.
For each format under consideration, they establish a rela-
tionship between the number of nonzero elements per row
in the matrix and the execution time of SpMV using that
format, thus encapsulating to some degree the structure of
the matrix in their methodology. Similarly, in [12], the au-
thors propose a probabilistic model to estimate the execution
time of SpMV on GPUs for different sparse matrix formats.
They define a probability mass function to analyze the spar-
sity pattern of the target matrix and use it to estimate the
compression efficiency of every format they examine. Com-
bined with the hardware parameters of the GPU, they predict
the performance of SpMV for every format. Since compres-
sion efficiency is the determinant factor in this approach,
it is mainly targeted for memory bandwidth bound matri-
ces. Closer to our approach is the SMAT autotuning frame-
work [11]. This framework selects the most efficient format
for the target matrix using feature parameters of the sparse
matrix. It treats the format selection process as a classifi-
cation problem, with each format under consideration rep-
resenting a class, and leverages a data mining approach to
generate a decision tree to perform the classification, based
on the extracted feature parameters of the matrix. The distin-
guishing advantage of our optimization selection methodol-
ogy over the aforementioned approaches, is that it decouples
the decision making from specific optimizations, by predict-
ing the major performance bottleneck of SpMV instead of
SpMV execution time using a specific optimization. Thus,
in contrary to the above frameworks, where incorporating a
new optimization requires either retraining a model or defin-
ing a new one, our decision-making approach allows an au-
totuning framework to be easily extended, simply by assign-
ing the new optimization to one of the classes.
6. Conclusions - Future Work
In this paper we presented a cross-platform methodology for
optimizing the SpMV kernel and establish that, depending
on the sparsity pattern of the matrix and the underlying ar-
chitecture, a suitable optimization can be selected to improve
SpMV performance. We formulate optimization selection
for SpMV as a classification problem, with each class cor-
responding to a performance bottleneck. We first propose a
profiling-based classifier, that relies on benchmarking the in-
put matrix to perform the decision making. We also leverage
machine learning to train Decision Tree and Naive Bayes
classifiers that use comprehensive structural features to per-
form the classification and rely on the profiling-based clas-
sifier for the training process. Experimental evaluation on
115 sparse matrices and two platforms has demonstrated that
our methodology is very promising, especially for the lat-
est many-core architectures, which intensify various perfor-
mance issues of the SpMV kernel.
Concerning future work, another direction that needs to
be explored is whether a multi-level optimization approach
would be beneficial for SpMV. It is very likely that, once the
major performance issue of a matrix has been successfully
addressed, another bottleneck will be exposed. Actually, the
features that are extracted from the matrix and fed to our
featured-based classifier can also be used in a following
step to select even more targeted optimizations, depending
solely on the sparsity pattern. We also intend to experiment
with other machine learning techniques, in order to further
improve the accuracy of our feature-based classifier. For
example, semi-supervised learning, which allows some of
the training data to be unlabeled, could be helpful, since
9
there are matrices whose class cannot be easily determined.
Finally, we plan to test our approach on GPU platforms.
References
[1] R. C. Agarwal, F. G. Gustavson, and M. Zubair. A high perfor-
mance algorithm using pre-processing for the sparse matrix-
vector multiplication. In Proceedings of the 1992 ACM/IEEE
conference on Supercomputing, pages 32–41. IEEE Computer
Society Press, 1992.
[2] J. W. Choi, A. Singh, and R. W. Vuduc. Model-driven au-
totuning of sparse matrix-vector multiply on gpus. In ACM
Sigplan Notices, volume 45, pages 115–126. ACM, 2010.
[3] E. Cuthill and J. McKee. Reducing the bandwidth of sparse
symmetric matrices. In Proceedings of the 1969 24th national
conference, pages 157–172. ACM, 1969.
[4] L. Dagum and R. Enon. Openmp: an industry standard api
for shared-memory programming. Computational Science &
Engineering, IEEE, 5(1):46–55, 1998.
[5] T. A. Davis and Y. Hu. The university of florida sparse ma-
trix collection. ACM Transactions on Mathematical Software
(TOMS), 38(1):1, 2011.
[6] R. Geus and S. Ro¨llin. Towards a fast parallel sparse matrix-
vector multiplication. In PARCO, pages 308–315. World
Scientific, 1999.
[7] G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and
N. Koziris. Performance evaluation of the sparse matrix-
vector multiplication on modern architectures. The Journal
of Supercomputing, 50(1):36–77, 2009.
[8] P. Guo, L. Wang, and P. Chen. A performance modeling and
optimization analysis tool for sparse matrix-vector multiplica-
tion on gpus. Parallel and Distributed Systems, IEEE Trans-
actions on, 25(5):1112–1123, 2014.
[9] E.-J. Im and K. Yelick. Optimizing sparse matrix computa-
tions for register reuse in sparsity. In Computational Science-
ICCS 2001, pages 127–136. Springer, 2001.
[10] E.-J. Im, K. Yelick, and R. Vuduc. Sparsity: Optimization
framework for sparse matrix kernels. International Journal of
High Performance Computing Applications, 18(1):135–158,
2004.
[11] J. Li, G. Tan, M. Chen, and N. Sun. Smat: an input adaptive
auto-tuner for sparse matrix-vector multiplication. In ACM
SIGPLAN Notices, volume 48, pages 117–126. ACM, 2013.
[12] K. Li, W. Yang, and K. Li. Performance analysis and opti-
mization for spmv on gpu using probabilistic modeling. Par-
allel and Distributed Systems, IEEE Transactions on, 26(1):
196–205, 2015.
[13] J. D. McCalpin. Stream: Sustainable memory bandwidth in
high performance computers, 1995.
[14] J. Mellor-Crummey and J. Garvin. Optimizing sparse matrix–
vector product computations using unroll and jam. Interna-
tional Journal of High Performance Computing Applications,
18(2):225–236, 2004.
[15] J. C. Pichel, D. B. Heras, J. C. Cabaleiro, and F. F. Rivera.
Improving the locality of the sparse matrix-vector product on
shared memory multiprocessors. In Parallel, Distributed and
Network-Based Processing, 2004. Proceedings. 12th Euromi-
cro Conference on, pages 66–71. IEEE, 2004.
[16] A. Pinar and M. T. Heath. Improving performance of sparse
matrix-vector multiplication. In Proceedings of the 1999
ACM/IEEE conference on Supercomputing, page 30. ACM,
1999.
[17] U. W. Pooch and A. Nieder. A survey of indexing techniques
for sparse matrices. ACM Computing Surveys (CSUR), 5(2):
109–133, 1973.
[18] Y. Saad. Numerical methods for large eigenvalue problems,
volume 158. SIAM, 1992.
[19] B.-Y. Su and K. Keutzer. clspmv: A cross-platform opencl
spmv framework on gpus. In Proceedings of the 26th ACM
international conference on Supercomputing, pages 353–364.
ACM, 2012.
[20] O. Temam and W. Jalby. Characterizing the behavior of
sparse algorithms on caches. In Proceedings of the 1992
ACM/IEEE conference on Supercomputing, pages 578–587.
IEEE Computer Society Press, 1992.
[21] S. Toledo. Improving the memory-system performance of
sparse-matrix vector multiplication. IBM Journal of research
and development, 41(6):711–725, 1997.
[22] R. Vuduc, J. W. Demmel, and K. A. Yelick. Oski: A library
of automatically tuned sparse matrix kernels. In Journal
of Physics: Conference Series, volume 16, page 521. IOP
Publishing, 2005.
[23] R. W. Vuduc and H.-J. Moon. Fast sparse matrix-vector
multiplication by exploiting variable block structure. In High
Performance Computing and Communications, pages 807–
816. Springer, 2005.
[24] J. Willcock and A. Lumsdaine. Accelerating sparse ma-
trix computations via data compression. In Proceedings of
the 20th annual international conference on Supercomputing,
pages 307–316. ACM, 2006.
10
