108 research outputs found
Analyzing and enhancing OSKI for sparse matrix-vector multiplication
Sparse matrix-vector multiplication (SpMxV) is a kernel operation widely used
in iterative linear solvers. The same sparse matrix is multiplied by a dense
vector repeatedly in these solvers. Matrices with irregular sparsity patterns
make it difficult to utilize cache locality effectively in SpMxV computations.
In this work, we investigate single- and multiple-SpMxV frameworks for
exploiting cache locality in SpMxV computations. For the single-SpMxV
framework, we propose two cache-size-aware top-down row/column-reordering
methods based on 1D and 2D sparse matrix partitioning by utilizing the
column-net and enhancing the row-column-net hypergraph models of sparse
matrices. The multiple-SpMxV framework depends on splitting a given matrix into
a sum of multiple nonzero-disjoint matrices so that the SpMxV operation is
performed as a sequence of multiple input- and output-dependent SpMxV
operations. For an effective matrix splitting required in this framework, we
propose a cache-size-aware top-down approach based on 2D sparse matrix
partitioning by utilizing the row-column-net hypergraph model. The primary
objective in all of the three methods is to maximize the exploitation of
temporal locality. We evaluate the validity of our models and methods on a wide
range of sparse matrices by performing actual runs through using OSKI.
Experimental results show that proposed methods and models outperform
state-of-the-art schemes.Comment: arXiv admin note: substantial text overlap with arXiv:1202.385
Cache locality exploiting methods and models for sparse matrix-vector multiplication
Ankara : The Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent University, 2009.Thesis (Master's) -- Bilkent University, 2009.Includes bibliographical references leaves 52-56.The sparse matrix-vector multiplication (SpMxV) is an important kernel operation
widely used in linear solvers. The same sparse matrix is multiplied by a dense vector
repeatedly in these solvers to solve a system of linear equations. High performance
gains can be obtained if we can take the advantage of today’s deep cache hierarchy
in SpMxV operations. Matrices with irregular sparsity patterns make it difficult to
utilize data locality effectively in SpMxV computations. Different techniques are proposed
in the literature to utilize cache hierarchy effectively via exploiting data locality
during SpMxV. In this work, we investigate two distinct frameworks for cacheaware/oblivious
SpMxV: single matrix-vector multiply and multiple submatrix-vector
multiplies. For the single matrix-vector multiply framework, we propose a cache-size
aware top-down row/column-reordering approach based on 1D sparse matrix partitioning
by utilizing the recently proposed appropriate hypergraph models of sparse
matrices, and a cache oblivious bottom-up approach based on hierarchical clustering
of rows/columns with similar sparsity patterns. We also propose a column compression
scheme as a preprocessing step which makes these two approaches cache-line-size
aware. The multiple submatrix-vector multiplies framework depends on the partitioning
the matrix into multiple nonzero-disjoint submatrices. For an effective matrixto-submatrix
partitioning required in this framework, we propose a cache-size aware
top-down approach based on 2D sparse matrix partitioning by utilizing the recently
proposed fine-grain hypergraph model. For this framework, we also propose a traveling
salesman formulation for an effective ordering of individual submatrix-vector
multiply operations. We evaluate the validity of our models and methods on a wide
range of sparse matrices. Experimental results show that proposed methods and models
outperforms state-of-the-art schemes.Akbudak, KadirM.S
Exploiting Locality in Sparse Matrix-Matrix Multiplication on Many-Core Architectures
Exploiting spatial and temporal localities is investigated for efficient row-by-row parallelization of general sparse matrix-matrix multiplication (SpGEMM) operation of the form C=A,B on many-core architectures. Hypergraph and bipartite graph models are proposed for 1D rowwise partitioning of matrix A to evenly partition the work across threads with the objective of reducing the number of B-matrix words to be transferred from the memory and between different caches. A hypergraph model is proposed for B-matrix column reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for C-matrix rows. A similarity graph model is proposed for B-matrix row reordering to increase temporal reuse of these accumulation array entries. The proposed models and methods are tested on a wide range of sparse matrices from real applications and the experiments were carried on a 60-core Intel Xeon Phi processor, as well as a two-socket Xeon processor. Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations. © 1990-2012 IEEE
Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors
Sparse matrix-vector and matrix-transpose-vector multiplication (SpMMTV) repeatedly performed as z ← ATx and y ← A z (or y ← A w) for the same sparse matrix A is a kernel operation widely used in various iterative solvers. One important optimization for serial SpMMTV is reusing A-matrix nonzeros, which halves the memory bandwidth requirement. However, thread-level parallelization of SpMMTV that reuses A-matrix nonzeros necessitates concurrent writes to the same output-vector entries. These concurrent writes can be handled in two ways: via atomic updates or thread-local temporary output vectors that will undergo a reduction operation, both of which are not efficient or scalable on processors with many cores and complicated cache-coherency protocols. In this work, we identify five quality criteria for efficient and scalable thread-level parallelization of SpMMTV that utilizes one-dimensional (1D) matrix partitioning. We also propose two locality-aware 1D partitioning methods, which achieve reusing A-matrix nonzeros and intermediate z-vector entries; exploiting locality in accessing x -, y -, and -vector entries; and reducing the number of concurrent writes to the same output-vector entries. These two methods utilize rowwise and columnwise singly bordered block-diagonal (SB) forms of A. We evaluate the validity of our methods on a wide range of sparse matrices. Experiments on the 60-core cache-coherent Intel Xeon Phi processor show the validity of the identified quality criteria and the validity of the proposed methods in practice. The results also show that the performance improvement from reusing A-matrix nonzeros compensates for the overhead of concurrent writes through the proposed SB-based methods. © 2015 IEEE
Parallel structurally-symmetric sparse matrix-vector products on multi-core processors
We consider the problem of developing an efficient multi-threaded
implementation of the matrix-vector multiplication algorithm for sparse
matrices with structural symmetry. Matrices are stored using the compressed
sparse row-column format (CSRC), designed for profiting from the symmetric
non-zero pattern observed in global finite element matrices. Unlike classical
compressed storage formats, performing the sparse matrix-vector product using
the CSRC requires thread-safe access to the destination vector. To avoid race
conditions, we have implemented two partitioning strategies. In the first one,
each thread allocates an array for storing its contributions, which are later
combined in an accumulation step. We analyze how to perform this accumulation
in four different ways. The second strategy employs a coloring algorithm for
grouping rows that can be concurrently processed by threads. Our results
indicate that, although incurring an increase in the working set size, the
former approach leads to the best performance improvements for most matrices.Comment: 17 pages, 17 figures, reviewed related work section, fixed typo
Algorithms for Large-Scale Sparse Tensor Factorization
University of Minnesota Ph.D. dissertation. April 2019. Major: Computer Science. Advisor: George Karypis. 1 computer file (PDF); xiv, 153 pages.Tensor factorization is a technique for analyzing data that features interactions of data along three or more axes, or modes. Many fields such as retail, health analytics, and cybersecurity utilize tensor factorization to gain useful insights and make better decisions. The tensors that arise in these domains are increasingly large, sparse, and high dimensional. Factoring these tensors is computationally expensive, if not infeasible. The ubiquity of multi-core processors and large-scale clusters motivates the development of scalable parallel algorithms to facilitate these computations. However, sparse tensor factorizations often achieve only a small fraction of potential performance due to challenges including data-dependent parallelism and memory accesses, high memory consumption, and frequent fine-grained synchronizations among compute cores. This thesis presents a collection of algorithms for factoring sparse tensors on modern parallel architectures. This work is focused on developing algorithms that are scalable while being memory- and operation-efficient. We address a number of challenges across various forms of tensor factorizations and emphasize results on large, real-world datasets
Partitioning sparse deep neural networks for scalable training and inference
The state-of-the-art deep neural networks (DNNs) have significant computational and data management requirements. The size of both training data and models continue to increase. Sparsification and pruning methods are shown to be effective in removing a large fraction of connections in DNNs. The resulting sparse networks present unique challenges to further improve the computational efficiency of training and inference in deep learning. Both the feedforward (inference) and backpropagation steps in stochastic gradient descent (SGD) algorithm for training sparse DNNs involve consecutive sparse matrix-vector multiplications (SpMVs). We first introduce a distributed-memory parallel SpMV-based solution for the SGD algorithm to improve its scalability. The parallelization approach is based on row-wise partitioning of weight matrices that represent neuron connections between consecutive layers. We then propose a novel hypergraph model for partitioning weight matrices to reduce the total communication volume and ensure computational load-balance among processors. Experiments performed on sparse DNNs demonstrate that the proposed solution is highly efficient and scalable. By utilizing the proposed matrix partitioning scheme, the performance of our solution is further improved significantly
A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication
The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today's multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves in accordance with the Roofline model. Outliers are discussed and analyzed in detail. While we focus on SymmSpMV in this paper, our algorithm and software is applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring
Scalable Graph Convolutional Network Training on Distributed-Memory Systems
Graph Convolutional Networks (GCNs) are extensively utilized for deep
learning on graphs. The large data sizes of graphs and their vertex features
make scalable training algorithms and distributed memory systems necessary.
Since the convolution operation on graphs induces irregular memory access
patterns, designing a memory- and communication-efficient parallel algorithm
for GCN training poses unique challenges. We propose a highly parallel training
algorithm that scales to large processor counts. In our solution, the large
adjacency and vertex-feature matrices are partitioned among processors. We
exploit the vertex-partitioning of the graph to use non-blocking point-to-point
communication operations between processors for better scalability. To further
minimize the parallelization overheads, we introduce a sparse matrix
partitioning scheme based on a hypergraph partitioning model for full-batch
training. We also propose a novel stochastic hypergraph model to encode the
expected communication volume in mini-batch training. We show the merits of the
hypergraph model, previously unexplored for GCN training, over the standard
graph partitioning model which does not accurately encode the communication
costs. Experiments performed on real-world graph datasets demonstrate that the
proposed algorithms achieve considerable speedups over alternative solutions.
The optimizations achieved on communication costs become even more pronounced
at high scalability with many processors. The performance benefits are
preserved in deeper GCNs having more layers as well as on billion-scale graphs.Comment: To appear in PVLDB'2
- …