108 research outputs found

    Analyzing and enhancing OSKI for sparse matrix-vector multiplication

    Sparse matrix-vector multiplication (SpMxV) is a kernel operation widely used in iterative linear solvers. The same sparse matrix is multiplied by a dense vector repeatedly in these solvers. Matrices with irregular sparsity patterns make it difficult to exploit cache locality effectively in SpMxV computations. In this work, we investigate single- and multiple-SpMxV frameworks for exploiting cache locality in SpMxV computations. For the single-SpMxV framework, we propose two cache-size-aware top-down row/column-reordering methods based on 1D and 2D sparse matrix partitioning, utilizing the column-net hypergraph model and an enhanced row-column-net hypergraph model of sparse matrices. The multiple-SpMxV framework depends on splitting a given matrix into a sum of multiple nonzero-disjoint matrices so that the SpMxV operation is performed as a sequence of multiple input- and output-dependent SpMxV operations. For the effective matrix splitting required in this framework, we propose a cache-size-aware top-down approach based on 2D sparse matrix partitioning utilizing the row-column-net hypergraph model. The primary objective in all three methods is to maximize the exploitation of temporal locality. We evaluate the validity of our models and methods on a wide range of sparse matrices by performing actual runs using OSKI. Experimental results show that the proposed methods and models outperform state-of-the-art schemes. (Comment: arXiv admin note: substantial text overlap with arXiv:1202.385)
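
    The multiple-SpMxV idea above lends itself to a short illustration. The C sketch below is not the paper's implementation: the CSRMatrix layout and submatrix contents are illustrative assumptions, and the actual splitting would come from the cache-size-aware hypergraph partitioning. It only shows how y = Ax can be computed as a sequence of accumulating SpMxV operations over nonzero-disjoint submatrices.

```c
/* Illustrative sketch: CSR SpMxV and the multiple-SpMxV framework, where
 * y = (A_1 + A_2 + ... + A_k) x is computed as a sequence of accumulating
 * submatrix-vector multiplies over nonzero-disjoint submatrices. */
typedef struct {
    int nrows;
    const int *rowptr;   /* size nrows + 1 */
    const int *colind;   /* size rowptr[nrows] */
    const double *val;   /* size rowptr[nrows] */
} CSRMatrix;

/* y += A x for one CSR (sub)matrix */
static void spmxv_accumulate(const CSRMatrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; ++i)
        for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p)
            y[i] += A->val[p] * x[A->colind[p]];
}

/* Multiple-SpMxV framework: y = sum over k nonzero-disjoint submatrices. */
void multiple_spmxv(const CSRMatrix *subs, int k, const double *x,
                    double *y, int nrows)
{
    for (int i = 0; i < nrows; ++i)
        y[i] = 0.0;
    for (int s = 0; s < k; ++s)
        spmxv_accumulate(&subs[s], x, y);
}
```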

    Cache locality exploiting methods and models for sparse matrix-vector multiplication

    Ankara: The Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Master's) -- Bilkent University, 2009. Includes bibliographical references (leaves 52-56). Sparse matrix-vector multiplication (SpMxV) is an important kernel operation widely used in linear solvers. The same sparse matrix is multiplied by a dense vector repeatedly in these solvers to solve a system of linear equations. High performance gains can be obtained by taking advantage of today's deep cache hierarchies in SpMxV operations. Matrices with irregular sparsity patterns make it difficult to utilize data locality effectively in SpMxV computations. Different techniques have been proposed in the literature to utilize the cache hierarchy effectively by exploiting data locality during SpMxV. In this work, we investigate two distinct frameworks for cache-aware/oblivious SpMxV: single matrix-vector multiply and multiple submatrix-vector multiplies. For the single matrix-vector multiply framework, we propose a cache-size-aware top-down row/column-reordering approach based on 1D sparse matrix partitioning that utilizes the recently proposed appropriate hypergraph models of sparse matrices, and a cache-oblivious bottom-up approach based on hierarchical clustering of rows/columns with similar sparsity patterns. We also propose a column compression scheme as a preprocessing step that makes these two approaches cache-line-size aware. The multiple submatrix-vector multiplies framework depends on partitioning the matrix into multiple nonzero-disjoint submatrices. For the effective matrix-to-submatrix partitioning required in this framework, we propose a cache-size-aware top-down approach based on 2D sparse matrix partitioning that utilizes the recently proposed fine-grain hypergraph model. For this framework, we also propose a traveling salesman formulation for an effective ordering of the individual submatrix-vector multiply operations. We evaluate the validity of our models and methods on a wide range of sparse matrices. Experimental results show that the proposed methods and models outperform state-of-the-art schemes. Akbudak, Kadir. M.S.
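
    The ordering idea mentioned above can be illustrated greedily: consecutive submatrix-vector multiplies should touch many common x-vector entries so that cached entries are reused. The nearest-neighbor heuristic below is only a stand-in for the thesis's traveling salesman formulation; the similarity measure (number of shared columns) and all names are assumptions for illustration.

```c
/* Greedy nearest-neighbor ordering of k submatrices so that consecutive
 * submatrix-vector multiplies share many x-vector columns (illustrative only). */
#include <stdlib.h>

/* a and b are sorted arrays of the distinct column indices of two submatrices */
static int common_columns(const int *a, int na, const int *b, int nb)
{
    int i = 0, j = 0, shared = 0;
    while (i < na && j < nb) {
        if (a[i] == b[j]) { ++shared; ++i; ++j; }
        else if (a[i] < b[j]) ++i;
        else ++j;
    }
    return shared;
}

/* Writes a processing order of the k submatrices into order[0..k-1]. */
void order_submatrices(const int **cols, const int *ncols, int k, int *order)
{
    if (k <= 0) return;
    char *used = calloc(k, 1);
    int cur = 0;                       /* start from submatrix 0 (arbitrary) */
    used[0] = 1;
    order[0] = 0;
    for (int step = 1; step < k; ++step) {
        int best = -1, best_shared = -1;
        for (int s = 0; s < k; ++s) {
            if (used[s]) continue;
            int shared = common_columns(cols[cur], ncols[cur], cols[s], ncols[s]);
            if (shared > best_shared) { best_shared = shared; best = s; }
        }
        order[step] = best;            /* next multiply reuses cached x entries */
        used[best] = 1;
        cur = best;
    }
    free(used);
}
```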

    Exploiting Locality in Sparse Matrix-Matrix Multiplication on Many-Core Architectures

    Exploiting spatial and temporal localities is investigated for efficient row-by-row parallelization of the general sparse matrix-matrix multiplication (SpGEMM) operation of the form C = AB on many-core architectures. Hypergraph and bipartite graph models are proposed for 1D rowwise partitioning of matrix A to evenly partition the work across threads with the objective of reducing the number of B-matrix words to be transferred from memory and between different caches. A hypergraph model is proposed for B-matrix column reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for C-matrix rows. A similarity graph model is proposed for B-matrix row reordering to increase temporal reuse of these accumulation-array entries. The proposed models and methods are tested on a wide range of sparse matrices from real applications, and the experiments were carried out on a 60-core Intel Xeon Phi processor as well as a two-socket Xeon processor. Results show the validity of the models and methods proposed for enhancing locality in parallel SpGEMM operations. © 1990-2012 IEEE
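
    The row-by-row SpGEMM kernel that these reordering methods target can be sketched as follows (Gustavson-style, with a thread-private dense accumulation array). The emit callback and the reset-whole-array step are illustrative simplifications, not the paper's code: real implementations reset only the touched entries and usually run a symbolic pass first.

```c
/* Row-by-row SpGEMM with a thread-private dense accumulator (sketch). */
#include <stdlib.h>
#include <string.h>
#include <omp.h>

typedef struct {
    int nrows, ncols;
    const int *rowptr, *colind;
    const double *val;
} CSR;

/* For each row i of C = A * B, accumulate partial products into a dense
 * buffer; emit(i, acc, ncolsB) consumes the finished row. */
void spgemm_rowbyrow(const CSR *A, const CSR *B,
                     void (*emit)(int row, const double *acc, int ncolsB))
{
    #pragma omp parallel
    {
        double *acc = calloc(B->ncols, sizeof *acc);   /* thread-private accumulator */
        #pragma omp for schedule(dynamic, 64)
        for (int i = 0; i < A->nrows; ++i) {
            for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p) {
                int k = A->colind[p];
                double a_ik = A->val[p];
                for (int q = B->rowptr[k]; q < B->rowptr[k + 1]; ++q)
                    acc[B->colind[q]] += a_ik * B->val[q];   /* scatter into row i */
            }
            emit(i, acc, B->ncols);
            /* Simplification: reset the whole buffer; real codes clear only
             * the entries touched by this row. */
            memset(acc, 0, B->ncols * sizeof *acc);
        }
        free(acc);
    }
}
```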

    Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors

    Sparse matrix-vector and matrix-transpose-vector multiplication (SpMMTV) repeatedly performed as z ← Aᵀx and y ← A z (or y ← A w) for the same sparse matrix A is a kernel operation widely used in various iterative solvers. One important optimization for serial SpMMTV is reusing A-matrix nonzeros, which halves the memory bandwidth requirement. However, thread-level parallelization of SpMMTV that reuses A-matrix nonzeros necessitates concurrent writes to the same output-vector entries. These concurrent writes can be handled in two ways: via atomic updates or via thread-local temporary output vectors that later undergo a reduction operation, neither of which is efficient or scalable on processors with many cores and complicated cache-coherency protocols. In this work, we identify five quality criteria for efficient and scalable thread-level parallelization of SpMMTV that utilizes one-dimensional (1D) matrix partitioning. We also propose two locality-aware 1D partitioning methods, which achieve reuse of A-matrix nonzeros and intermediate z-vector entries; exploit locality in accessing x-, y-, and z-vector entries; and reduce the number of concurrent writes to the same output-vector entries. These two methods utilize rowwise and columnwise singly bordered block-diagonal (SB) forms of A. We evaluate the validity of our methods on a wide range of sparse matrices. Experiments on the 60-core cache-coherent Intel Xeon Phi processor show the validity of the identified quality criteria and of the proposed methods in practice. The results also show that the performance improvement from reusing A-matrix nonzeros compensates for the overhead of concurrent writes through the proposed SB-based methods. © 2015 IEEE
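
    The serial nonzero-reuse optimization mentioned above can be illustrated with a column-by-column (CSC) sketch: each column's nonzeros are read once from memory and used for both z ← Aᵀx and y ← A z while still in cache. The SB-form-based thread-level parallelization contributed by the paper is not shown, and the CSCMatrix layout is an assumption.

```c
/* Serial SpMMTV sketch that reuses A's nonzeros: for each column j, first
 * compute z_j = (A^T x)_j, then immediately scatter y += z_j * A(:, j). */
typedef struct {
    int nrows, ncols;
    const int *colptr;   /* size ncols + 1 */
    const int *rowind;   /* size colptr[ncols] */
    const double *val;
} CSCMatrix;

void spmmtv_fused(const CSCMatrix *A, const double *x, double *z, double *y)
{
    for (int i = 0; i < A->nrows; ++i)
        y[i] = 0.0;
    for (int j = 0; j < A->ncols; ++j) {
        double zj = 0.0;
        /* z_j = (A^T x)_j : gather along column j */
        for (int p = A->colptr[j]; p < A->colptr[j + 1]; ++p)
            zj += A->val[p] * x[A->rowind[p]];
        z[j] = zj;
        /* y += z_j * A(:, j) : scatter along the same column, reusing its nonzeros */
        for (int p = A->colptr[j]; p < A->colptr[j + 1]; ++p)
            y[A->rowind[p]] += A->val[p] * zj;
    }
}
```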

    Parallel structurally-symmetric sparse matrix-vector products on multi-core processors

    We consider the problem of developing an efficient multi-threaded implementation of the matrix-vector multiplication algorithm for sparse matrices with structural symmetry. Matrices are stored using the compressed sparse row-column format (CSRC), designed to profit from the symmetric non-zero pattern observed in global finite element matrices. Unlike classical compressed storage formats, performing the sparse matrix-vector product using CSRC requires thread-safe access to the destination vector. To avoid race conditions, we have implemented two partitioning strategies. In the first one, each thread allocates an array for storing its contributions, which are later combined in an accumulation step. We analyze four different ways of performing this accumulation. The second strategy employs a coloring algorithm for grouping rows that can be concurrently processed by threads. Our results indicate that, although it increases the working set size, the former approach leads to the best performance improvements for most matrices. (Comment: 17 pages, 17 figures, reviewed related work section, fixed typo)
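
    A sketch of the first strategy (thread-private output arrays plus an accumulation step), assuming a generic lower-triangle-plus-diagonal CSR layout rather than the paper's exact CSRC structures; the comment marks the symmetric update that would be a race under a plain row-parallel loop.

```c
/* Symmetric SpMV with per-thread private output buffers and a reduction. */
#include <stdlib.h>
#include <omp.h>

typedef struct {
    int n;
    const int *rowptr, *colind;  /* strictly lower-triangular part, CSR */
    const double *lval;
    const double *diag;          /* diagonal entries */
} SymCSR;

void symm_spmv_private_buffers(const SymCSR *A, const double *x, double *y)
{
    int nthreads = omp_get_max_threads();
    double *buf = calloc((size_t)nthreads * A->n, sizeof *buf);

    #pragma omp parallel
    {
        double *my = buf + (size_t)omp_get_thread_num() * A->n;
        #pragma omp for schedule(static)
        for (int i = 0; i < A->n; ++i) {
            my[i] += A->diag[i] * x[i];
            for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p) {
                int j = A->colind[p];
                double a = A->lval[p];
                my[i] += a * x[j];   /* contribution of a_ij to y_i */
                my[j] += a * x[i];   /* symmetric contribution to y_j: the write that
                                        private buffers keep race-free */
            }
        }
        /* accumulation step: reduce the per-thread buffers into y */
        #pragma omp for schedule(static)
        for (int i = 0; i < A->n; ++i) {
            double s = 0.0;
            for (int t = 0; t < nthreads; ++t)
                s += buf[(size_t)t * A->n + i];
            y[i] = s;
        }
    }
    free(buf);
}
```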

    Algorithms for Large-Scale Sparse Tensor Factorization

    University of Minnesota Ph.D. dissertation. April 2019. Major: Computer Science. Advisor: George Karypis. 1 computer file (PDF); xiv, 153 pages. Tensor factorization is a technique for analyzing data that features interactions of data along three or more axes, or modes. Many fields such as retail, health analytics, and cybersecurity utilize tensor factorization to gain useful insights and make better decisions. The tensors that arise in these domains are increasingly large, sparse, and high dimensional. Factoring these tensors is computationally expensive, if not infeasible. The ubiquity of multi-core processors and large-scale clusters motivates the development of scalable parallel algorithms to facilitate these computations. However, sparse tensor factorizations often achieve only a small fraction of potential performance due to challenges including data-dependent parallelism and memory accesses, high memory consumption, and frequent fine-grained synchronizations among compute cores. This thesis presents a collection of algorithms for factoring sparse tensors on modern parallel architectures. This work is focused on developing algorithms that are scalable while being memory- and operation-efficient. We address a number of challenges across various forms of tensor factorizations and emphasize results on large, real-world datasets.
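
    For orientation, the dominant kernel in sparse CP-style tensor factorization is MTTKRP, whose data-dependent accesses are the kind of irregularity described above. The COO-based sketch below is a textbook illustration under assumed names and layouts, not the thesis's algorithms or data structures.

```c
/* MTTKRP for mode 0 of a third-order sparse tensor in COO form:
 * M(i, :) += X(i, j, k) * (B(j, :) .* C(k, :)) for every nonzero. */
typedef struct {
    long nnz;
    const int *i, *j, *k;    /* coordinates of each nonzero */
    const double *val;
} CooTensor3;

/* M, B, C are row-major dense factor matrices with 'rank' columns. */
void mttkrp_mode0(const CooTensor3 *X, const double *B, const double *C,
                  double *M, int rank)
{
    for (long n = 0; n < X->nnz; ++n) {
        const double *brow = B + (long)X->j[n] * rank;
        const double *crow = C + (long)X->k[n] * rank;
        double *mrow = M + (long)X->i[n] * rank;
        double v = X->val[n];
        for (int r = 0; r < rank; ++r)
            mrow[r] += v * brow[r] * crow[r];   /* irregular, data-dependent updates */
    }
}
```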

    Partitioning sparse deep neural networks for scalable training and inference

    State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements. The sizes of both training data and models continue to increase. Sparsification and pruning methods have been shown to be effective in removing a large fraction of the connections in DNNs. The resulting sparse networks present unique challenges to further improving the computational efficiency of training and inference in deep learning. Both the feedforward (inference) and backpropagation steps of the stochastic gradient descent (SGD) algorithm for training sparse DNNs involve consecutive sparse matrix-vector multiplications (SpMVs). We first introduce a distributed-memory parallel SpMV-based solution for the SGD algorithm to improve its scalability. The parallelization approach is based on row-wise partitioning of the weight matrices that represent neuron connections between consecutive layers. We then propose a novel hypergraph model for partitioning the weight matrices to reduce the total communication volume and ensure computational load balance among processors. Experiments performed on sparse DNNs demonstrate that the proposed solution is highly efficient and scalable. By utilizing the proposed matrix partitioning scheme, the performance of our solution is further improved significantly.
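
    The SpMV chain underlying the feedforward step can be sketched as below. The CSR layer layout and ReLU activation are illustrative assumptions; the distributed-memory row-wise partitioning and the hypergraph-based partitioner proposed in the paper are not shown.

```c
/* Feedforward pass of a sparse DNN as consecutive SpMVs over CSR weight matrices. */
typedef struct {
    int nrows, ncols;
    const int *rowptr, *colind;
    const double *val;
} CSRLayer;

/* out = relu(W * in); 'out' has W->nrows entries, 'in' has W->ncols entries. */
static void sparse_layer_forward(const CSRLayer *W, const double *in, double *out)
{
    for (int i = 0; i < W->nrows; ++i) {
        double s = 0.0;
        for (int p = W->rowptr[i]; p < W->rowptr[i + 1]; ++p)
            s += W->val[p] * in[W->colind[p]];
        out[i] = s > 0.0 ? s : 0.0;   /* ReLU, assumed for illustration */
    }
}

/* Chain the layers; act[l] holds the activation vector entering layer l. */
void sparse_dnn_forward(const CSRLayer *layers, int nlayers, double **act)
{
    for (int l = 0; l < nlayers; ++l)
        sparse_layer_forward(&layers[l], act[l], act[l + 1]);
}
```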

    A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication

    The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today's multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves in accordance with the Roofline model. Outliers are discussed and analyzed in detail. While we focus on SymmSpMV in this paper, our algorithm and software are applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring.
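
    A hedged sketch of coloring-based SymmSpMV parallelization: with a valid distance-2 coloring of the symmetric sparsity graph, rows of one color update disjoint sets of y entries, so each color becomes a conflict-free parallel loop. RACE's level construction and load balancing are omitted, and the coloring arrays (color_ptr/color_rows) are assumed to be given.

```c
/* Coloring-based parallel SymmSpMV: colors are processed one after another;
 * rows within a color can be handled in parallel without atomics. */
#include <omp.h>

typedef struct {
    int n;
    const int *rowptr, *colind;  /* strictly lower triangle, CSR */
    const double *lval;
    const double *diag;
} SymCSR;

void symm_spmv_colored(const SymCSR *A, const double *x, double *y,
                       int ncolors, const int *color_ptr, const int *color_rows)
{
    for (int i = 0; i < A->n; ++i)
        y[i] = A->diag[i] * x[i];

    for (int c = 0; c < ncolors; ++c) {           /* serial over colors */
        #pragma omp parallel for schedule(static)
        for (int q = color_ptr[c]; q < color_ptr[c + 1]; ++q) {
            int i = color_rows[q];                /* rows of one color: conflict-free */
            for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p) {
                int j = A->colind[p];
                double a = A->lval[p];
                y[i] += a * x[j];
                y[j] += a * x[i];                 /* safe: no two rows of this color
                                                     share a y entry (distance-2) */
            }
        }
    }
}
```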

    Scalable Graph Convolutional Network Training on Distributed-Memory Systems

    Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs. The large data sizes of graphs and their vertex features make scalable training algorithms and distributed-memory systems necessary. Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges. We propose a highly parallel training algorithm that scales to large processor counts. In our solution, the large adjacency and vertex-feature matrices are partitioned among processors. We exploit the vertex partitioning of the graph to use non-blocking point-to-point communication operations between processors for better scalability. To further minimize the parallelization overheads, we introduce a sparse matrix partitioning scheme based on a hypergraph partitioning model for full-batch training. We also propose a novel stochastic hypergraph model to encode the expected communication volume in mini-batch training. We show the merits of the hypergraph model, previously unexplored for GCN training, over the standard graph partitioning model, which does not accurately encode the communication costs. Experiments performed on real-world graph datasets demonstrate that the proposed algorithms achieve considerable speedups over alternative solutions. The improvements in communication costs become even more pronounced at high scale with many processors. The performance benefits are preserved in deeper GCNs having more layers as well as on billion-scale graphs. (Comment: To appear in PVLDB'2)
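
    One way to see what a hypergraph model encodes here is to compute the communication volume of a given 1D vertex partition for a full-batch GCN layer: the owner of each vertex must send that vertex's feature row to every other part that owns one of its neighbors. The sketch below computes this connectivity-minus-one style volume; it illustrates the modeling idea under assumed names and is not the paper's code.

```c
/* Communication volume of a 1D vertex partition for a GCN SpMM layer,
 * counted as (distinct non-owner parts among each vertex's neighbors) * feature_dim. */
#include <stdlib.h>
#include <string.h>

/* adj is the (symmetric) graph in CSR over nvertices; part[v] is in [0, nparts). */
long gcn_comm_volume(int nvertices, const int *rowptr, const int *colind,
                     const int *part, int nparts, int feature_dim)
{
    long volume = 0;
    char *needs = malloc(nparts);               /* parts needing vertex u's feature row */
    for (int u = 0; u < nvertices; ++u) {
        memset(needs, 0, nparts);
        for (int p = rowptr[u]; p < rowptr[u + 1]; ++p)
            needs[part[colind[p]]] = 1;         /* owner of neighbor v reads H[u] */
        int receivers = 0;
        for (int q = 0; q < nparts; ++q)
            if (needs[q] && q != part[u])
                ++receivers;                    /* connectivity - 1 for this vertex */
        volume += (long)receivers * feature_dim;
    }
    free(needs);
    return volume;
}
```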