    Cache locality exploiting methods and models for sparse matrix-vector multiplication

    Ankara: Bilkent University, Department of Computer Engineering and Information Science, Institute of Engineering and Science, 2009. M.S. thesis by Kadir Akbudak; includes bibliographical references (leaves 52-56).

    Sparse matrix-vector multiplication (SpMxV) is an important kernel operation widely used in linear solvers, where the same sparse matrix is multiplied by a dense vector repeatedly to solve a system of linear equations. High performance gains can be obtained by taking advantage of today's deep cache hierarchies in SpMxV operations, but matrices with irregular sparsity patterns make it difficult to exploit data locality effectively. Different techniques have been proposed in the literature to utilize the cache hierarchy by exploiting data locality during SpMxV. In this work, we investigate two distinct frameworks for cache-aware/oblivious SpMxV: single matrix-vector multiply and multiple submatrix-vector multiplies. For the single matrix-vector multiply framework, we propose a cache-size-aware top-down row/column-reordering approach based on 1D sparse matrix partitioning, utilizing recently proposed hypergraph models of sparse matrices, and a cache-oblivious bottom-up approach based on hierarchical clustering of rows/columns with similar sparsity patterns. We also propose a column compression scheme as a preprocessing step that makes these two approaches cache-line-size aware. The multiple submatrix-vector multiplies framework depends on partitioning the matrix into multiple nonzero-disjoint submatrices. For the effective matrix-to-submatrix partitioning required in this framework, we propose a cache-size-aware top-down approach based on 2D sparse matrix partitioning, utilizing the recently proposed fine-grain hypergraph model. For this framework, we also propose a traveling salesman formulation for an effective ordering of the individual submatrix-vector multiply operations. We evaluate the validity of our models and methods on a wide range of sparse matrices. Experimental results show that the proposed methods and models outperform state-of-the-art schemes.
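
    The cache behavior targeted above is easiest to see in a plain CSR sparse matrix-vector multiply: the reads of the input vector follow the column indices of each row, so an irregular sparsity pattern scatters them across memory. The NumPy/SciPy sketch below illustrates that access pattern only; the matrix size and density are arbitrary, and it is not the reordering framework proposed in the thesis.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def spmv_csr(indptr, indices, data, x):
    """y = A @ x for a matrix A stored in CSR format.

    The reads of x follow the column indices of each row, so an
    irregular sparsity pattern scatters them across memory; reordering
    methods aim to cluster these accesses so the touched parts of x
    stay in cache.
    """
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]   # irregular read of x
        y[i] = acc
    return y

# Hypothetical test matrix: 1000 x 1000 with ~1% nonzeros.
A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)
x = np.random.rand(1000)
assert np.allclose(spmv_csr(A.indptr, A.indices, A.data, x), A @ x)
```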

    Hardware Acceleration Technologies in Computer Algebra: Challenges and Impact

    The objective of high performance computing (HPC) is to ensure that the computational power of hardware resources is well utilized to solve a problem. Various techniques are usually employed to achieve this goal. Improving algorithms to reduce the number of arithmetic operations, modifying data access or rearranging data to reduce memory traffic, optimizing code at all levels, and designing parallel algorithms to reduce span are some of the areas HPC researchers work on. In this thesis, we investigate HPC techniques for the implementation of basic routines in computer algebra targeting hardware acceleration technologies. We start with a sorting algorithm and its application to sparse matrix-vector multiplication, for which we focus on cache complexity issues. Since basic routines in computer algebra often provide a lot of fine-grain parallelism, we then turn our attention to manycore architectures, on which we consider dense polynomial and matrix operations ranging from plain to fast arithmetic. Most of these operations are combined within a bivariate system solver running entirely on a graphics processing unit (GPU).
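
    To illustrate the "plain to fast arithmetic" range mentioned above, the sketch below contrasts schoolbook (quadratic) and FFT-based (quasi-linear) dense polynomial multiplication over the reals. It is a CPU-side NumPy sketch for intuition only, not the GPU arithmetic developed in the thesis.

```python
import numpy as np

def poly_mul_plain(a, b):
    """Schoolbook product of coefficient vectors a and b: O(n^2) operations."""
    c = np.zeros(len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        c[i:i + len(b)] += ai * b
    return c

def poly_mul_fft(a, b):
    """FFT-based product: O(n log n) operations via pointwise multiplication."""
    n = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

a = np.random.rand(512)
b = np.random.rand(512)
assert np.allclose(poly_mul_plain(a, b), poly_mul_fft(a, b))
```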

    Thermal Evolution of Planetesimals and Protoplanets in the Terrestrial Planet Region: Code Optimization and Implementation on a Distributed Grid using NetSolve

    A code for asteroidal heat transfer and growth is optimized for performance. The Gauss elimination routine for the solver is replaced by a sparse matrix routine. Finite element matrix assembly operations are rewritten to reduce operations involving 3D arrays to 1D. Advantage is taken of the sparse matrix structure of finite element matrices in reducing 2D arrays to 1D. The number of vector touches is reduced to the extent possible by carrying over statements from one iteration to the next, and the number of do loops is reduced by merging several loops into one. The optimization reduced the CPU time taken to run the code from 297 sec to 0.88 sec for a matrix size of 100, an improvement of 99.70%. More importantly, the algorithm was reduced from an O(n^3) operation to an O(n) operation, so the percent time difference between the optimized and unoptimized versions grows at larger matrix sizes. At a matrix size of 100, the number of floating point operations was reduced from 2.39E+09 to 2.99E+07, an improvement of 98.75%, and the performance was increased by about 4 times, from 8.06 MFLOPS to 33.92 MFLOPS. Because of inefficiency in memory allocation, the maximum matrix size for the unoptimized code was limited to 200; this was increased to 5,000,000 for the optimized code. A version of the code was implemented on NetSolve and added to the list of problems on netsolve.cs.utk.edu. Two sample movies were generated using OpenGL to explain the scientific significance of the code. With the implementation of the optimized code, applications to address scientific problems can now be envisioned that were previously thought to be prohibitive in terms of computer time.
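
    The O(n^3)-to-O(n) reduction comes from exploiting the banded sparsity of the finite element matrices instead of running dense Gauss elimination. The SciPy sketch below shows that idea on a hypothetical tridiagonal conduction stencil; it is not the paper's actual assembly or solver code.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

n = 100
# Tridiagonal 1D conduction stencil stored as three diagonals instead of a
# dense n x n array, so both storage and solve cost scale with n.
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A_sparse = diags([off, main, off], offsets=[-1, 0, 1], format="csc")
b = np.ones(n)

x_banded = spsolve(A_sparse, b)                    # sparse/banded solve
x_dense = np.linalg.solve(A_sparse.toarray(), b)   # dense Gauss elimination, O(n^3)
assert np.allclose(x_banded, x_dense)
```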

    Solving Sparse Integer Linear Systems

    We propose a new algorithm to solve sparse linear systems of equations over the integers. This algorithm is based on a p-adic lifting technique combined with the use of block matrices with structured blocks. It achieves a sub-cubic complexity in terms of machine operations, subject to a conjecture on the effectiveness of certain sparse projections. A LinBox-based implementation of this algorithm is demonstrated, highlighting the practical benefits of this new method over the previous state of the art.
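
    For readers unfamiliar with p-adic lifting, the sketch below shows the basic Dixon-style iteration the method builds on: invert A modulo a prime p, recover the solution digit by digit, and map the digits back to integers. It is a small dense Python sketch under simplifying assumptions (integral solution, det(A) invertible mod p, no pivoting needed); the paper's algorithm adds block projections and sparse structure on top of this idea.

```python
import numpy as np

def dixon_solve(A, b, p=7919, iters=30):
    """Solve A x = b over the integers by p-adic (Dixon) lifting.

    Assumes the exact solution is integral, det(A) is invertible mod p,
    and no pivoting is needed during the mod-p elimination.
    """
    A = np.array(A, dtype=object)
    b = np.array(b, dtype=object)
    n = len(b)
    # Inverse of A modulo p via Gauss-Jordan elimination on [A | I].
    M = np.hstack([A % p, np.eye(n, dtype=object)])
    for i in range(n):
        inv = pow(int(M[i, i]), -1, p)
        M[i] = (M[i] * inv) % p
        for j in range(n):
            if j != i:
                M[j] = (M[j] - M[j, i] * M[i]) % p
    C = M[:, n:]

    x, r, pk = np.zeros(n, dtype=object), b.copy(), 1
    for _ in range(iters):
        xi = C.dot(r) % p            # next digit of the p-adic expansion
        x = x + pk * xi
        r = (r - A.dot(xi)) // p     # exact division by p: next residual
        pk *= p
    # Map digits back to (possibly negative) integers in (-pk/2, pk/2].
    return np.array([int(v) if v <= pk // 2 else int(v - pk) for v in x])

# Hypothetical 2 x 2 system whose exact solution is [5, -7].
print(dixon_solve([[3, 1], [1, 2]], [8, -9]))
```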

    Adapting the interior point method for the solution of LPs on serial, coarse grain parallel and massively parallel computers

    In this paper we describe a unified scheme for implementing an interior point method (IPM) over a range of computer architectures. In the inner iteration of the IPM, a search direction is computed using Newton's method. Computationally this involves solving a sparse symmetric positive definite (SSPD) system of equations. The choice of direct and indirect methods for the solution of this system, and the design of data structures to take advantage of serial, coarse-grain parallel and massively parallel computer architectures, are considered in detail. We put forward arguments as to why integration of the system within a sparse simplex solver is important and outline how the system is designed to achieve this integration.
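
    One standard way the SSPD system arises is the normal-equations form of the Newton step, (A Theta A') dy = r, with Theta a positive diagonal scaling. The SciPy sketch below builds and factorizes such a system; the LP data, scaling, and right-hand side are placeholders, and it shows only a direct sparse solve, not the full range of direct/indirect methods and architectures discussed in the paper.

```python
import numpy as np
from scipy.sparse import random as sparse_random, diags
from scipy.sparse.linalg import splu

# Placeholder LP data: minimize c'x subject to A x = b, x >= 0.
m, n = 50, 120
A = sparse_random(m, n, density=0.3, format="csc", random_state=1)
x = np.ones(n)             # current primal iterate (strictly interior)
s = np.ones(n)             # current dual slacks
rhs = np.random.rand(m)    # stand-in for the residual-dependent right-hand side

# Normal-equations form of the Newton step: (A Theta A') dy = rhs,
# with Theta = diag(x / s). The coefficient matrix is sparse SPD,
# so a direct sparse factorization (or an iterative method) applies.
Theta = diags(x / s)
M = (A @ Theta @ A.T).tocsc()
dy = splu(M).solve(rhs)
print(dy.shape)  # (50,)
```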

    GraphR: Accelerating Graph Processing Using ReRAM

    This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; 2) both probability calculations (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be efficiently performed by a ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engines (GEs). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars). Vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain of performing parallel operations outweighs the waste due to sparsity. The experimental results show that GRAPHR achieves a 16.01x speedup (up to 132.67x) and a 33.82x energy saving on geometric mean compared to a CPU baseline system. Compared to GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes 4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x and is 3.67x to 10.96x more energy efficient compared to a PIM-based architecture. (Accepted to HPCA 2018.)
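
    GRAPHR's key insight, that a vertex program expressible as SpMV maps onto a crossbar, can be illustrated with PageRank: each iteration is one sparse matrix-vector product plus a scalar correction. The SciPy sketch below uses a toy four-vertex graph and has nothing to do with GRAPHR's ReRAM hardware; it only shows the SpMV formulation.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy four-vertex directed graph; P[dst, src] = 1 / out_degree(src),
# so one PageRank iteration is exactly one sparse matrix-vector product.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]   # (src, dst) pairs
n = 4
out_deg = np.bincount([src for src, _ in edges], minlength=n)
rows = [dst for _, dst in edges]
cols = [src for src, _ in edges]
vals = [1.0 / out_deg[src] for src, _ in edges]
P = csr_matrix((vals, (rows, cols)), shape=(n, n))

d = 0.85                       # damping factor
rank = np.full(n, 1.0 / n)
for _ in range(50):
    rank = d * (P @ rank) + (1 - d) / n   # the SpMV at the core of the vertex program
print(rank)
```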