The scheduling of sparse matrix-vector multiplication on a massively parallel DAP computer
An efficient data structure is presented which supports general unstructured sparse matrix-vector multiplication on a Distributed Array of Processors (DAP). The approach seeks to reduce inter-processor data movement and organises the operations into batches of massively parallel steps using a heuristic scheduling procedure performed on the host computer.
The resulting data structure is of particular relevance to iterative schemes for solving linear systems. Performance results for matrices taken from well-known Linear Programming (LP) test problems are presented and analysed.
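The kernel being scheduled here is an ordinary sparse matrix-vector product. For orientation, a minimal sequential sketch in compressed sparse row (CSR) form is given below; the DAP-specific data structure and the batched parallel schedule described in the abstract are not reproduced, and the CSR layout is only an assumed baseline.

# Minimal sequential sparse matrix-vector multiply y = A @ x in CSR form.
# Baseline sketch only; the DAP data structure and the heuristic batching
# from the paper are not reproduced here.

def csr_spmv(row_ptr, col_idx, values, x):
    """row_ptr, col_idx, values: standard CSR arrays; x: dense vector."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):                        # one row per output entry
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # gather from x by column index
        y[i] = acc
    return y

# Example: 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values  = [2.0, 1.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]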
Cache locality exploiting methods and models for sparse matrix-vector multiplication
M.S. thesis by Kadir Akbudak, Department of Computer Engineering and Information Science, Bilkent University, Ankara, 2009.
The sparse matrix-vector multiplication (SpMxV) is an important kernel operation widely used in linear solvers. The same sparse matrix is multiplied by a dense vector repeatedly in these solvers to solve a system of linear equations. High performance gains can be obtained by taking advantage of today's deep cache hierarchies in SpMxV operations. Matrices with irregular sparsity patterns make it difficult to utilize data locality effectively in SpMxV computations. Different techniques have been proposed in the literature to utilize the cache hierarchy effectively by exploiting data locality during SpMxV. In this work, we investigate two distinct frameworks for cache-aware/cache-oblivious SpMxV: single matrix-vector multiply and multiple submatrix-vector multiplies. For the single matrix-vector multiply framework, we propose a cache-size-aware top-down row/column-reordering approach based on 1D sparse matrix partitioning that utilizes recently proposed hypergraph models of sparse matrices, and a cache-oblivious bottom-up approach based on hierarchical clustering of rows/columns with similar sparsity patterns. We also propose a column compression scheme as a preprocessing step which makes these two approaches cache-line-size aware. The multiple submatrix-vector multiplies framework depends on partitioning the matrix into multiple nonzero-disjoint submatrices. For the effective matrix-to-submatrix partitioning required in this framework, we propose a cache-size-aware top-down approach based on 2D sparse matrix partitioning that utilizes the recently proposed fine-grain hypergraph model. For this framework, we also propose a traveling salesman formulation for an effective ordering of the individual submatrix-vector multiply operations. We evaluate the validity of our models and methods on a wide range of sparse matrices. Experimental results show that the proposed methods and models outperform state-of-the-art schemes.
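The reordering methods in this thesis rest on hypergraph partitioning and hierarchical clustering, which are not reproduced here; the sketch below only illustrates the mechanical step they share, namely applying a symmetric row/column permutation to a CSR matrix before reusing an ordinary SpMxV kernel, so that rows placed next to each other touch nearby entries of the input vector. The permutation itself is a placeholder, not one produced by the proposed models.

# Apply a symmetric row/column permutation to a CSR matrix before SpMxV.
# perm[i] = new position of old row/column i; the permutation is a
# placeholder here -- in the thesis it would come from hypergraph
# partitioning or hierarchical clustering of similar sparsity patterns.

def permute_csr(row_ptr, col_idx, values, perm):
    n = len(row_ptr) - 1
    inv = [0] * n
    for old, new in enumerate(perm):
        inv[new] = old                        # inv[new] = old row placed there
    new_ptr, new_idx, new_val = [0], [], []
    for new_row in range(n):
        old_row = inv[new_row]
        for k in range(row_ptr[old_row], row_ptr[old_row + 1]):
            new_idx.append(perm[col_idx[k]])  # relabel columns consistently
            new_val.append(values[k])
        new_ptr.append(len(new_idx))
    return new_ptr, new_idx, new_val

# The permuted matrix is then fed to the same SpMxV kernel, with the input
# vector permuted the same way (x_new[perm[j]] = x_old[j]) and the output
# mapped back through the inverse permutation.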
Hardware Acceleration Technologies in Computer Algebra: Challenges and Impact
The objective of high performance computing (HPC) is to ensure that the computational power of hardware resources is well utilized to solve a problem. Various techniques are usually employed to achieve this goal. Improving algorithms to reduce the number of arithmetic operations, modifying data access patterns or rearranging data to reduce memory traffic, optimizing code at all levels, and designing parallel algorithms to reduce span are some of the areas HPC researchers work on. In this thesis, we investigate HPC techniques for implementing basic routines in computer algebra, targeting hardware acceleration technologies. We start with a sorting algorithm and its application to sparse matrix-vector multiplication, for which we focus on cache complexity issues. Since basic routines in computer algebra often provide a lot of fine-grain parallelism, we then turn our attention to manycore architectures, on which we consider dense polynomial and matrix operations ranging from plain to fast arithmetic. Most of these operations are combined within a bivariate system solver running entirely on a graphics processing unit (GPU).
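As a rough illustration of the gap between "plain" and "fast" arithmetic mentioned above, the sketch below spells out schoolbook dense polynomial multiplication, which costs O(n^2) coefficient operations; fast arithmetic replaces it with FFT-based convolution in O(n log n). The example polynomials are made up for illustration and nothing here is specific to the thesis's GPU implementation.

# Plain (schoolbook) dense polynomial multiplication: O(n^2) coefficient ops.
# Fast arithmetic replaces this with FFT-based convolution in O(n log n),
# which is the regime targeted on manycore/GPU hardware.

def poly_mul_plain(a, b):
    """a, b: coefficient lists, a[i] is the coefficient of x**i."""
    result = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            result[i + j] += ai * bj
    return result

# (2 + 3x)(1 + x + 4x^2) = 2 + 5x + 11x^2 + 12x^3
print(poly_mul_plain([2, 3], [1, 1, 4]))  # [2, 5, 11, 12]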
Thermal Evolution of Planetesimals and Protoplanets in the Terrestrial Planet Region: Code Optimization and Implementation on a Distributed Grid using NetSolve
A code for asteroidal heat transfer and growth is optimized for performance. The Gauss elimination routine for the solver is replaced by a sparse matrix routine. Finite element matrix assembly operations are rewritten to reduce operations involving 3D arrays to 1D. Advantage is taken of the sparse structure of finite element matrices in reducing 2D arrays to 1D. The number of vector touches is reduced to the extent possible by carrying over statements from one iteration to the next. The number of do loops is reduced by merging several do loops into one. The optimization reduced the CPU time taken to run the code from 297 sec to 0.88 sec for a matrix size of 100, an improvement of 99.70%. More importantly, the algorithm was reduced from an O(n^3) operation to an O(n) operation, so the relative gain over the unoptimized version grows with matrix size. At a matrix size of 100, the number of floating point operations was reduced from 2.39E+09 to 2.99E+07, an improvement of 98.75%, and the performance was increased by about 4 times, from 8.06 MFLOPS to 33.92 MFLOPS. Because of inefficiency in memory allocation, the maximum matrix size for the unoptimized code was limited to 200; this was increased to 5,000,000 for the optimized code. A version of the code was implemented on NetSolve and added to the list of problems on netsolve.cs.utk.edu. Two sample movies were generated using OpenGL to explain the scientific significance of the code. With the implementation of the optimized code, applications to address scientific problems can now be envisioned that were previously thought to be prohibitive in terms of computer time.
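The abstract does not name the sparse routine that replaced Gauss elimination. For a 1D finite element discretization (such as a radially symmetric heat transfer model) the system matrix is tridiagonal, and the Thomas algorithm is one standard way an O(n^3) dense elimination drops to O(n); the sketch below is that generic solver, not the project's actual code.

# Thomas algorithm: solves a tridiagonal system in O(n) time and memory,
# versus O(n^3) for dense Gauss elimination. Illustrative only -- the
# optimized asteroid code's actual sparse routine is not shown in the abstract.

def solve_tridiagonal(lower, diag, upper, rhs):
    """lower[i]: A[i, i-1] (lower[0] unused), diag[i]: A[i, i],
    upper[i]: A[i, i+1] (upper[-1] unused), rhs: right-hand side."""
    n = len(diag)
    c = [0.0] * n   # modified upper diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                      # forward sweep
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / m
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Example: [[2,1,0],[1,2,1],[0,1,2]] x = [4,8,8]  ->  x = [1,2,3]
print(solve_tridiagonal([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [4.0, 8.0, 8.0]))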
Solving large scale linear programming problems
The interior point method (IPM) is now well established as a computationally competitive scheme for solving very large scale linear programming problems. The leading variant of the IPM is the primal-dual predictor-corrector algorithm due to Mehrotra. The main computational effort in this algorithm is the repeated calculation and solution of a large sparse positive definite system of equations.
We describe an implementation of this algorithm for vector processors. At the heart of the implementation is a vectorized matrix multiplication and Cholesky factorization for sparse matrices.
We identify the parts where vectorization can be beneficial and discuss in detail the merits of alternative vectorization techniques. We show that the best way to utilize a vector processor is by exploiting dense computation within the sparse framework and by unrolling loop operations. We further present an extended definition of supernodes, and describe an implementation based on this new approach. We show that although this approach requires more memory, it can increase the scope of dense computation substantially without adding extra operations.
Performance results on standard industrial test problems, and a comparison between an algorithm that utilizes the extended supernodes and one that utilizes standard supernodes, are presented and discussed.
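In standard presentations of the primal-dual predictor-corrector method, the positive definite system that is repeatedly formed and factorized is the normal equations matrix A Θ Aᵀ, where Θ is a positive diagonal scaling that changes every iteration. The sketch below fixes that notation with small dense NumPy arrays; the paper's contribution is the sparse, supernode-based, vectorized version of this computation, which is not reproduced here, and the example data is made up.

# One linear-algebra step of a primal-dual interior point iteration, written
# densely for clarity. The paper vectorizes the sparse analogue of exactly
# this: form M = A * Theta * A^T and solve M dy = r via Cholesky factorization.
import numpy as np

def ipm_normal_equations_step(A, theta, r):
    """A: m x n constraint matrix, theta: length-n positive diagonal scaling
    (changes every interior point iteration), r: length-m right-hand side."""
    M = (A * theta) @ A.T          # A diag(theta) A^T, symmetric positive definite
    L = np.linalg.cholesky(M)      # in the paper: sparse, supernodal, vectorized
    y = np.linalg.solve(L, r)      # forward substitution
    return np.linalg.solve(L.T, y) # back substitution

# Tiny made-up example, not one of the LP test problems.
A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])
theta = np.array([1.0, 0.5, 2.0])
print(ipm_normal_equations_step(A, theta, np.array([1.0, 1.0])))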
Solving Sparse Integer Linear Systems
We propose a new algorithm to solve sparse linear systems of equations over the integers. This algorithm is based on a p-adic lifting technique combined with the use of block matrices with structured blocks. It achieves a sub-cubic complexity in terms of machine operations, subject to a conjecture on the effectiveness of certain sparse projections. A LinBox-based implementation of this algorithm is demonstrated, and it emphasizes the practical benefits of this new method over the previous state of the art.
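For orientation, p-adic lifting in its simplest (Dixon-style) scalar form solves A x = b over the integers by solving repeatedly modulo a prime p and accumulating the solution modulo p, p^2, p^3, and so on. The sketch below is that plain version on a toy matrix; the block projections and structured blocks that make the paper's algorithm sub-cubic are not shown, and rational reconstruction of the exact solution is omitted, so it only returns x with A x ≡ b (mod p^k).

# Dixon-style p-adic lifting for A x = b over the integers (scalar version).
# The paper's algorithm adds block projections and structured blocks to reach
# sub-cubic complexity; none of that refinement appears in this toy sketch.

def inverse_mod_p(A, p):
    """Invert an integer matrix modulo a prime p by Gauss-Jordan elimination."""
    n = len(A)
    M = [[A[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, p)
        M[col] = [v * inv % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(a - f * b) % p for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dixon_lift(A, b, p, k):
    """Return x with A x = b (mod p**k); rational reconstruction of the exact
    solution from this p-adic expansion is omitted."""
    C = inverse_mod_p(A, p)
    x, r, pk = [0] * len(b), list(b), 1
    for _ in range(k):
        y = [v % p for v in matvec(C, r)]           # solve A y = r (mod p)
        x = [xi + pk * yi for xi, yi in zip(x, y)]  # accumulate p-adic digits
        r = [(ri - ai) // p for ri, ai in zip(r, matvec(A, y))]  # exact division
        pk *= p
    return x

A, b, p, k = [[4, 1], [1, 3]], [9, 8], 5, 6
x = dixon_lift(A, b, p, k)
print([v % p**k for v in matvec(A, x)])  # [9, 8], i.e. A x = b (mod p^k)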
Adapting the interior point method for the solution of LPs on serial, coarse grain parallel and massively parallel computers
In this paper we describe a unified scheme for implementing an interior point method (IPM) over a range of computer architectures. In the inner iteration of the IPM a search direction is computed using Newton's method. Computationally this involves solving a sparse symmetric positive definite (SSPD) system of equations. The choice of direct and indirect methods for the solution of this system, and the design of data structures to take advantage of serial, coarse grain parallel and massively parallel computer architectures, are considered in detail. We put forward arguments as to why integration of the system within a sparse simplex solver is important and outline how the system is designed to achieve this integration.
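Of the two options mentioned for the SSPD system, the direct route is a sparse Cholesky factorization, while the textbook indirect route is the conjugate gradient method. The paper does not prescribe a specific iterative scheme, so the sketch below shows plain unpreconditioned conjugate gradients only as an illustration of the indirect option, on a small dense example rather than an IPM system.

# Conjugate gradient iteration for a symmetric positive definite system,
# the standard "indirect" alternative to a direct Cholesky factorization.
# Unpreconditioned and dense here purely for illustration; an IPM would work
# with the sparse matrix and, in practice, a preconditioner.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # small SPD example
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))          # approx [0.0909, 0.6364]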
GraphR: Accelerating Graph Processing Using ReRAM
This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations at low hardware and energy cost. Analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; 2) both probability calculations (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engines (GEs). The core graph computations are performed in sparse matrix format in the GEs (ReRAM crossbars). Vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain from performing parallel operations outweighs the waste due to sparsity. Experimental results show that GRAPHR achieves a 16.01x (up to 132.67x) speedup and a 33.82x energy saving (geometric mean) compared to a CPU baseline system. Compared to a GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes 4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x and is 3.67x to 10.96x more energy efficient compared to a PIM-based architecture.
Comment: Accepted to HPCA 201
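GRAPHR's key insight is that vertex programs expressible as SpMV map onto ReRAM crossbars. As a purely software-level illustration of that formulation (nothing ReRAM-specific), the sketch below writes one PageRank iteration as a sparse matrix-vector product over a toy graph; the damping factor and edge weights are the usual textbook choices, not values taken from the paper.

# One PageRank iteration written as a sparse matrix-vector product, the kind
# of vertex program GRAPHR maps onto ReRAM crossbars. Software-only toy
# example; the damping factor 0.85 is the conventional choice.

def pagerank_spmv(row_ptr, col_idx, values, ranks, damping=0.85):
    """CSR matrix M with M[i][j] = 1/outdegree(j) for each edge j -> i."""
    n = len(ranks)
    new_ranks = [(1.0 - damping) / n] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            new_ranks[i] += damping * values[k] * ranks[col_idx[k]]
    return new_ranks

# Toy 3-vertex graph with edges 0->1, 0->2, 1->2, 2->0
row_ptr = [0, 1, 2, 4]
col_idx = [2, 0, 0, 1]
values  = [1.0, 0.5, 0.5, 1.0]
ranks = [1.0 / 3] * 3
for _ in range(20):
    ranks = pagerank_spmv(row_ptr, col_idx, values, ranks)
print(ranks)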