A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization

Authors: Ghysels Pieter; Li Xiaoye S.; Napov Artem; Rouet François-Henry
Publication date: 26/06/2015

We present a distributed-memory library for computations with dense structured matrices. A matrix is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with low numerical rank. Here, we use Hierarchically Semi-Separable (HSS) representations. Such matrices appear in many applications, e.g., finite element methods and boundary element methods. Exploiting this structure allows for fast solution of linear systems and/or fast computation of matrix-vector products, which are the two main building blocks of matrix computations. The compression algorithm that we use, which computes the HSS form of an input dense matrix, relies on randomized sampling with a novel adaptive sampling mechanism. We discuss the parallelization of this algorithm and also present the parallelization of the structured matrix-vector product, structured factorization and solution routines. The efficiency of the approach is demonstrated on large problems from different academic and industrial applications, on up to 8,000 cores. This work is part of a more global effort, the STRUMPACK (STRUctured Matrices PACKage) software package for computations with sparse and dense structured matrices. Hence, although useful in their own right, the routines also represent a step in the direction of a distributed-memory sparse solver.
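The idea of compressing an off-diagonal block by randomized sampling can be illustrated with a minimal pure-Python sketch. It is not the STRUMPACK API and does not reproduce the adaptive sampling mechanism from the abstract: the number of samples is fixed, the block `B` is a synthetic stand-in of exact rank 2, and all function names (`randomized_compress`, `orthonormal_basis`, and so on) are hypothetical.

```python
import random

def matmul(A, B):
    """Dense product of two row-major list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthonormal_basis(Y, tol=1e-10):
    """Modified Gram-Schmidt on the columns of Y; near-dependent columns are dropped."""
    basis = []
    for col in transpose(Y):
        v = list(col)
        norm0 = sum(x * x for x in v) ** 0.5
        for _ in range(2):  # second pass re-orthogonalizes for stability
            for q in basis:
                dot = sum(a * b for a, b in zip(q, v))
                v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(x * x for x in v) ** 0.5
        if norm0 > 0 and norm > tol * norm0:
            basis.append([x / norm for x in v])
    return basis  # each entry is one orthonormal basis vector (a row of Q^T)

def randomized_compress(B, num_samples, rng):
    """Return (Q, C) with B ~= Q @ C, built from num_samples random samples of B."""
    n_cols = len(B[0])
    omega = [[rng.gauss(0.0, 1.0) for _ in range(num_samples)] for _ in range(n_cols)]
    Y = matmul(B, omega)        # random samples of the column space of B
    Qt = orthonormal_basis(Y)   # orthonormal basis of the sampled space
    C = matmul(Qt, B)           # coefficients of B in that basis
    return transpose(Qt), C

rng = random.Random(0)
m, n, rank = 8, 8, 2
U = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(m)]
V = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(rank)]
B = matmul(U, V)  # synthetic off-diagonal block of exact rank 2 (assumption)

Q, C = randomized_compress(B, num_samples=4, rng=rng)
approx = matmul(Q, C)
max_error = max(abs(B[i][j] - approx[i][j]) for i in range(m) for j in range(n))
detected_rank = len(C)
```

Here the block is sampled with more random vectors than its rank, and the surplus samples are discarded during orthogonalization, so `Q` and `C` together store fewer numbers than `B` whenever the rank is small relative to the block size.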

arXiv.org e-Print Archive

eScholarship - University of California

DI-fusion

Analysis of A Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards

Authors: Li Ang; Negrut Dan; Serban Radu
Publication date: 25/09/2015

We discuss an approach for solving sparse or dense banded linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$ on a Graphics Processing Unit (GPU) card. The matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ is possibly nonsymmetric and moderately large; i.e., $10000 \leq N \leq 500000$. The split and parallelize (SaP) approach seeks to partition the matrix $\mathbf{A}$ into diagonal sub-blocks $\mathbf{A}_i$, $i = 1, \ldots, P$, which are independently factored in parallel. The solution may choose to consider or to ignore the matrices that couple the diagonal sub-blocks $\mathbf{A}_i$. This approach, along with the Krylov subspace-based iterative method that it preconditions, is implemented in a solver called SaP::GPU, which is compared in terms of efficiency with three commonly used sparse direct solvers: PARDISO, SuperLU, and MUMPS. SaP::GPU, which runs entirely on the GPU except for several stages involved in preliminary row-column permutations, is robust and compares well in terms of efficiency with the aforementioned direct solvers. In a comparison against Intel's MKL, SaP::GPU also fares well when used to solve dense banded systems that are close to being diagonally dominant. SaP::GPU is publicly available and distributed as open source under a permissive BSD3 license. Comment: 38 page
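The decoupled variant described above, in which the coupling between diagonal sub-blocks is ignored, can be sketched in plain Python on a small tridiagonal system. This is an illustration under stated assumptions, not the SaP::GPU interface: the blocks are solved in a sequential loop rather than in parallel on a GPU, a simple preconditioned Richardson iteration stands in for the Krylov method, and every function name is hypothetical.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal block; lower[0] and upper[-1] do not affect the result."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def tridiag_matvec(lower, diag, upper, x):
    """Multiply the full tridiagonal matrix A by x."""
    n = len(diag)
    y = []
    for i in range(n):
        v = diag[i] * x[i]
        if i > 0:
            v += lower[i] * x[i - 1]
        if i < n - 1:
            v += upper[i] * x[i + 1]
        y.append(v)
    return y

def decoupled_block_solve(lower, diag, upper, r, num_blocks):
    """Apply M^{-1}, where M keeps only the diagonal sub-blocks A_i (coupling ignored)."""
    size = len(diag) // num_blocks  # assumes the size divides evenly
    z = []
    for p in range(num_blocks):     # each block solve is independent of the others
        s, e = p * size, (p + 1) * size
        z.extend(thomas_solve(lower[s:e], diag[s:e], upper[s:e], r[s:e]))
    return z

N, P = 12, 3
lower = [-1.0] * N
diag = [4.0] * N                    # diagonally dominant test matrix (assumption)
upper = [-1.0] * N
x_true = [float(i + 1) for i in range(N)]
b = tridiag_matvec(lower, diag, upper, x_true)

x = [0.0] * N
for _ in range(50):                 # Richardson iteration: x <- x + M^{-1} (b - A x)
    residual = [bi - yi for bi, yi in zip(b, tridiag_matvec(lower, diag, upper, x))]
    correction = decoupled_block_solve(lower, diag, upper, residual, P)
    x = [xi + ci for xi, ci in zip(x, correction)]

max_error = max(abs(xi - ti) for xi, ti in zip(x, x_true))
```

Because the test matrix is strongly diagonally dominant, dropping the coupling entries leaves a good approximation of the matrix and the iteration converges quickly. For weaker diagonal dominance the ignored coupling matters more, which is consistent with the abstract's remark about systems that are close to being diagonally dominant.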

arXiv.org e-Print Archive

CiteSeerX

Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis

Authors: Cheshmi Kazem; Dehnavi Maryam Mehri; Kamil Shoaib; Strout Michelle Mills
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 18/05/2017

Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes. The computation patterns in sparse numerical methods are guided by the input sparsity structure and the sparse algorithm itself. In many real-world simulations, the sparsity pattern changes little or not at all. Sympiler takes advantage of these properties to symbolically analyze sparse codes at compile-time and to apply inspector-guided transformations that enable applying low-level transformations to sparse codes. As a result, the Sympiler-generated code outperforms highly-optimized matrix factorization codes from commonly-used specialized libraries, obtaining average speedups over Eigen and CHOLMOD of 3.8X and 1.5X respectively. Comment: 12 page
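The separation of symbolic analysis from numeric work can be illustrated with a sparse lower-triangular solve whose right-hand side is itself sparse. The sketch below is an assumption-laden illustration, not Sympiler output or its API: Sympiler generates specialized code at compile time, whereas here the symbolic phase simply precomputes, at run time, the list of columns the solve must visit (a depth-first reachability computation on the sparsity pattern). The matrix, the compressed-column layout with the diagonal stored first in each column, and the function names are all hypothetical.

```python
def symbolic_reach(col_ptr, row_idx, rhs_nonzeros):
    """Symbolic phase: columns the solve must visit, in topological order.

    Depends only on the sparsity pattern, so it can be reused while the
    pattern stays fixed and only the numeric values change.
    """
    visited = set()
    postorder = []

    def dfs(j):
        visited.add(j)
        for k in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[k]
            if i != j and i not in visited:
                dfs(i)
        postorder.append(j)

    for j in rhs_nonzeros:
        if j not in visited:
            dfs(j)
    return postorder[::-1]

def numeric_solve(col_ptr, row_idx, values, b, reach):
    """Numeric phase: solve L x = b, touching only the precomputed columns."""
    x = list(b)
    for j in reach:
        x[j] /= values[col_ptr[j]]  # diagonal entry is stored first in column j
        for k in range(col_ptr[j] + 1, col_ptr[j + 1]):
            x[row_idx[k]] -= values[k] * x[j]
    return x

# Hypothetical 4x4 lower-triangular matrix in compressed-column form:
#   [2 0 0 0]
#   [0 1 0 0]
#   [1 0 4 0]
#   [0 3 0 5]
col_ptr = [0, 2, 4, 5, 6]
row_idx = [0, 2, 1, 3, 2, 3]
values = [2.0, 1.0, 1.0, 3.0, 4.0, 5.0]

reach = symbolic_reach(col_ptr, row_idx, rhs_nonzeros=[0])  # computed once
x1 = numeric_solve(col_ptr, row_idx, values, [4.0, 0.0, 0.0, 0.0], reach)
x2 = numeric_solve(col_ptr, row_idx, values, [2.0, 0.0, 0.0, 0.0], reach)  # reach reused
```

Only columns 0 and 2 are visited in both solves. Columns 1 and 3 are never touched, because the nonzero in the right-hand side cannot propagate to them through the pattern of the matrix.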

arXiv.org e-Print Archive

Crossref

Highly parallel sparse Cholesky factorization

Authors: Gilbert John R.; Schreiber Robert

Several fine-grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.

NASA Technical Reports Server