GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

Author: Buluc Aydin
Owens John D.
Yang Carl
Publication date: 14/11/2020

High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model. Comment: 50 pages, 14 figures, 14 tables
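
The linear-algebra view of graph traversal mentioned in this abstract can be illustrated with a small sketch. It is not GraphBLAST or GraphBLAS code: the function name `bfs_levels` and the dictionary-based adjacency structure `adj` are hypothetical, chosen only to show how a breadth-first search becomes a sequence of sparse matrix-vector products in which the frontier is a sparse input vector and the set of visited vertices acts as an output mask.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS phrased as repeated masked sparse mat-vec products.

    adj maps each vertex to a list of its out-neighbours, i.e. the nonzeros of
    one row of the adjacency matrix. Returns {vertex: depth} for every vertex
    reachable from source.
    """
    depth = {source: 0}    # also serves as the mask of outputs not to recompute
    frontier = {source}    # sparse input vector: only nonzero entries are stored
    level = 0
    while frontier:
        level += 1
        next_frontier = set()
        # Input sparsity: touch only the rows selected by the frontier.
        for u in frontier:
            for v in adj.get(u, ()):
                # Output sparsity: skip entries whose value is already known.
                if v not in depth:
                    next_frontier.add(v)
        for v in next_frontier:
            depth[v] = level
        frontier = next_frontier
    return depth
```

Calling `bfs_levels({0: [1, 2], 1: [3], 2: [3], 3: []}, 0)` visits vertex 3 once, at depth 2, because the mask removes it from later products.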

arXiv.org e-Print Archive

eScholarship - University of California

Practical Sparse Matrices in C++ with Hybrid Storage and Template-Based Expression Optimisation

Author: Curtin Ryan
Sanderson Conrad
Publication venue: 'MDPI AG'
Publication date: 01/01/2019

Despite the importance of sparse matrices in numerous fields of science, software implementations remain difficult to use for non-expert users, generally requiring the understanding of underlying details of the chosen sparse matrix storage format. In addition, to achieve good performance, several formats may need to be used in one program, requiring explicit selection and conversion between the formats. This can be both tedious and error-prone, especially for non-expert users. Motivated by these issues, we present a user-friendly and open-source sparse matrix class for the C++ language, with a high-level application programming interface deliberately similar to the widely used MATLAB language. This facilitates prototyping directly in C++ and aids the conversion of research code into production environments. The class internally uses two main approaches to achieve efficient execution: (i) a hybrid storage framework, which automatically and seamlessly switches between three underlying storage formats (compressed sparse column, Red-Black tree, coordinate list) depending on which format is best suited and/or available for specific operations, and (ii) a template-based meta-programming framework to automatically detect and optimise execution of common expression patterns. Empirical evaluations on large sparse matrices with various densities of non-zero elements demonstrate the advantages of the hybrid storage framework and the expression optimisation mechanism. Comment: extended and revised version of an earlier conference paper arXiv:1805.0338
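
To make the storage formats named in this abstract concrete, here is a minimal sketch of converting a coordinate list into compressed sparse column (CSC) form. It is written in Python for brevity and is not the implementation of the C++ class described above; the function name `coo_to_csc` and the assumption that no (row, column) pair appears twice are illustrative only.

```python
def coo_to_csc(n_cols, entries):
    """Convert a coordinate list into compressed sparse column arrays.

    entries is an iterable of (row, col, value) triples in any order, assumed
    to contain no duplicate (row, col) pairs. Returns (values, row_idx, col_ptr),
    where column c occupies positions col_ptr[c] up to col_ptr[c + 1].
    """
    # CSC stores nonzeros column by column, with rows ascending in each column.
    ordered = sorted(entries, key=lambda e: (e[1], e[0]))
    values = [v for _, _, v in ordered]
    row_idx = [r for r, _, _ in ordered]

    # Count the nonzeros per column, then prefix-sum the counts into offsets.
    col_ptr = [0] * (n_cols + 1)
    for _, c, _ in ordered:
        col_ptr[c + 1] += 1
    for c in range(n_cols):
        col_ptr[c + 1] += col_ptr[c]
    return values, row_idx, col_ptr
```

Appending a triple to a coordinate list is cheap, whereas inserting into the CSC arrays shifts every later entry; that asymmetry is one reason a program may want more than one format.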

arXiv.org e-Print Archive

Maastricht University Research Portal

Crossref

VU Research Portal

EUR Research Repository

University of Melbourne Institutional Repository

University of Queensland eSpace

Practical Sparse Matrices in C++ with Hybrid Storage and Template-Based Expression Optimisation

Author: Curtin Ryan
Sanderson Conrad
Publication venue: 'MDPI AG'
Publication date: 22/07/2019

arXiv.org e-Print Archive

University of Queensland eSpace

Efficient Compilation of a Class of Variational Forms

Author: Anders Logg
Robert C. Kirby
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2007

We investigate the compilation of general multilinear variational forms over affine simplices and prove a representation theorem for the element tensor (element stiffness matrix) as the contraction of a constant reference tensor and a geometry tensor that accounts for geometry and variable coefficients. Based on this representation theorem, we design an algorithm for efficient pretabulation of the reference tensor. The new algorithm has been implemented in the FEniCS Form Compiler (FFC) and improves on a previous loop-based implementation by several orders of magnitude, thus shortening compile-times and development cycles for users of FFC. Comment: ACM Transactions on Mathematical Software 33(3), 20 pages (2007)
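
The contraction of a reference tensor with a geometry tensor can be sketched for one specific case. The example below assumes the Laplacian bilinear form with piecewise linear basis functions on a triangle; this choice, and the names `A0`, `geometry_tensor` and `element_tensor`, are assumptions made for illustration and are not taken from FFC.

```python
# Gradients of the three linear basis functions on the reference triangle
# with vertices (0, 0), (1, 0), (0, 1), and the area of that triangle.
REF_GRADS = [(-1.0, -1.0), (1.0, 0.0), (0.0, 1.0)]
REF_AREA = 0.5

# Reference tensor A0[i][j][a][b]: independent of the mesh, tabulated once.
A0 = [[[[REF_AREA * gi[a] * gj[b] for b in range(2)] for a in range(2)]
       for gj in REF_GRADS] for gi in REF_GRADS]


def geometry_tensor(p0, p1, p2):
    """G[a][b] = |det J| * sum_g Jinv[a][g] * Jinv[b][g] for the affine map
    x = p0 + J X that takes the reference triangle to (p0, p1, p2)."""
    j00, j01 = p1[0] - p0[0], p2[0] - p0[0]
    j10, j11 = p1[1] - p0[1], p2[1] - p0[1]
    det = j00 * j11 - j01 * j10
    inv = [[j11 / det, -j01 / det], [-j10 / det, j00 / det]]
    return [[abs(det) * sum(inv[a][g] * inv[b][g] for g in range(2))
             for b in range(2)] for a in range(2)]


def element_tensor(p0, p1, p2):
    """Element stiffness matrix as the contraction of A0 with G."""
    G = geometry_tensor(p0, p1, p2)
    return [[sum(A0[i][j][a][b] * G[a][b] for a in range(2) for b in range(2))
             for j in range(3)] for i in range(3)]
```

All mesh-dependent work is confined to the small 2 x 2 tensor `G`, while the integrals over basis functions are folded into `A0` ahead of time.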

arXiv.org e-Print Archive

CiteSeerX

Crossref

Chalmers Research

Chalmers Publication Library

Faster all-pairs shortest paths via circuit complexity

Author: Aho Alfred V.
Ballard Grey
Bremner David
Kerr Leslie R.
Publication date: 21/05/2014

We present a new randomized method for computing the min-plus product (a.k.a., tropical product) of two $n \times n$ matrices, yielding a faster algorithm for solving the all-pairs shortest path problem (APSP) in dense $n$-node directed graphs with arbitrary edge weights. On the real RAM, where additions and comparisons of reals are unit cost (but all other operations have typical logarithmic cost), the algorithm runs in time $\frac{n^3}{2^{\Omega(\log n)^{1/2}}}$ and is correct with high probability. On the word RAM, the algorithm runs in $n^3/2^{\Omega(\log n)^{1/2}} + n^{2+o(1)}\log M$ time for edge weights in $([0,M] \cap {\mathbb Z})\cup\{\infty\}$. Prior algorithms used either $n^3/(\log^c n)$ time for various $c \leq 2$, or $O(M^{\alpha}n^{\beta})$ time for various $\alpha > 0$ and $\beta > 2$. The new algorithm applies a tool from circuit complexity, namely the Razborov-Smolensky polynomials for approximately representing ${\sf AC}^0[p]$ circuits, to efficiently reduce a matrix product over the $(\min,+)$ algebra to a relatively small number of rectangular matrix products over ${\mathbb F}_2$, each of which is computable using a particularly efficient method due to Coppersmith. We also give a deterministic version of the algorithm running in $n^3/2^{\log^{\delta} n}$ time for some $\delta > 0$, which utilizes the Yao-Beigel-Tarui translation of ${\sf AC}^0[m]$ circuits into "nice" depth-two circuits. Comment: 24 pages. Updated version now has slightly faster running time. To appear in ACM Symposium on Theory of Computing (STOC), 201
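
For readers unfamiliar with the min-plus product, the sketch below shows the textbook cubic-time definition and how repeated squaring of the weight matrix yields all-pairs shortest paths. It is a baseline for orientation only and is not the faster algorithm this abstract describes; the names `min_plus` and `apsp` are hypothetical, and the graph is assumed to contain no negative cycles.

```python
INF = float("inf")


def min_plus(a, b):
    """(min, +) product: c[i][j] = min over k of a[i][k] + b[k][j]."""
    inner = len(b)
    return [[min(a[i][k] + b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for i in range(len(a))]


def apsp(weights):
    """All-pairs shortest path lengths by repeated min-plus squaring.

    weights[i][j] is the weight of edge i -> j, or INF if the edge is absent.
    Assumes no negative cycles.
    """
    n = len(weights)
    # A path of zero edges from a vertex to itself always costs 0.
    dist = [[min(0, weights[i][j]) if i == j else weights[i][j]
             for j in range(n)] for i in range(n)]
    hops = 1  # dist currently covers every path with at most `hops` edges
    while hops < n - 1:
        dist = min_plus(dist, dist)
        hops *= 2
    return dist
```

Each squaring doubles the number of edges a path may use, so about $\log_2 n$ products suffice, and any speedup of the product itself carries over to APSP.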

arXiv.org e-Print Archive

Crossref