
    The Linear Algebra Mapping Problem

    We observe a disconnect between the developers and the end users of linear algebra libraries. On the one hand, the numerical linear algebra and high-performance computing communities invest significant effort in the development and optimization of highly sophisticated numerical kernels and libraries, aiming at the maximum exploitation of both the properties of the input matrices and the architectural features of the target computing platform. On the other hand, end users are progressively less likely to go through the error-prone and time-consuming process of directly using said libraries by writing their code in C or Fortran; instead, languages and libraries such as Matlab, Julia, Eigen and Armadillo, which offer a higher level of abstraction, are becoming more and more popular. Users are given the opportunity to code matrix computations with a syntax that closely resembles the mathematical description; it is then a compiler or an interpreter that internally maps the input program to lower-level kernels, as provided by libraries such as BLAS and LAPACK. Unfortunately, our experience suggests that in terms of performance, this translation is typically vastly suboptimal. In this paper, we first introduce the Linear Algebra Mapping Problem, and then investigate how effectively a benchmark of test problems is solved by popular high-level programming languages. Specifically, we consider Matlab, Octave, Julia, R, Armadillo (C++), Eigen (C++), and NumPy (Python); the benchmark is meant to test both standard compiler optimizations, such as common subexpression elimination and loop-invariant code motion, and linear algebra-specific optimizations, such as the optimal parenthesization of a matrix product and kernel selection for matrices with special properties. The aim of this study is to give concrete guidelines for the development of languages and libraries that support linear algebra computations.
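    As a concrete illustration of the kind of mapping decision this benchmark targets, the following NumPy sketch (not taken from the paper; the matrix size and timing harness are arbitrary assumptions) shows how the parenthesization of a single expression changes its cost: evaluating A @ B @ v left to right performs a full matrix-matrix product, whereas the mathematically equivalent A @ (B @ v) needs only two matrix-vector products.

        # Illustrative sketch only; sizes and timings are not from the paper.
        import time
        import numpy as np

        n = 3000
        A = np.random.rand(n, n)
        B = np.random.rand(n, n)
        v = np.random.rand(n)

        t0 = time.perf_counter()
        x_left = (A @ B) @ v        # matrix-matrix product first: O(n^3) flops
        t1 = time.perf_counter()
        x_right = A @ (B @ v)       # two matrix-vector products: O(n^2) flops
        t2 = time.perf_counter()

        print(f"(A @ B) @ v : {t1 - t0:.3f} s")
        print(f"A @ (B @ v) : {t2 - t1:.3f} s")
        print("max abs difference:", np.max(np.abs(x_left - x_right)))

    A language or library that solves the mapping problem well would pick the cheaper evaluation order automatically, without the user having to spell it out.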

    An Adaptive Solver for Systems of Linear Equations

    Computational implementations for solving systems of linear equations often rely on a one-size-fits-all approach based on the LU decomposition of dense matrices stored in column-major format. Such solvers are typically implemented with the aid of the xGESV set of functions available in the low-level LAPACK software, with the aim of reducing development time by taking advantage of well-tested routines. However, this straightforward approach does not take into account various matrix properties which can be exploited to reduce the computational effort and/or to increase numerical stability. Furthermore, direct use of LAPACK functions can be error-prone for non-expert users and results in source code that bears little resemblance to the originating mathematical expressions. We describe an adaptive solver that we have implemented inside recent versions of the high-level Armadillo C++ library for linear algebra. The solver automatically detects several common properties of a given system (banded, triangular, symmetric positive definite) and then solves the system by mapping it to the LAPACK functions best matched to the detected properties. The solver also detects poorly conditioned systems and automatically falls back to seeking a solution via singular value decomposition. We show that the adaptive solver leads to notable speedups, while also freeing the user from making direct calls to cumbersome LAPACK functions.
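    The gist of such adaptive dispatch can be sketched outside of Armadillo as well. The Python/SciPy snippet below is a rough analogue, not the library's actual implementation: the function name adaptive_solve, the property checks, and the conditioning threshold are illustrative assumptions. It detects simple structure in the coefficient matrix, routes to a specialized LAPACK-backed routine, and falls back to an SVD-based least-squares solve when the system appears poorly conditioned.

        # Rough sketch of adaptive dispatch; not the Armadillo implementation.
        import numpy as np
        from scipy import linalg

        def adaptive_solve(A, b, cond_limit=1e12):
            # (banded detection mentioned in the paper is omitted for brevity)
            if np.allclose(A, np.triu(A)):              # upper triangular
                return linalg.solve_triangular(A, b, lower=False)
            if np.allclose(A, np.tril(A)):              # lower triangular
                return linalg.solve_triangular(A, b, lower=True)
            if np.allclose(A, A.T):                     # symmetric
                try:
                    # assume_a='pos' routes to the Cholesky-based LAPACK *posv
                    return linalg.solve(A, b, assume_a='pos')
                except linalg.LinAlgError:
                    return linalg.solve(A, b, assume_a='sym')
            if np.linalg.cond(A) > cond_limit:          # poorly conditioned
                return linalg.lstsq(A, b)[0]            # SVD-based fallback
            return linalg.solve(A, b)                   # general LU (*gesv)

    The point of the sketch is the dispatch structure: cheap structural checks up front, a specialized kernel when a property is found, and a robust fallback otherwise.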

    Performance of high-order SVD approximation: reading the data twice is enough

    This talk considers the problem of calculating a low-rank tensor approximation of some large dense data. We focus on the tensor train SVD (TT-SVD), but the approach can be transferred to other low-rank tensor formats such as general tree tensor networks. In the TT-SVD algorithm, the dominant building block consists of singular value decompositions of tall-skinny matrices. Therefore, the computational performance is bound by data transfers on current hardware as long as the desired tensor ranks are sufficiently small. Based on a simple roofline performance model, we show that under reasonable assumptions the minimal runtime is of the order of reading the data twice. We present an almost optimal, distributed parallel implementation that is based on a specialized rank-preserving TSQR step. Moreover, we discuss important algorithmic details and compare our results with common implementations, which are often about 50x slower than optimal.
    References:
    Oseledets: "Tensor-Train Decomposition", SISC 2011
    Grasedyck and Hackbusch: "An Introduction to Hierarchical (H-) Rank and TT-Rank of Tensors with Examples", CMAM 2011
    Demmel et al.: "Communication Avoiding Rank Revealing QR Factorization with Column Pivoting", SIMAX 2015
    Williams et al.: "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM 2009
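    For readers unfamiliar with the algorithm, a bare-bones sequential TT-SVD can be sketched in a few lines of NumPy. The function name tt_svd and the max_rank truncation parameter below are illustrative assumptions, and none of the distributed, rank-preserving TSQR machinery from the talk is included; the sketch only shows the sequence of SVDs of matricizations that dominates the cost.

        # Naive sequential TT-SVD sketch; illustrative, not the optimized
        # implementation discussed in the talk.
        import numpy as np

        def tt_svd(tensor, max_rank):
            dims = tensor.shape
            cores = []
            rank = 1
            remainder = tensor.reshape(rank * dims[0], -1)
            for d in range(len(dims) - 1):
                # SVD of the current unfolding of the remaining data.
                U, s, Vt = np.linalg.svd(remainder, full_matrices=False)
                new_rank = min(max_rank, len(s))
                cores.append(U[:, :new_rank].reshape(rank, dims[d], new_rank))
                rank = new_rank
                remainder = (s[:new_rank, None] * Vt[:new_rank]).reshape(
                    rank * dims[d + 1], -1)
            cores.append(remainder.reshape(rank, dims[-1], 1))
            return cores

    In this naive form each SVD touches the full remaining data, which is exactly why the talk's reformulation in terms of tall-skinny factorizations and TSQR matters for reaching the "read the data twice" bound.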

    A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures

    In recent years, the field of Deep Learning has seen many disruptive and impactful advancements. Given the increasing complexity of deep neural networks, the need for efficient hardware accelerators in the design of heterogeneous HPC platforms has become more and more pressing. The design of Deep Learning accelerators requires a multidisciplinary approach, combining expertise from several areas, spanning from computer architecture to approximate computing, computational models, and machine learning algorithms. Several methodologies and tools have been proposed to design accelerators for Deep Learning, including hardware-software co-design approaches, high-level synthesis methods, specific customized compilers, and methodologies for design space exploration, modeling, and simulation. These methodologies aim to maximize the exploitable parallelism and minimize data movement to achieve high performance and energy efficiency. This survey provides a holistic review of the most influential design methodologies and EDA tools proposed in recent years to implement Deep Learning accelerators, offering the reader a wide perspective on this rapidly evolving field. In particular, this work complements the previous survey proposed by the same authors in [203], which focuses on Deep Learning hardware accelerators for heterogeneous HPC platforms.