    The Linear Algebra Mapping Problem

    We observe a disconnect between the developers and the end users of linear algebra libraries. On the one hand, the numerical linear algebra and the high-performance communities invest significant effort in the development and optimization of highly sophisticated numerical kernels and libraries, aiming at the maximum exploitation of both the properties of the input matrices and the architectural features of the target computing platform. On the other hand, end users are progressively less likely to go through the error-prone and time-consuming process of directly using said libraries by writing their code in C or Fortran; instead, languages and libraries such as Matlab, Julia, Eigen and Armadillo, which offer a higher level of abstraction, are becoming more and more popular. Users are given the opportunity to code matrix computations with a syntax that closely resembles the mathematical description; it is then a compiler or an interpreter that internally maps the input program to lower-level kernels, as provided by libraries such as BLAS and LAPACK. Unfortunately, our experience suggests that in terms of performance, this translation is typically vastly suboptimal. In this paper, we first introduce the Linear Algebra Mapping Problem, and then investigate how effectively a benchmark of test problems is solved by popular high-level programming languages. Specifically, we consider Matlab, Octave, Julia, R, Armadillo (C++), Eigen (C++), and NumPy (Python); the benchmark is meant to test both standard compiler optimizations, such as common subexpression elimination and loop-invariant code motion, and linear-algebra-specific optimizations, such as optimal parenthesization of a matrix product and kernel selection for matrices with properties. The aim of this study is to give concrete guidelines for the development of languages and libraries that support linear algebra computations.
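
    To make the "optimal parenthesization of a matrix product" mentioned above concrete, the following is a minimal sketch of the textbook matrix-chain dynamic program, using only the Python standard library. It is not taken from the paper or from any of the listed languages; the function names `chain_cost` and `left_to_right_cost` are hypothetical, and the cost model (2*p*q*r FLOPs for a p x q times q x r product) is an assumption.

```python
from functools import lru_cache


def chain_cost(dims):
    """Return (min_flops, parenthesization) for a matrix chain.

    dims has length n + 1: matrix i has shape dims[i] x dims[i + 1].
    Assumed cost model: a (p x q) times (q x r) product costs 2*p*q*r FLOPs.
    """
    n = len(dims) - 1

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest way to multiply matrices i..j (inclusive).
        if i == j:
            return 0, f"M{i}"
        candidates = []
        for k in range(i, j):
            left_cost, left_expr = best(i, k)
            right_cost, right_expr = best(k + 1, j)
            join = 2 * dims[i] * dims[k + 1] * dims[j + 1]
            candidates.append(
                (left_cost + right_cost + join, f"({left_expr} {right_expr})")
            )
        return min(candidates)

    return best(0, n - 1)


def left_to_right_cost(dims):
    """FLOPs when the chain is evaluated strictly left to right."""
    rows = dims[0]
    total = 0
    for k in range(1, len(dims) - 1):
        total += 2 * rows * dims[k] * dims[k + 1]
    return total


# Hypothetical example: A (1000 x 1000) * B (1000 x 1000) * x (1000 x 1).
dims = [1000, 1000, 1000, 1]
naive = left_to_right_cost(dims)   # evaluates (A B) x
optimal, expr = chain_cost(dims)   # finds A (B x)
```

    For this example the left-to-right evaluation performs a full matrix-matrix product, whereas the reordered expression needs only two matrix-vector products, a difference of roughly a factor of 500 in the assumed FLOP count.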

    An Adaptive Solver for Systems of Linear Equations

    Computational implementations for solving systems of linear equations often rely on a one-size-fits-all approach based on LU decomposition of dense matrices stored in column-major format. Such solvers are typically implemented with the aid of the xGESV set of functions available in the low-level LAPACK software, with the aim of reducing development time by taking advantage of well-tested routines. However, this straightforward approach does not take into account various matrix properties which can be exploited to reduce the computational effort and/or to increase numerical stability. Furthermore, direct use of LAPACK functions can be error-prone for non-expert users and results in source code that bears little resemblance to the originating mathematical expressions. We describe an adaptive solver that we have implemented inside recent versions of the high-level Armadillo C++ library for linear algebra. The solver automatically detects several common properties of a given system (banded, triangular, symmetric positive definite), and then solves the system by mapping it to a set of suitable LAPACK functions best matched to each property. The solver also detects poorly conditioned systems and automatically seeks a solution via singular value decomposition as a fallback. We show that the adaptive solver leads to notable speedups, while also freeing the user from using direct calls to cumbersome LAPACK functions.
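
    The property-based dispatch described above can be illustrated with a small sketch in plain Python. This is not Armadillo's actual detection logic: the function name `choose_routine`, the order of the checks, and the `band_fraction` threshold are all assumptions made for illustration. Only the LAPACK driver families named in the returned strings (xTRTRS, xGBSV, xPOSV, xGESV) are real. The conditioning check and SVD fallback mentioned in the abstract are omitted.

```python
def choose_routine(A, band_fraction=0.25, tol=0.0):
    """Pick a LAPACK driver family for solving A x = b (illustrative only).

    A is a square matrix given as a list of rows.
    band_fraction is a hypothetical threshold: the matrix is treated as
    banded when its band (sub- plus superdiagonals plus the main diagonal)
    covers at most this fraction of the matrix order.
    """
    n = len(A)
    lower_bw = 0      # number of nonzero subdiagonals
    upper_bw = 0      # number of nonzero superdiagonals
    symmetric = True
    for i in range(n):
        for j in range(n):
            if abs(A[i][j]) > tol:
                if i > j:
                    lower_bw = max(lower_bw, i - j)
                elif j > i:
                    upper_bw = max(upper_bw, j - i)
            if abs(A[i][j] - A[j][i]) > tol:
                symmetric = False

    # Triangular: nothing above or nothing below the main diagonal.
    if lower_bw == 0 or upper_bw == 0:
        return "xTRTRS"
    # Banded: the nonzero band is narrow relative to the matrix order.
    if lower_bw + upper_bw + 1 <= band_fraction * n:
        return "xGBSV"
    # Symmetric with positive diagonal: only a *candidate* for positive
    # definiteness; a real solver would attempt a Cholesky factorization
    # and fall back to the general path if it fails.
    if symmetric and all(A[i][i] > 0 for i in range(n)):
        return "xPOSV"
    # General dense system: LU with partial pivoting.
    return "xGESV"


# Hypothetical inputs exercising each branch.
lower = [[2, 0, 0], [1, 3, 0], [4, 5, 6]]
sym = [[4, 1, 2], [1, 3, 0.5], [2, 0.5, 5]]
dense = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
tridiag = [[0.0] * 20 for _ in range(20)]
for i in range(20):
    tridiag[i][i] = 4.0
    if i > 0:
        tridiag[i][i - 1] = 1.0
    if i < 19:
        tridiag[i][i + 1] = 2.0
```

    The checks run from cheapest-to-solve to most general, so a matrix with several properties (for example, a diagonal matrix) is routed to the first matching path.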

    Performance of high-order SVD approximation: reading the data twice is enough

    This talk considers the problem of calculating a low-rank tensor approximation of some large dense data. We focus on the tensor train SVD (TT-SVD) but the approach can be transferred to other low-rank tensor formats such as general tree tensor networks. In the TT-SVD algorithm, the dominant building block consists of singular value decompositions of tall-skinny matrices. Therefore, the computational performance is bound by data transfers on current hardware as long as the desired tensor ranks are sufficiently small. Based on a simple roofline performance model we show that under reasonable assumptions the minimal runtime is of the order of reading the data twice. We present an almost optimal, distributed parallel implementation that is based on a specialized rank-preserving TSQR step. Moreover, we discuss important algorithmic details and compare our results with common implementations that are often about 50x slower than optimal.

    References:
    Oseledets: "Tensor-Train Decomposition", SISC 2011
    Grasedyck and Hackbusch: "An Introduction to Hierarchical (H-) Rank and TT-Rank of Tensors with Examples", CMAM 2011
    Demmel et al.: "Communication Avoiding Rank Revealing QR Factorization with Column Pivoting", SIMAX 2015
    Williams et al.: "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM 200
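
    The roofline argument above can be sketched in a few lines of Python. This is a generic illustration of the roofline bound and of the "reading the data twice" estimate, not the authors' performance model. The function names, the machine parameters (1 TFLOP/s peak, 100 GB/s bandwidth), and the rough count of 2*m*n^2 FLOPs for a QR factorization of an m x n tall-skinny matrix read once from memory are all assumptions.

```python
def roofline_time(flops, bytes_moved, peak_flops, bandwidth):
    """Lower bound on runtime: limited by compute or by data transfer."""
    return max(flops / peak_flops, bytes_moved / bandwidth)


def read_twice_time(n_elements, bandwidth, bytes_per_element=8):
    """Time to stream the input data through memory twice."""
    return 2 * n_elements * bytes_per_element / bandwidth


def is_memory_bound(flops, bytes_moved, peak_flops, bandwidth):
    """True when arithmetic intensity is below the machine balance."""
    return flops / bytes_moved < peak_flops / bandwidth


# Hypothetical machine: 1 TFLOP/s peak, 100 GB/s memory bandwidth.
PEAK = 1e12
BW = 1e11

# Dense tensor with 10**9 double-precision entries (8 GB).
t_twice = read_twice_time(10**9, BW)

# Tall-skinny m x n matrix, read once; rough QR cost of 2*m*n^2 FLOPs.
m = 10**7
skinny = is_memory_bound(2 * m * 16**2, 8 * m * 16, PEAK, BW)    # n = 16
wide = is_memory_bound(2 * m * 100**2, 8 * m * 100, PEAK, BW)    # n = 100
```

    Under these assumed numbers the arithmetic intensity of the tall-skinny step is n/4 FLOPs per byte, so small n (small ranks) stays below the machine balance of 10 FLOPs per byte and the step is bound by data transfer.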

    A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures

    In recent years, the field of Deep Learning has seen many disruptive and impactful advancements. Given the increasing complexity of deep neural networks, the need for efficient hardware accelerators has become increasingly pressing in the design of heterogeneous HPC platforms. The design of Deep Learning accelerators requires a multidisciplinary approach, combining expertise from several areas, ranging from computer architecture to approximate computing, computational models, and machine learning algorithms. Several methodologies and tools have been proposed to design accelerators for Deep Learning, including hardware-software co-design approaches, high-level synthesis methods, specific customized compilers, and methodologies for design space exploration, modeling, and simulation. These methodologies aim to maximize the exploitable parallelism and minimize data movement to achieve high performance and energy efficiency. This survey provides a holistic review of the most influential design methodologies and EDA tools proposed in recent years to implement Deep Learning accelerators, offering the reader a wide perspective in this rapidly evolving field. In particular, this work complements the previous survey proposed by the same authors in [203], which focuses on Deep Learning hardware accelerators for heterogeneous HPC platforms.