The Linear Algebra Mapping Problem
We observe a disconnect between the developers and the end users of linear
algebra libraries. On the one hand, the numerical linear algebra and
high-performance computing communities invest significant effort in the development and
optimization of highly sophisticated numerical kernels and libraries, aiming at
the maximum exploitation of both the properties of the input matrices, and the
architectural features of the target computing platform. On the other hand, end
users are progressively less likely to go through the error-prone and
time-consuming process of directly using said libraries by writing their code in C
or Fortran; instead, languages and libraries such as Matlab, Julia, Eigen and
Armadillo, which offer a higher level of abstraction, are becoming more and
more popular. Users are given the opportunity to code matrix computations with
a syntax that closely resembles the mathematical description; it is then a
compiler or an interpreter that internally maps the input program to lower
level kernels, as provided by libraries such as BLAS and LAPACK. Unfortunately,
our experience suggests that in terms of performance, this translation is
typically vastly suboptimal.
In this paper, we first introduce the Linear Algebra Mapping Problem, and
then investigate how effectively a benchmark of test problems is solved by
popular high-level programming languages. Specifically, we consider Matlab,
Octave, Julia, R, Armadillo (C++), Eigen (C++), and NumPy (Python); the
benchmark is meant to test both standard compiler optimizations, such as common
subexpression elimination and loop-invariant code motion, and linear-algebra-specific
optimizations, such as the optimal parenthesization of a matrix
product and kernel selection for matrices with properties. The aim of this
study is to give concrete guidelines for the development of languages and
libraries that support linear algebra computations.
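To make the parenthesization aspect concrete, here is a small, hypothetical NumPy example (not taken from the paper's benchmark): the `@` operator associates left to right, so a chain like A @ B @ x forms the full matrix-matrix product first, whereas the mathematically equivalent A @ (B @ x) needs only two matrix-vector products; `np.linalg.multi_dot` chooses the association order automatically.

```python
# Hypothetical illustration of the parenthesization problem; not part of
# the paper's benchmark suite.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# Left-to-right association: the O(n^3) product A @ B is formed explicitly,
# even though the final result is just a vector.
y_slow = (A @ B) @ x

# Optimal association: two O(n^2) matrix-vector products.
y_fast = A @ (B @ x)

# multi_dot picks the association order via the matrix-chain cost model.
y_auto = np.linalg.multi_dot([A, B, x])

assert np.allclose(y_slow, y_fast) and np.allclose(y_slow, y_auto)
```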
Efficiently Mapping Linear Algebra to High-Performance Code
Aware of the role that linear algebra plays in scientific applications, we investigate if and how matrix expressions can be efficiently evaluated with current high-level languages. On the one hand, the numerical linear algebra community has put a lot of effort into developing and optimizing a relatively small set of “universally” useful operations. These are packaged in libraries such as BLAS and LAPACK, and serve as building blocks for more complex computations. On the other hand, the linear algebra expressions that arise in many domains are significantly more complex than those building blocks. We refer to the problem of expressing a linear algebra expression in terms of a set of available building blocks as the “Linear Algebra Mapping Problem” (LAMP). In practice, users have two alternatives to solve a given LAMP: 1) “manually”, by using C/C++ or Fortran in combination with explicit calls to BLAS and LAPACK, or 2) “automatically”, by using one of the high-level languages (or libraries) with an API that directly captures the expressions. In this presentation, we focus only on the latter. Specifically, we consider six languages (or libraries): Matlab, Julia, R, NumPy (Python), Eigen (C++), and Armadillo (C++), and carefully assess how effectively they translate linear algebra expressions to code, i.e., how well they solve LAMPs. We investigate a number of aspects that are critical for the efficient solution of a LAMP. These range from the most basic mapping problem (“Given the expression A*B, does the language map it to a call to GEMM?”), to the optimal parenthesization, to the exploitation of properties, to the identification and elimination (if advantageous) of common subexpressions, and more. Ultimately, the purpose of this study is to exhibit the core challenges related to the effective computation of linear algebra expressions, and to help the development of languages and libraries.
Texas Advanced Computing Center (TACC)
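As a concrete, hypothetical illustration of the common-subexpression aspect (not part of the presented benchmark): whether a language eliminates a repeated product is exactly the kind of question this study asks. In plain NumPy, evaluation is eager, so a product written twice is computed twice and must be hoisted by hand:

```python
# Hypothetical illustration of common subexpression elimination; not part
# of the presented benchmark.
import numpy as np

rng = np.random.default_rng(1)
n = 1500
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Written naively, the product A @ B is evaluated twice: NumPy's eager
# evaluation does not eliminate the repeated subexpression.
C_naive = (A @ B) + (A @ B).T

# Hoisting the shared subexpression by hand halves the dominant GEMM cost.
AB = A @ B
C_hoisted = AB + AB.T

assert np.allclose(C_naive, C_hoisted)
```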
An Adaptive Solver for Systems of Linear Equations
Computational implementations for solving systems of linear equations often
rely on a one-size-fits-all approach based on LU decomposition of dense
matrices stored in column-major format. Such solvers are typically implemented
with the aid of the xGESV set of functions available in the low-level LAPACK
software, with the aim of reducing development time by taking advantage of
well-tested routines. However, this straightforward approach does not take into
account various matrix properties which can be exploited to reduce the
computational effort and/or to increase numerical stability. Furthermore,
direct use of LAPACK functions can be error-prone for non-expert users and
results in source code that has little resemblance to the originating mathematical
expressions. We describe an adaptive solver that we have implemented inside
recent versions of the high-level Armadillo C++ library for linear algebra. The
solver automatically detects several common properties of a given system
(banded, triangular, symmetric positive definite), followed by solving the
system via mapping to a set of suitable LAPACK functions best matched to each
property. The solver also detects poorly conditioned systems and automatically
seeks a solution via singular value decomposition as a fallback. We show that
the adaptive solver leads to notable speedups, while also freeing the user from
direct calls to cumbersome LAPACK functions.
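The dispatch logic described above can be sketched compactly. The following Python/SciPy fragment is only an illustration of the idea under simplified assumptions; the actual solver is implemented in C++ inside Armadillo, detects more structure (e.g., banded systems), and uses its own detection and fallback heuristics:

```python
# Illustrative sketch of an adaptive dense solver; this is not Armadillo's
# API, just the idea expressed with SciPy's LAPACK-backed routines.
import numpy as np
from scipy import linalg

def adaptive_solve(A, b):
    # Triangular systems: back/forward substitution instead of a full LU.
    if np.allclose(A, np.triu(A)):
        return linalg.solve_triangular(A, b, lower=False)
    if np.allclose(A, np.tril(A)):
        return linalg.solve_triangular(A, b, lower=True)

    # Symmetric matrices: try a Cholesky-based solve (the xPOSV-style path);
    # the factorization fails if the matrix is not positive definite.
    if np.allclose(A, A.T):
        try:
            c, low = linalg.cho_factor(A)
            return linalg.cho_solve((c, low), b)
        except linalg.LinAlgError:
            pass

    # Poorly conditioned systems: fall back to an SVD-based least-squares
    # solution. (A real implementation would use a cheap condition estimate,
    # e.g. LAPACK's xGECON, rather than computing the condition number.)
    if np.linalg.cond(A) > 1.0 / np.finfo(A.dtype).eps:
        return linalg.lstsq(A, b, lapack_driver='gelsd')[0]

    # General case: LU-based solve (the xGESV path).
    return linalg.solve(A, b)
```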
Performance of high-order SVD approximation: reading the data twice is enough
This talk considers the problem of calculating a low-rank tensor approximation of some large dense data.
We focus on the tensor train SVD (TT-SVD) but the approach can be transferred to other low-rank tensor formats such as general tree tensor networks.
In the TT-SVD algorithm, the dominant building block consists of singular value decompositions of tall-skinny matrices.
Therefore, the computational performance is bound by data transfers on current hardware as long as the desired tensor ranks are sufficiently small.
Based on a simple roofline performance model, we show that, under reasonable assumptions, the minimal runtime is on the order of the time needed to read the data twice.
We present an almost optimal, distributed parallel implementation that is based on a specialized rank-preserving TSQR step.
Moreover, we discuss important algorithmic details and compare our results with common implementations that are often about 50x slower than optimal.
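For context, the classical sequential TT-SVD (Oseledets 2011) is a loop of reshapes and truncated SVDs of matrix unfoldings. The minimal NumPy sketch below, with a fixed maximum rank, shows that structure; it does not show the distributed, rank-preserving TSQR formulation that this talk contributes:

```python
# Minimal sequential TT-SVD sketch with a fixed maximum rank r_max; the
# distributed, TSQR-based variant from the talk is not shown here.
import numpy as np

def tt_svd(tensor, r_max):
    dims = tensor.shape
    cores = []
    rank = 1
    mat = tensor.reshape(dims[0], -1)  # first unfolding
    for k in range(len(dims) - 1):
        # The dominant building block: an SVD of the current unfolding.
        U, s, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(r_max, len(s))
        cores.append(U[:, :r].reshape(rank, dims[k], r))
        rank = r
        # Push the truncated factors to the right and re-unfold.
        mat = (s[:r, None] * Vt[:r]).reshape(rank * dims[k + 1], -1)
    cores.append(mat.reshape(rank, dims[-1], 1))
    return cores

# Usage: decompose a small 4-way tensor and inspect the core shapes.
T = np.random.default_rng(2).standard_normal((8, 9, 10, 11))
print([c.shape for c in tt_svd(T, r_max=5)])
```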
References:
Oseledets: "Tensor-Train Decomposition", SISC 2011
Grasedyck and Hackbusch: "An Introduction to Hierarchical (H-) Rank and TT-Rank of Tensors with Examples", CMAM 2011
Demmel et al.: "Communication Avoiding Rank Revealing QR Factorization with Column Pivoting", SIMAX 2015
Williams et al.: "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM 2009
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
In recent years, the field of Deep Learning has seen many disruptive and
impactful advancements. Given the increasing complexity of deep neural
networks, the need for efficient hardware accelerators in the design of
heterogeneous HPC platforms has become increasingly pressing. The design of Deep Learning
accelerators requires a multidisciplinary approach, combining expertise from
several areas, spanning from computer architecture to approximate computing,
computational models, and machine learning algorithms. Several methodologies
and tools have been proposed to design accelerators for Deep Learning,
including hardware-software co-design approaches, high-level synthesis methods,
specific customized compilers, and methodologies for design space exploration,
modeling, and simulation. These methodologies aim to maximize the exploitable
parallelism and minimize data movement to achieve high performance and energy
efficiency. This survey provides a holistic review of the most influential
design methodologies and EDA tools proposed in recent years to implement Deep
Learning accelerators, offering the reader a wide perspective in this rapidly
evolving field. In particular, this work complements the previous survey
proposed by the same authors in [203], which focuses on Deep Learning hardware
accelerators for heterogeneous HPC platforms.