Enabling Massive Deep Neural Networks with the GraphBLAS
Deep Neural Networks (DNNs) have emerged as a core tool for machine learning.
The computations performed during DNN training and inference are dominated by
operations on the weight matrices describing the DNN. As DNNs incorporate more
stages and more nodes per stage, memory limitations may require these weight
matrices to be sparse. The GraphBLAS.org math library standard
was developed to provide high performance manipulation of sparse weight
matrices and input/output vectors. For sufficiently sparse matrices, a sparse
matrix library requires significantly less memory than the corresponding dense
matrix implementation. This paper provides a brief description of the
mathematics underlying the GraphBLAS. In addition, the equations of a typical
DNN are rewritten in a form designed to use the GraphBLAS. An implementation of
the DNN is given using a preliminary GraphBLAS C library. The performance of
the GraphBLAS implementation is measured relative to a standard dense linear
algebra library implementation. For various sizes of DNN weight matrices, it is
shown that the GraphBLAS sparse implementation outperforms a BLAS dense
implementation as the weight matrix becomes sparser.
Comment: 10 pages, 7 figures, to appear in the 2017 IEEE High Performance Extreme Computing (HPEC) conference.
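The rewritten layer equation is simply y_{k+1} = h(W_k y_k + b_k), with h a pointwise nonlinearity such as a rectified linear unit, so each inference stage reduces to one sparse matrix-vector product. Below is a minimal sketch of that forward pass using scipy.sparse as a stand-in for the paper's preliminary GraphBLAS C library; the network sizes, density, and function names are illustrative assumptions, not the paper's benchmark configuration.

```python
import numpy as np
import scipy.sparse as sp

def relu(y):
    # h(y) = max(y, 0): the pointwise nonlinearity between stages
    return np.maximum(y, 0.0)

def forward(weights, biases, y0):
    """Sparse DNN inference, y_{k+1} = relu(W_k @ y_k + b_k).
    Each stage is one sparse matrix-vector product, the operation
    the GraphBLAS standard is designed to accelerate."""
    y = y0
    for W, b in zip(weights, biases):
        y = relu(W @ y + b)
    return y

# Hypothetical network: 4 stages of 10,000 nodes with weight
# matrices at 0.1% density (sizes chosen for illustration only).
rng = np.random.default_rng(0)
n, stages, density = 10_000, 4, 0.001
weights = [sp.random(n, n, density=density, format='csr',
                     random_state=rng) for _ in range(stages)]
biases = [np.zeros(n) for _ in range(stages)]
y = forward(weights, biases, rng.random(n))

# CSR keeps only the ~n*n*0.001 nonzeros (values plus indices),
# versus the ~800 MB a dense float64 weight matrix would need.
```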
A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices
We present the submatrix method, a highly parallelizable method for the
approximate calculation of inverse p-th roots of large sparse symmetric
matrices which are required in different scientific applications. We follow the
idea of Approximate Computing, allowing imprecision in the final result in
order to be able to utilize the sparsity of the input matrix and to allow
massively parallel execution. For an n x n matrix, the proposed algorithm
distributes the calculations over n nodes with little
communication overhead. The approximate result matrix exhibits the same
sparsity pattern as the input matrix, allowing for efficient reuse of allocated
data structures.
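As a concrete (serial) illustration of the idea, the sketch below builds, for each column i, the principal submatrix induced by that column's nonzero pattern, computes its dense inverse p-th root by symmetric eigendecomposition, and copies back only the column belonging to i, so the result keeps the input's sparsity pattern. The function name and the eigendecomposition route are assumptions for illustration; the paper's implementation parallelizes this per-column work across nodes.

```python
import numpy as np
import scipy.sparse as sp

def submatrix_inverse_proot(A, p):
    """Approximate A^(-1/p) of a sparse symmetric positive definite
    matrix via the submatrix method (serial sketch; the paper
    distributes the per-column work over n nodes)."""
    A = sp.csc_matrix(A)
    A.sort_indices()
    n = A.shape[0]
    result = sp.lil_matrix((n, n))
    for i in range(n):
        # Nonzero rows of column i induce the principal submatrix.
        idx = A.indices[A.indptr[i]:A.indptr[i + 1]]
        sub = A[idx, :][:, idx].toarray()
        # Dense inverse p-th root via symmetric eigendecomposition
        # (assumes the submatrix stays positive definite).
        w, V = np.linalg.eigh(sub)
        sub_root = (V * w ** (-1.0 / p)) @ V.T
        # Copy back only the column belonging to i, so the result
        # keeps the sparsity pattern of the input matrix.
        result[idx, i] = sub_root[:, np.searchsorted(idx, i)]
    return result.tocsc()
```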
We evaluate the algorithm with respect to the error that it introduces into
calculated results, as well as its performance and scalability. We demonstrate
that the error is relatively limited for well-conditioned matrices and that
results are still valuable for error-resilient applications like
preconditioning even for ill-conditioned matrices. We discuss the execution
time and scaling of the algorithm on a theoretical level and present a
distributed implementation of the algorithm using MPI and OpenMP. We
demonstrate the scalability of this implementation by running it on a
high-performance compute cluster comprised of 1024 CPU cores, showing a speedup
of 665x compared to single-threaded execution.
On large-scale diagonalization techniques for the Anderson model of localization
We propose efficient preconditioning algorithms for an eigenvalue problem arising in quantum physics, namely the computation of a few interior eigenvalues and their associated eigenvectors for large-scale sparse real and symmetric indefinite matrices of the Anderson model
of localization. We compare the Lanczos algorithm in the 1987 implementation by Cullum and Willoughby with the shift-and-invert techniques in the implicitly restarted Lanczos method and in the Jacobi–Davidson method. Our preconditioning approaches for the shift-and-invert symmetric indefinite linear system are based on maximum weighted matchings and algebraic multilevel incomplete LDL^T factorizations. These techniques can be seen as a complement to the alternative idea of using more complete pivoting techniques for the highly ill-conditioned symmetric indefinite Anderson matrices. We demonstrate the effectiveness and the numerical accuracy of these algorithms. Our numerical examples reveal that recent algebraic multilevel preconditioning solvers can accelerate the computation of a large-scale eigenvalue problem corresponding to the Anderson model of localization by several orders of magnitude.
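As a point of reference for the shift-and-invert idea, the sketch below computes a few interior eigenpairs of a small one-dimensional Anderson-type matrix with SciPy; eigsh with a sigma argument factorizes the shifted matrix internally and returns the eigenvalues nearest the shift. The one-dimensional model, size, and disorder strength are illustrative assumptions; the paper treats far larger (three-dimensional) Anderson matrices with the preconditioned solvers described above.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def anderson_1d(n, w, seed=0):
    """1D Anderson-type matrix: nearest-neighbour hopping plus
    on-site disorder drawn uniformly from [-w/2, w/2]."""
    rng = np.random.default_rng(seed)
    diag = rng.uniform(-w / 2, w / 2, n)
    hop = -np.ones(n - 1)
    return sp.diags([hop, diag, hop], [-1, 0, 1], format='csc')

H = anderson_1d(10_000, w=16.5)

# Shift-and-invert: eigsh factorizes (H - sigma*I) and returns the
# k eigenpairs closest to sigma, i.e. the interior eigenvalues.
vals, vecs = eigsh(H, k=5, sigma=0.0, which='LM')
print(np.sort(vals))
```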
Inverse, forward and other dynamic computations computationally optimized with sparse matrix factorizations
We propose an algorithm to compute the dynamics of articulated rigid-bodies
with different sensor distributions. Prior to the on-line computations, the
proposed algorithm performs an off-line optimisation step to reduce the
computational complexity of the underlying solution. This optimisation step
consists of formulating the dynamic computations as a system of linear
equations. The cost of computing the associated solution is then reduced by
performing a permuted LU-factorisation with off-line optimised
permutations. We apply our algorithm to solve classical dynamic problems:
inverse and forward dynamics. The computational complexity of the proposed
solution is compared to 'gold standard' algorithms: the recursive Newton-Euler
and articulated-body algorithms. It is shown that our algorithm reduces the number
of floating point operations with respect to previous approaches. We also
evaluate the numerical complexity of our algorithm by performing tests on
dynamic computations for which no gold standard is available.
Comment: 8 pages, 2 figures, conference RCAR 201
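The off-line/on-line split can be mimicked with a generic sparse LU factorization: factor the system matrix once under a fill-reducing permutation, then reuse the triangular factors for every on-line solve. The sketch below uses SciPy's SuperLU wrapper with a COLAMD column permutation on a hypothetical random system matrix; the paper instead optimizes the permutations off-line for the specific structure of the dynamics equations.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Hypothetical stand-in for the linear system A x = b that collects
# the dynamics equations (x would gather accelerations, forces and
# torques; here A is just a random sparse nonsingular matrix).
rng = np.random.default_rng(1)
n = 200
A = sp.random(n, n, density=0.05, random_state=rng, format='csc') + sp.eye(n)
b = rng.standard_normal(n)

# Off-line step: factorize once with a fill-reducing column
# permutation, playing the role of the off-line optimised
# permutations in the paper.
lu = splu(A.tocsc(), permc_spec='COLAMD')

# On-line step: each new right-hand side (e.g. fresh sensor
# readings) costs only two triangular solves.
x = lu.solve(b)
print(np.linalg.norm(A @ x - b))  # residual check
```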