On the Optimal Solution of Large Linear Systems
The information-based study of the optimal solution of large linear systems is initiated by studying the case of Krylov information. Among the algorithms which use Krylov information are the minimal residual, conjugate gradient, Chebyshev, and successive approximation algorithms. A "sharp" lower bound on the number of matrix-vector multiplications required to compute an ε-approximation is obtained for any orthogonally invariant class of matrices. Examples of such classes include many of practical interest, such as symmetric matrices, symmetric positive definite matrices, and matrices with bounded condition number. It is shown that the minimal residual algorithm is within at most one matrix-vector multiplication of the lower bound. A similar result is obtained for the generalized minimal residual algorithm. The lower bound is computed for certain classes of orthogonally invariant matrices. We show how the lack of certain properties (symmetry, positive definiteness) increases the lower bound. A conjecture and a number of open problems are stated.
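As a concrete illustration of the complexity measure in this abstract (not the paper's own construction), the sketch below runs a textbook conjugate gradient method, one of the Krylov-information algorithms named above, on a symmetric positive definite system and counts the matrix-vector multiplications needed to reach an ε-approximation of the residual. All names, tolerances, and test parameters are illustrative.

```python
import numpy as np

def conjugate_gradient(A, b, eps=1e-8, max_matvecs=1000):
    """Solve A x = b for symmetric positive definite A, counting the
    matrix-vector multiplications used to reach an eps-approximation
    (relative residual below eps)."""
    x = np.zeros_like(b)
    r = b.copy()               # residual for x = 0, no matvec needed
    p = r.copy()
    matvecs = 0
    while np.linalg.norm(r) > eps * np.linalg.norm(b) and matvecs < max_matvecs:
        Ap = A @ p             # the single matvec per iteration
        matvecs += 1
        alpha = (r @ r) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x, matvecs

# Example: SPD matrix with bounded condition number (kappa = 10)
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((100, 100)))
A = Q @ np.diag(np.linspace(1.0, 10.0, 100)) @ Q.T
b = rng.standard_normal(100)
x, n = conjugate_gradient(A, b)
print(f"eps-approximation reached after {n} matrix-vector products")
```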
Lightweight Diffusion Layer from the kth Root of the MDS Matrix
The Maximum Distance Separable (MDS) mapping used in cryptography deploys complex Galois field multiplications, which consume a lot of area in hardware, making it a costly primitive for lightweight cryptography. Recently, in the lightweight hash function PHOTON, a matrix denoted ‘Serial’, which requires less area for multiplication, was multiplied 4 times to achieve a lightweight MDS mapping. But no efficient method has been proposed so far to synthesize such a serial matrix or to find the number of repetitive multiplications needed for a given MDS mapping. In this paper, we first provide a generic algorithm to find a low-cost matrix which can be multiplied k times to obtain a given MDS mapping. Further, we optimize the algorithm for use in cryptography and show an explicit case study on the MDS mapping of the hash function PHOTON to obtain the ‘Serial’ matrix. The work also presents quite a few results which may be interesting for lightweight implementations.
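To make the construction concrete: a ‘serial’ matrix is a companion-style matrix that is cheap in hardware, and raising it to the kth power yields the full MDS mapping. The Python sketch below is illustrative, not the paper's synthesis algorithm; the field (GF(2^4) with reduction polynomial x^4 + x + 1) and the coefficients Serial(1, 2, 1, 4), an example often cited in the PHOTON line of work, are assumptions that should be checked against the paper. It multiplies the serial matrix 4 times and brute-force checks the branch number of the result.

```python
from functools import reduce
from itertools import product

MOD = 0b10011  # x^4 + x + 1, the assumed GF(2^4) reduction polynomial

def gf_mul(a, b):
    """Multiply two elements of GF(2^4), reducing by MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= MOD
        b >>= 1
    return r

def serial(z):
    """Companion-style 'Serial' matrix: shifted identity, last row z."""
    n = len(z)
    M = [[1 if j == i + 1 else 0 for j in range(n)] for i in range(n)]
    M[n - 1] = list(z)
    return M

def mat_mul(X, Y):
    n = len(X)
    return [[reduce(lambda s, k: s ^ gf_mul(X[i][k], Y[k][j]), range(n), 0)
             for j in range(n)] for i in range(n)]

def is_mds(M):
    """Brute-force check: an n x n matrix over GF(2^4) is MDS iff its
    branch number, min over nonzero x of wt(x) + wt(Mx), equals n + 1."""
    n = len(M)
    for x in product(range(16), repeat=n):
        if not any(x):
            continue
        y = [reduce(lambda s, k: s ^ gf_mul(M[i][k], x[k]), range(n), 0)
             for i in range(n)]
        if sum(v != 0 for v in x) + sum(v != 0 for v in y) < n + 1:
            return False
    return True

A = serial([1, 2, 1, 4])   # illustrative coefficients
M = A
for _ in range(3):         # k = 4: multiply the serial matrix 4 times
    M = mat_mul(M, A)
print("A^4 =", [[hex(v) for v in row] for row in M])
print("MDS?", is_mds(M))
```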
New Matrix Series Formulae for Matrix Exponentials and for the Solution of Linear Systems of Algebraic Equations
The solution of certain differential equations is expressed using a special type of matrix series and is directly related to the solution of general systems of algebraic equations. Efficient formulae for matrix exponentials are derived in terms of rapidly convergent series of the same type. They are essential for two new solution methods, especially beneficial for large linear systems, namely an iterative method and a method based on an exact matrix product formula. The computational complexity of these two methods is analysed, and for both of them, the number of matrix exponential-vector multiplications required for an imposed accuracy can be predetermined in terms of the system condition. The total number of arithmetic operations involved is roughly proportional to n^2, where n is the matrix dimension. The common feature of all the series in the results presented is that, starting with a first term that is already well-conditioned, each subsequent term is computed by multiplication with an even better-conditioned matrix, tending quickly to the identity matrix. This contributes substantially to the stability of the numerical computation. A very efficient method based on the numerical integration of a special kind of differential equation, applicable even to ill-conditioned systems, is also presented.
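The paper's specific series are not reproduced in the abstract; the sketch below only illustrates the underlying connection it exploits, namely that for a positive definite matrix A the solution of A x = b equals the integral of e^{-At} b over [0, ∞), so the system can be solved by repeated matrix exponential-vector multiplications with a single fixed factor e^{-hA}. Step size, horizon, and the test matrix are illustrative.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n = 50
B = rng.standard_normal((n, n))
A = B @ B.T / n + np.eye(n)        # well-conditioned SPD test matrix
b = rng.standard_normal(n)

# For positive definite A:  A^{-1} b = integral_0^inf e^{-A t} b dt.
# Integrate with the trapezoidal rule; each step costs one
# matrix exponential-vector multiplication with the fixed factor E.
h = 0.01
E = expm(-h * A)                   # computed once, reused every step
v = b.copy()
x = np.zeros(n)
for _ in range(4000):              # integrate up to t = 40
    v_next = E @ v
    x += 0.5 * h * (v + v_next)
    v = v_next

print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```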
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the "Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2018.
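The paper's own refinement technique is not spelled out in this abstract; the NumPy sketch below (CPU-only, illustrative) models Tensor Core arithmetic as float16 inputs with float32 accumulation and applies one residual-correction pass in that spirit, trading roughly three times the multiplications for a much smaller error. The helper tc_gemm and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 256
A = rng.standard_normal((m, m))
B = rng.standard_normal((m, m))
ref = A @ B                        # float64 reference result

def tc_gemm(X, Y):
    """Model of Tensor Core arithmetic: round inputs to float16,
    then multiply and accumulate in float32 (hypothetical helper)."""
    return X.astype(np.float16).astype(np.float32) @ \
           Y.astype(np.float16).astype(np.float32)

plain = tc_gemm(A, B)

# One refinement pass: carry the float16 rounding residuals of A and B
# through two extra half-precision products (A*B ~ A16*B16 + dA*B16 + A16*dB).
A16, B16 = A.astype(np.float16), B.astype(np.float16)
dA, dB = A - A16, B - B16          # rounding errors of the float16 cast
refined = plain + tc_gemm(dA, B16) + tc_gemm(A16, dB)

rel = lambda X: np.linalg.norm(X - ref) / np.linalg.norm(ref)
print(f"fp16-input GEMM error: {rel(plain):.1e}")
print(f"after refinement:      {rel(refined):.1e}")
```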