543 research outputs found
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called
"Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices
per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta
microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflops/s in mixed precision. In this paper, we investigate
current approaches to program NVIDIA Tensor Cores, their performances and the
precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS
GEMM. After experimenting with different approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100
GPU, seven and three times the performance in single and half precision
respectively. A WMMA implementation of batched GEMM reaches a performance of 4
Tflops/s. While precision loss due to matrix multiplication with half precision
input might be critical in many HPC applications, it can be considerably
reduced at the cost of increased computation. Our results indicate that HPC
applications using matrix multiplications can strongly benefit from using of
NVIDIA Tensor Cores.Comment: This paper has been accepted by the Eighth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES) 201
Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions
Mathematical operators whose transformation rules constitute the building
blocks of a multi-linear algebra are widely used in physics and engineering
applications where they are very often represented as tensors. In the last
century, thanks to the advances in tensor calculus, it was possible to uncover
new research fields and make remarkable progress in the existing ones, from
electromagnetism to the dynamics of fluids and from the mechanics of rigid
bodies to quantum mechanics of many atoms. By now, the formal mathematical and
geometrical properties of tensors are well defined and understood; conversely,
in the context of scientific and high-performance computing, many tensor-
related problems are still open. In this paper, we address the problem of
efficiently computing contractions among two tensors of arbitrary dimension by
using kernels from the highly optimized BLAS library. In particular, we
establish precise conditions to determine if and when GEMM, the kernel for
matrix products, can be used. Such conditions take into consideration both the
nature of the operation and the storage scheme of the tensors, and induce a
classification of the contractions into three groups. For each group, we
provide a recipe to guide the users towards the most effective use of BLAS.Comment: 27 Pages, 7 figures and additional tikz generated diagrams. Submitted
to Applied Mathematics and Computatio
Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs
Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications, such as image recognition and natural language processing. While one advantage of the CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations pose most of the execution time, multiple algorithms were and are being developed with the aim of accelerating this type of operations. However, due to the wide range of convolution parameter configurations used in the CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will be the best performing in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by the cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations from widely used the CNNs and discuss which algorithms are better suited depending on the convolution parameters for both 32 and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtain the best performance.This work was supported by the European Union's Horizon 2020 Research and Innovation Program under the Marie Sklodowska-Curie under Grant 749516, and in part by the Spanish Juan de la Cierva under Grant IJCI-2017-33511Peer ReviewedPostprint (published version
Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices
A task-based formulation of Scalable Universal Matrix Multiplication
Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is
applied to the multiplication of hierarchy-free, rank-structured matrices that
appear in the domain of quantum chemistry (QC). The novel features of our
formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and
(2) fine-grained task-based composition. These features make it tolerant of the
load imbalance due to the irregular matrix structure and eliminate all
artifactual sources of global synchronization.Scalability of iterative
computation of square-root inverse of block-rank-sparse QC matrices is
demonstrated; for full-rank (dense) matrices the performance of our SUMMA
formulation usually exceeds that of the state-of-the-art dense MM
implementations (ScaLAPACK and Cyclops Tensor Framework).Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text
overlap with arXiv:1504.0504
- …