
    NVIDIA Tensor Core Programmability, Performance & Precision

    The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library built on WMMA; and cuBLAS GEMM. After experimenting with the different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While the precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from NVIDIA Tensor Cores.
    Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2018.
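
    The abstract names the CUDA WMMA API as one of the three programming paths. As a minimal, hedged sketch of that API (a single warp computing one 16x16x16 tile C = A*B with FP16 inputs and an FP32 accumulator on a Volta-or-newer GPU; pointer names and tile layout are illustrative assumptions, not taken from the paper):

        // One warp multiplies a pair of 16x16 FP16 tiles, accumulating in FP32.
        #include <mma.h>
        using namespace nvcuda;

        __global__ void wmma_tile(const half *a, const half *b, float *c) {
          // Fragments: FP16 operands, FP32 accumulator (mixed precision).
          wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
          wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
          wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

          wmma::fill_fragment(c_frag, 0.0f);       // C := 0
          wmma::load_matrix_sync(a_frag, a, 16);   // load A tile, leading dim 16
          wmma::load_matrix_sync(b_frag, b, 16);   // load B tile
          wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A*B on Tensor Cores
          wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
        }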

    CLBlast: A Tuned OpenCL BLAS Library

    This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra on a wide variety of devices. It is targeted at machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM) to accelerate the core of many applications (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has five main advantages over other OpenCL BLAS libraries: 1) it is optimized for and tested on a large variety of OpenCL devices, including less commonly used devices such as embedded and low-power GPUs; 2) it can be explicitly tuned for specific problem sizes on specific hardware platforms; 3) it can perform operations in half-precision floating point (FP16), saving bandwidth, time and energy; 4) it has an optional CUDA back-end; and 5) it can combine multiple operations in a single batched routine, accelerating smaller problems significantly. This paper describes the library and demonstrates the advantages of CLBlast experimentally for different use-cases on a wide variety of OpenCL hardware.
    Comment: Conference paper in: IWOCL '18, the International Workshop on OpenCL.
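
    As a hedged illustration of how the tuned GEMM routine is invoked, the sketch below follows the templated clblast::Gemm entry point from the public clblast.h header; buffer setup is abbreviated, error handling is omitted, and the wrapper name and sizes are illustrative assumptions:

        // C = A * B for row-major m x k and k x n OpenCL buffers.
        #include <clblast.h>
        #include <CL/cl.h>

        void sgemm_clblast(cl_command_queue queue,
                           cl_mem A, cl_mem B, cl_mem C,
                           size_t m, size_t n, size_t k) {
          // alpha = 1, beta = 0, no transposes; CLBlast dispatches the kernel
          // variant previously tuned for this device (and, optionally, for
          // this specific problem size).
          clblast::Gemm<float>(clblast::Layout::kRowMajor,
                               clblast::Transpose::kNo, clblast::Transpose::kNo,
                               m, n, k,
                               1.0f,
                               A, 0, k,   // a_buffer, a_offset, a_ld
                               B, 0, n,   // b_buffer, b_offset, b_ld
                               0.0f,
                               C, 0, n,   // c_buffer, c_offset, c_ld
                               &queue);
        }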

    A μ-mode integrator for solving evolution equations in Kronecker form

    In this paper, we propose a μ-mode integrator for computing the solution of stiff evolution equations. The integrator is based on a d-dimensional splitting approach and uses exact (usually precomputed) one-dimensional matrix exponentials. We show that the action of the exponentials, i.e. the corresponding batched matrix-vector products, can be implemented efficiently on modern computer systems. We further explain how μ-mode products can be used to compute spectral transformations efficiently even if no fast transform is available. We illustrate the performance of the new integrator by solving three-dimensional linear and nonlinear Schrödinger equations, and we show that the μ-mode integrator can significantly outperform numerical methods well established in the field. We also discuss how to implement this integrator efficiently on both multi-core CPUs and GPUs. Finally, the numerical experiments show that using GPUs results in performance improvements between a factor of 10 and 20, depending on the problem.
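
    The batched products the abstract refers to map naturally onto batched GEMM routines. As a hedged sketch (an assumption for illustration, not taken from the paper): the 1-mode product of an n x n x n tensor U with a precomputed matrix exponential E applies E to every n x n slice of U, which a single cuBLAS strided batched GEMM call can express:

        // V(:,:,l) = E * U(:,:,l) for l = 0..n-1 (column-major slices).
        #include <cublas_v2.h>

        void mode1_product(cublasHandle_t h, const float *E,
                           const float *U, float *V, int n) {
          const float one = 1.0f, zero = 0.0f;
          cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N,
                                    n, n, n,
                                    &one,
                                    E, n, 0,                 // stride 0: reuse E
                                    U, n, (long long)n * n,  // next slice of U
                                    &zero,
                                    V, n, (long long)n * n,  // next slice of V
                                    n);                      // batch = n slices
        }

    The other mode products follow the same pattern after permuting (or re-striding) the tensor, so one integrator step reduces to a short sequence of such calls.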

    Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs

    Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms were and are being developed with the aim of accelerating this type of operation. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data-type representations, it is not straightforward to assess in advance which of the available algorithms will be the best performing in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations from widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters, for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance.
    This work was supported by the European Union's Horizon 2020 Research and Innovation Programme under Marie Sklodowska-Curie Grant 749516, and in part by the Spanish Juan de la Cierva Grant IJCI-2017-33511.
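
    Because the best algorithm depends on the convolution parameters, cuDNN exposes a query that benchmarks every applicable algorithm for a given configuration. A hedged sketch of that mechanism (descriptor setup abbreviated; the call and perf-result fields follow the public cudnn.h API, while the wrapper name and dimensions are illustrative assumptions):

        // Time all forward-convolution algorithms for one configuration.
        #include <cudnn.h>
        #include <cstdio>

        void rank_fwd_algos(cudnnHandle_t h,
                            cudnnTensorDescriptor_t x, cudnnFilterDescriptor_t w,
                            cudnnConvolutionDescriptor_t conv,
                            cudnnTensorDescriptor_t y) {
          // Permit Tensor Core math where the data types allow it (16-bit FP).
          cudnnSetConvolutionMathType(conv, CUDNN_TENSOR_OP_MATH);

          cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
          int found = 0;
          // Runs and times each applicable algorithm, sorted by runtime.
          cudnnFindConvolutionForwardAlgorithm(h, x, w, conv, y,
                                               CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                               &found, perf);
          for (int i = 0; i < found; ++i)
            printf("algo %d: %.3f ms, %zu bytes workspace\n",
                   (int)perf[i].algo, perf[i].time, perf[i].memory);
        }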

    Mixed-Precision Numerical Linear Algebra Algorithms: Integer Arithmetic Based LU Factorization and Iterative Refinement for Hermitian Eigenvalue Problem

    Mixed-precision algorithms are a class of algorithms that use low precision in part of the computation in order to save time and energy, accepting less accurate computation and communication in exchange. These algorithms usually rely on iterative refinement to improve the approximate solution obtained in low precision up to the accuracy that would result from doing all the computation in high precision. Driven by the demands of deep learning applications, hardware developments now offer several low-precision formats, including half precision (FP16), bfloat16, and operations on quantized integers, which use integers with a shared scalar to represent a set of equally spaced numbers. As new hardware architectures focus on delivering performance in these formats, mixed-precision algorithms can exploit them and outperform traditional fixed-precision algorithms. This dissertation consists of two articles. In the first article, we adapt one of the most fundamental algorithms in numerical linear algebra, LU factorization with partial pivoting, to use integer arithmetic. With the goal of obtaining a low-accuracy factorization to serve as the preconditioner of the generalized minimal residual method (GMRES) for solving systems of linear equations, the LU factorization is adapted to use two different fixed-point formats for the matrices L and U. A left-looking variant is also proposed for matrices with unbounded column growth. GMRES-based iterative refinement is shown to work on matrices with condition numbers up to 10000 with the variant that uses int16 input and an int32 accumulator for the update step. The second article targets symmetric and Hermitian eigenvalue problems. We revisit the SICE algorithm of Dongarra et al. and, by applying the Sherman-Morrison formula to the diagonally shifted tridiagonal systems, propose an updated SICE-SM algorithm. By incorporating the latest two-stage algorithms from the PLASMA and MAGMA numerical linear algebra libraries, we achieve up to a 3.6x speedup using the mixed-precision eigensolver with the blocked SICE-SM algorithm for iterative refinement, compared with full double-complex-precision solvers for cases in which only a portion of the eigenvalues and eigenvectors is requested.
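
    To make the refinement pattern concrete, here is a self-contained toy sketch (assumed for illustration, not taken from the dissertation): plain float stands in for the low-precision factorization (the dissertation uses int16/int32 fixed point), and double plays the role of the working precision in which residuals are computed:

        // Mixed-precision iterative refinement: factor once in low precision,
        // then repeatedly correct the solution using high-precision residuals.
        #include <cstdio>
        #include <cmath>
        #include <vector>

        const int N = 4;

        // Low-precision stage: LU with partial pivoting, entirely in float.
        void lu_factor(std::vector<float>& A, std::vector<int>& piv) {
          for (int k = 0; k < N; ++k) {
            int p = k;  // pick the largest pivot in column k
            for (int i = k + 1; i < N; ++i)
              if (std::fabs(A[i*N+k]) > std::fabs(A[p*N+k])) p = i;
            piv[k] = p;
            for (int j = 0; j < N; ++j) std::swap(A[k*N+j], A[p*N+j]);
            for (int i = k + 1; i < N; ++i) {
              A[i*N+k] /= A[k*N+k];  // store the multiplier in place
              for (int j = k + 1; j < N; ++j) A[i*N+j] -= A[i*N+k] * A[k*N+j];
            }
          }
        }

        // Apply the low-precision factors to a high-precision right-hand side.
        void lu_solve(const std::vector<float>& LU, const std::vector<int>& piv,
                      std::vector<double>& b) {
          for (int k = 0; k < N; ++k) std::swap(b[k], b[piv[k]]);
          for (int i = 1; i < N; ++i)            // forward solve  L y = P b
            for (int j = 0; j < i; ++j) b[i] -= LU[i*N+j] * b[j];
          for (int i = N - 1; i >= 0; --i) {     // backward solve U x = y
            for (int j = i + 1; j < N; ++j) b[i] -= LU[i*N+j] * b[j];
            b[i] /= LU[i*N+i];
          }
        }

        int main() {
          std::vector<double> A = {4,1,0,0, 1,4,1,0, 0,1,4,1, 0,0,1,4};
          std::vector<double> b = {1,2,3,4}, x(N, 0.0);
          std::vector<float> Alo(A.begin(), A.end());  // round to low precision
          std::vector<int> piv(N);
          lu_factor(Alo, piv);
          for (int it = 0; it < 5; ++it) {
            std::vector<double> r = b;                 // r = b - A x in double
            for (int i = 0; i < N; ++i)
              for (int j = 0; j < N; ++j) r[i] -= A[i*N+j] * x[j];
            lu_solve(Alo, piv, r);                     // correction d: A d ~= r
            for (int i = 0; i < N; ++i) x[i] += r[i];  // x := x + d
          }
          for (int i = 0; i < N; ++i) printf("x[%d] = %.15f\n", i, x[i]);
          return 0;
        }

    The dissertation's actual first-article algorithm replaces the float factorization above with integer arithmetic and the plain correction step with GMRES preconditioned by the integer LU factors.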