9,729 research outputs found

    NVIDIA Tensor Core Programmability, Performance & Precision

    The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflop/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with the different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflop/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflop/s. While the precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from NVIDIA Tensor Cores.

    Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 201
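
    The numerics can be illustrated without the hardware itself. The NumPy sketch below mimics the mixed-precision scheme (half-precision inputs, single-precision accumulation) and adds one possible residual-correction step of the kind the abstract alludes to; the matrix size, random data, and the particular correction scheme are illustrative assumptions, not code from the paper.

    # A minimal sketch of mixed-precision GEMM numerics; it does not itself
    # use Tensor Cores. Matrix size and random seed are arbitrary choices.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 512
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    ref = A @ B                                   # double-precision reference

    # Half-precision inputs, single-precision accumulation.
    A16, B16 = A.astype(np.float16), B.astype(np.float16)
    mixed = A16.astype(np.float32) @ B16.astype(np.float32)

    # One possible correction: store the rounding residuals of A and B in
    # half precision as well and add the two cross terms, trading extra
    # multiplications for accuracy.
    dA = (A - A16).astype(np.float16).astype(np.float32)
    dB = (B - B16).astype(np.float16).astype(np.float32)
    corrected = mixed + dA @ B16.astype(np.float32) + A16.astype(np.float32) @ dB

    print("max error, plain mixed precision:", np.abs(mixed - ref).max())
    print("max error, with residual terms :", np.abs(corrected - ref).max())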

    Multi-Step Knowledge-Aided Iterative ESPRIT for Direction Finding

    In this work, we propose a subspace-based algorithm for direction-of-arrival (DOA) estimation that iteratively reduces the disturbance factors of the estimated data covariance matrix and incorporates prior knowledge that is gradually obtained online. An analysis of the mean-square error (MSE) of the reshaped data covariance matrix is carried out, along with comparisons between the computational complexities of the proposed and existing algorithms. Simulations focusing on closely spaced sources, both uncorrelated and correlated, illustrate the improvements achieved.

    Comment: 7 figures. arXiv admin note: text overlap with arXiv:1703.1052
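
    For context, the sketch below implements the standard batch ESPRIT estimator that such knowledge-aided variants build on: form the sample covariance, extract the signal subspace, and solve the shift-invariance equation. The array geometry, source angles, noise level, and snapshot count are arbitrary assumptions; the paper's iterative covariance reshaping and prior-knowledge updates are not reproduced here.

    # Plain ESPRIT on a half-wavelength uniform linear array (ULA).
    import numpy as np

    rng = np.random.default_rng(1)
    m, snapshots = 10, 200                  # sensors in the ULA, time snapshots
    angles = np.deg2rad([-10.0, 12.0])      # true DOAs (assumed values)
    k = len(angles)

    # Steering matrix: element i of a(theta) is exp(j*pi*i*sin(theta)).
    A = np.exp(1j * np.pi * np.outer(np.arange(m), np.sin(angles)))
    S = rng.standard_normal((k, snapshots)) + 1j * rng.standard_normal((k, snapshots))
    N = 0.1 * (rng.standard_normal((m, snapshots)) + 1j * rng.standard_normal((m, snapshots)))
    X = A @ S + N

    # Sample covariance and its k-dimensional signal subspace.
    R = X @ X.conj().T / snapshots
    _, vecs = np.linalg.eigh(R)
    Es = vecs[:, -k:]                       # eigenvectors of the k largest eigenvalues

    # Shift invariance: Es[:-1] @ Psi ~ Es[1:] in the least-squares sense;
    # the eigenvalue phases of Psi encode the arrival angles.
    Psi, *_ = np.linalg.lstsq(Es[:-1], Es[1:], rcond=None)
    doas = np.rad2deg(np.arcsin(np.angle(np.linalg.eigvals(Psi)) / np.pi))
    print("estimated DOAs (degrees):", np.sort(doas))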

    New Algebraic Formulation of Density Functional Calculation

    Full text link
    This article addresses a fundamental problem faced by the ab initio community: the lack of an effective formalism for the rapid exploration and exchange of new methods. To rectify this, we introduce a novel, basis-set-independent, matrix-based formulation of generalized density functional theories which reduces the development, implementation, and dissemination of new ab initio techniques to the derivation and transcription of a few lines of algebra. This new framework enables us to concisely demystify the inner workings of fully functional, highly efficient modern ab initio codes and to give complete instructions for constructing such codes for calculations employing arbitrary basis sets. Within this framework, we also discuss in full detail a variety of leading-edge ab initio techniques, minimization algorithms, and highly efficient computational kernels for use with scalar as well as shared- and distributed-memory supercomputer architectures.
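
    As a rough illustration of what a matrix-based, basis-set-independent formulation looks like in practice, the sketch below runs a generic self-consistent-field loop in a non-orthogonal basis: orthogonalize with S^(-1/2), diagonalize, rebuild the density matrix, repeat. The toy density-dependent Hamiltonian, basis size, and mixing factor are invented stand-ins, not the paper's formalism.

    # Generic matrix-form SCF loop; all operators are random stand-ins.
    import numpy as np

    rng = np.random.default_rng(2)
    n, n_occ = 12, 3                        # basis functions, occupied orbitals

    # Random symmetric "core Hamiltonian" and positive-definite overlap S.
    Hcore = rng.standard_normal((n, n))
    Hcore = (Hcore + Hcore.T) / 2
    M = rng.standard_normal((n, n))
    S = M @ M.T + n * np.eye(n)

    # Loewdin orthogonalization: X = S^(-1/2), so that X.T @ S @ X = I.
    s, U = np.linalg.eigh(S)
    X = U @ np.diag(s ** -0.5) @ U.T

    P = np.zeros((n, n))                    # initial density matrix
    for it in range(200):
        H = Hcore + 0.1 * P                 # toy density-dependent Hamiltonian
        _, C = np.linalg.eigh(X.T @ H @ X)  # solves H C = S C E via the orthogonal basis
        C = X @ C                           # back-transform the orbitals
        P_new = C[:, :n_occ] @ C[:, :n_occ].T
        if np.abs(P_new - P).max() < 1e-10:
            break
        P = 0.5 * P + 0.5 * P_new           # linear mixing for stability
    print("SCF iterations:", it, "band-energy trace:", np.trace(P @ H))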

    Parallel computation of optimized arrays for 2-D electrical imaging surveys

    Modern automatic multi-electrode survey instruments have made it possible to use non-traditional arrays to maximize the subsurface resolution from electrical imaging surveys. Previous studies have shown that one of the best methods for generating optimized arrays is to select the set of array configurations that maximizes the model resolution for a homogeneous earth model. The Sherman–Morrison rank-1 update is used to calculate the change in the model resolution when a new array is added to a selected set of array configurations. This method had the disadvantage of requiring several hours of computer time even for short 2-D survey lines. The algorithm was modified to calculate only the change in the model resolution rather than the entire resolution matrix, which reduces the computer time and memory required as well as the computational round-off errors. The matrix–vector multiplications for a single add-on array were replaced with matrix–matrix multiplications for 28 add-on arrays to further reduce the computer time. Temporary variables were stored in the double-precision Single Instruction Multiple Data (SIMD) registers within the CPU to minimize memory accesses. A further reduction in the computer time is achieved by using the Graphics Processing Unit (GPU) on the graphics card as a highly parallel mathematical coprocessor, which makes it possible to carry out the calculations for 512 add-on arrays in parallel. Together, these changes reduce the computer time by more than two orders of magnitude. The algorithm used to generate an optimized data set adds a specified number of new array configurations to the existing set after each iteration. The resolution of the optimized data set can be increased by adding a smaller number of new array configurations after each iteration. Although this increases the computer time required to generate an optimized data set with the same number of data points, the new fast numerical routines have made this practical on commonly available microcomputers.
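
    The core numerical trick is the Sherman–Morrison rank-1 update: adding one candidate array appends a row g to the sensitivity matrix G, so the regularized normal matrix changes by g g^T and its inverse can be updated without refactorization. The sketch below demonstrates this update and checks it against a direct inverse; the sizes, damping factor, and random Jacobian are placeholders for a real survey geometry.

    # Rank-1 update of the inverse normal matrix for one add-on array.
    import numpy as np

    rng = np.random.default_rng(3)
    n_model, n_data = 50, 120               # model cells, arrays already selected
    damping = 0.05                          # regularization factor (assumed)

    G = rng.standard_normal((n_data, n_model))       # stand-in sensitivity matrix
    A = G.T @ G + damping * np.eye(n_model)          # regularized normal matrix
    A_inv = np.linalg.inv(A)

    g = rng.standard_normal(n_model)        # sensitivity row of one add-on array

    # Sherman-Morrison:
    # (A + g g^T)^-1 = A^-1 - (A^-1 g)(g^T A^-1) / (1 + g^T A^-1 g)
    u = A_inv @ g
    A_inv_new = A_inv - np.outer(u, u) / (1.0 + g @ u)

    # Updated model resolution matrix R = (G^T G + damping*I)^-1 G^T G and a
    # consistency check against the direct (expensive) inverse.
    R_new = A_inv_new @ (G.T @ G + np.outer(g, g))
    print("trace of updated resolution:", np.trace(R_new))
    print("max update error:",
          np.abs(A_inv_new - np.linalg.inv(A + np.outer(g, g))).max())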