NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called
"Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4 matrices
per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta
microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflops/s in mixed precision. In this paper, we investigate
current approaches to programming NVIDIA Tensor Cores, their performance, and the
precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS
GEMM. After experimenting with different approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100
GPU, seven and three times the performance in single and half precision
respectively. A WMMA implementation of batched GEMM reaches a performance of 4
Tflops/s. While precision loss due to matrix multiplication with half precision
input might be critical in many HPC applications, it can be considerably
reduced at the cost of increased computation. Our results indicate that HPC
applications using matrix multiplications can strongly benefit from using
NVIDIA Tensor Cores.
Comment: This paper has been accepted by the Eighth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES) 201
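The abstract's claim that precision loss from half-precision inputs can be reduced at the cost of extra computation can be illustrated with a small emulation. The sketch below is an assumption-laden illustration, not the authors' code: it rounds inputs to IEEE half precision with Python's `struct` format `"e"`, accumulates in double precision (real Tensor Cores accumulate in single precision), and uses a residual-splitting scheme in which the rounding errors of A and B are themselves stored in half precision and multiplied in additional products. All function names are hypothetical.

```python
import random
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def matmul(a, b):
    """Triple-loop product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def round_matrix(m):
    """Half-precision copy of a matrix."""
    return [[to_half(x) for x in row] for row in m]


def residual(m, m_half):
    """Half-precision copy of the rounding error m - m_half."""
    return [[to_half(x - xh) for x, xh in zip(row, row_h)]
            for row, row_h in zip(m, m_half)]


def add(*matrices):
    """Element-wise sum of several equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


def max_abs_error(c, ref):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(x - y) for rc, rr in zip(c, ref) for x, y in zip(rc, rr))


random.seed(0)
n = 8
A = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
B = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
reference = matmul(A, B)  # double-precision reference result

# One product with half-precision inputs.
A_h, B_h = round_matrix(A), round_matrix(B)
plain = matmul(A_h, B_h)

# Three extra products recover most of the lost input precision:
# A*B is approximately A_h*B_h + A_h*R_B + R_A*B_h + R_A*R_B.
R_A, R_B = residual(A, A_h), residual(B, B_h)
refined = add(plain, matmul(A_h, R_B), matmul(R_A, B_h), matmul(R_A, R_B))

err_plain = max_abs_error(plain, reference)
err_refined = max_abs_error(refined, reference)
print(f"half inputs only:      {err_plain:.3e}")
print(f"with residual products: {err_refined:.3e}")
```

The refined result costs four matrix products instead of one, which is the kind of trade-off between accuracy and computation that the abstract describes.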
Multi-Step Knowledge-Aided Iterative ESPRIT for Direction Finding
In this work, we propose a subspace-based algorithm for DOA estimation which
iteratively reduces the disturbance factors of the estimated data covariance
matrix and incorporates prior knowledge that is gradually obtained online. An
analysis of the MSE of the reshaped data covariance matrix is carried out along
with comparisons between computational complexities of the proposed and
existing algorithms. Simulations focusing on closely-spaced sources, both
uncorrelated and correlated, illustrate the improvements achieved.
Comment: 7 figures. arXiv admin note: text overlap with arXiv:1703.1052
New Algebraic Formulation of Density Functional Calculation
This article addresses a fundamental problem faced by the ab initio
community: the lack of an effective formalism for the rapid exploration and
exchange of new methods. To rectify this, we introduce a novel, basis-set
independent, matrix-based formulation of generalized density functional
theories which reduces the development, implementation, and dissemination of
new ab initio techniques to the derivation and transcription of a few lines of
algebra. This new framework enables us to concisely demystify the inner
workings of fully functional, highly efficient modern ab initio codes and to
give complete instructions for the construction of such codes for calculations
employing arbitrary basis sets. Within this framework, we also discuss in full
detail a variety of leading-edge ab initio techniques, minimization algorithms,
and highly efficient computational kernels for use with scalar as well as
shared- and distributed-memory supercomputer architectures.
Parallel computation of optimized arrays for 2-D electrical imaging surveys
Modern automatic multi-electrode survey instruments have made it possible to use non-traditional arrays to maximize the subsurface resolution from electrical imaging surveys. Previous studies have shown that one of the best methods for generating optimized arrays is to select the set of array configurations that maximizes the model resolution for a homogeneous earth model. The Sherman–Morrison Rank-1 update is used to calculate the change in the model resolution when a new array is added to a selected set of array configurations. This method had the disadvantage that it required several hours of computer time even for short 2-D survey lines. The algorithm was modified to calculate the change in the model resolution rather than the entire resolution matrix. This reduces the computer time and memory required as well as the computational round-off errors. The matrix–vector multiplications for a single add-on array were replaced with matrix–matrix multiplications for 28 add-on arrays to further reduce the computer time. The temporary variables were stored in the double-precision Single Instruction Multiple Data (SIMD) registers within the CPU to minimize computer memory access. A further reduction in the computer time is achieved by using the computer graphics card Graphics Processor Unit (GPU) as a highly parallel mathematical coprocessor. This makes it possible to carry out the calculations for 512 add-on arrays in parallel using the GPU. The changes reduce the computer time by more than two orders of magnitude. The algorithm used to generate an optimized data set adds a specified number of new array configurations after each iteration to the existing set. The resolution of the optimized data set can be increased by adding a smaller number of new array configurations after each iteration. 
Although this increases the computer time required to generate an optimized data set with the same number of data points, the new fast numerical routines have made this practical on commonly available microcomputers.
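The Sherman–Morrison rank-1 update mentioned in the abstract can be sketched in generic form. Given the inverse of a matrix A and vectors u and v, it yields the inverse of A + u v^T without a fresh inversion: (A + u v^T)^-1 = A^-1 - (A^-1 u)(v^T A^-1) / (1 + v^T A^-1 u). The code below is a minimal pure-Python illustration of that identity only; how the update is wired into the model resolution calculation is not specified in the abstract, and the function names are hypothetical.

```python
def mat_vec(m, x):
    """Matrix times column vector."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def vec_mat(x, m):
    """Row vector times matrix."""
    return [sum(x[i] * m[i][j] for i in range(len(x)))
            for j in range(len(m[0]))]


def sherman_morrison(a_inv, u, v):
    """Return the inverse of (A + u v^T), given a_inv = A^-1."""
    a_inv_u = mat_vec(a_inv, u)   # A^-1 u
    v_a_inv = vec_mat(v, a_inv)   # v^T A^-1
    denom = 1.0 + sum(vi * wi for vi, wi in zip(v, a_inv_u))
    if abs(denom) < 1e-12:
        raise ValueError("update makes the matrix singular or nearly singular")
    n = len(a_inv)
    return [[a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / denom for j in range(n)]
            for i in range(n)]


# Small worked example: A = diag(2, 4), so A^-1 = diag(0.5, 0.25).
A_inv = [[0.5, 0.0], [0.0, 0.25]]
u = [1.0, 2.0]
v = [3.0, 1.0]

# A + u v^T = [[5, 1], [6, 6]], whose exact inverse is (1/24) * [[6, -1], [-6, 5]].
updated_inv = sherman_morrison(A_inv, u, v)
print(updated_inv)
```

Each update costs on the order of n^2 operations rather than the n^3 of a full inversion, which is why a rank-1 formula is attractive when candidate arrays are tested one at a time.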