
    An Error Correction Solver for Linear Systems: Evaluation of Mixed Precision Implementations


    NVIDIA Tensor Core Programmability, Performance & Precision

    The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the "Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflop/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with the different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflop/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflop/s. While the precision loss due to matrix multiplication with half-precision input may be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can benefit strongly from NVIDIA Tensor Cores. Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2018.
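
    As an illustration of the WMMA path mentioned in the abstract, the sketch below shows how a single warp can drive one Tensor Core tile through the CUDA WMMA API, with half-precision inputs and single-precision accumulation. The kernel name, tile handling, and launch configuration are illustrative assumptions, not code from the paper.

```cuda
// Minimal WMMA sketch: one warp computes a single 16x16 tile of
// C = A*B + C with half-precision inputs and float accumulation.
// Matrix size, kernel name, and launch setup are illustrative only.
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

constexpr int M = 16, N = 16, K = 16;   // one Tensor Core tile

__global__ void wmma_tile_gemm(const half *a, const half *b, float *c) {
    // Fragments live in registers and are distributed across the warp.
    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // C := 0
    wmma::load_matrix_sync(a_frag, a, K);       // leading dimension = K
    wmma::load_matrix_sync(b_frag, b, N);       // leading dimension = N
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B (mixed precision)
    wmma::store_matrix_sync(c, c_frag, N, wmma::mem_row_major);
}

// Launch with exactly one warp, e.g.:
//   wmma_tile_gemm<<<1, 32>>>(d_a, d_b, d_c);   // compile with -arch=sm_70
```

    cuBLAS GEMM and CUTLASS drive the same hardware operation at higher levels of abstraction; WMMA is the lowest-level of the three interfaces the abstract lists.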

    Mixed precision bisection

    We discuss the implementation of the bisection algorithm for the computation of the eigenvalues of symmetric tridiagonal matrices in a context of mixed precision arithmetic. This approach is motivated by the emergence of processors which carry out floating-point operations much faster in single precision than they do in double precision. Perturbation theory results are used to decide when to switch from single to double precision. Numerical examples are presented.
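
    A compact sketch of the underlying idea, assuming a standard Sturm-count bisection: the interval bracketing an eigenvalue is first narrowed in cheap single precision and only polished in double precision. The fixed interval-width threshold used for the switch below is a simplified stand-in for the perturbation-theory criterion the abstract refers to.

```cpp
// Sturm-count bisection for the eigenvalues of a symmetric tridiagonal matrix
// with diagonal d[0..n-1] and off-diagonal e[0..n-2].
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>

// Number of eigenvalues strictly less than x (negative terms of the Sturm sequence).
template <typename T>
int sturm_count(const std::vector<T>& d, const std::vector<T>& e, T x) {
    int count = 0;
    T q = d[0] - x;
    if (q < T(0)) ++count;
    for (size_t i = 1; i < d.size(); ++i) {
        T denom = (q == T(0)) ? std::numeric_limits<T>::epsilon() : q;
        q = d[i] - x - e[i - 1] * e[i - 1] / denom;
        if (q < T(0)) ++count;
    }
    return count;
}

// k-th smallest eigenvalue (0-based) inside the bracketing interval [lo, hi].
double bisect_mixed(const std::vector<double>& d, const std::vector<double>& e,
                    int k, double lo, double hi, double switch_width) {
    std::vector<float> df(d.begin(), d.end()), ef(e.begin(), e.end());
    while (hi - lo > switch_width) {                     // phase 1: single precision
        double mid = 0.5 * (lo + hi);
        if (sturm_count<float>(df, ef, (float)mid) <= k) lo = mid; else hi = mid;
    }
    double tol = 4.0 * std::numeric_limits<double>::epsilon()
                 * std::max(std::fabs(lo), std::fabs(hi));
    while (hi - lo > tol) {                              // phase 2: double precision
        double mid = 0.5 * (lo + hi);
        if (mid <= lo || mid >= hi) break;               // interval no longer splits
        if (sturm_count<double>(d, e, mid) <= k) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```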

    GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement

    In hardware-aware high performance computing, block-asynchronous iteration and mixed precision iterative refinement are two techniques that may be used to leverage the computing power of SIMD accelerators like GPUs in the iterative solution of linear equation systems. Although they take very different approaches, they share the basic idea of compensating for the weaker convergence properties of a numerically inferior algorithm through more efficient use of the available computing power. In this paper, we analyze the potential of combining both techniques. To this end, we derive a mixed precision iterative refinement algorithm that uses a block-asynchronous iteration as the error correction solver, and compare its performance with a plain block-asynchronous iteration and an iterative refinement method using double precision for the error correction solver. For matrices from the University of Florida Sparse Matrix Collection, we report the convergence behaviour and provide total solver runtimes on different GPU architectures.
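
    The sketch below illustrates the mixed precision iterative refinement structure the abstract describes: the residual and the solution update are kept in double precision, while the error correction solve runs in single precision. A few Jacobi sweeps stand in for the block-asynchronous GPU solver studied in the paper, and the code assumes a matrix (e.g. diagonally dominant) for which the inner sweeps converge.

```cpp
// Mixed precision iterative refinement: r = b - A*x and the solution update
// are computed in double precision, the error-correction solve A*c = r in
// single precision.  Plain Jacobi sweeps stand in for the block-asynchronous
// GPU solver analyzed in the paper; sizes and sweep counts are illustrative.
#include <vector>
#include <cmath>

using Mat = std::vector<std::vector<double>>;

// Inner solver: a few Jacobi sweeps on A*c = r, entirely in float.
std::vector<double> correct_sp(const Mat& A, const std::vector<double>& r, int sweeps) {
    size_t n = r.size();
    std::vector<float> c(n, 0.0f), cnew(n);
    for (int s = 0; s < sweeps; ++s) {
        for (size_t i = 0; i < n; ++i) {
            float sigma = 0.0f;
            for (size_t j = 0; j < n; ++j)
                if (j != i) sigma += (float)A[i][j] * c[j];
            cnew[i] = ((float)r[i] - sigma) / (float)A[i][i];
        }
        c.swap(cnew);
    }
    return std::vector<double>(c.begin(), c.end());
}

// Outer refinement loop in double precision.
std::vector<double> solve_ir(const Mat& A, const std::vector<double>& b,
                             double tol = 1e-12, int max_it = 50) {
    size_t n = b.size();
    std::vector<double> x(n, 0.0);
    for (int it = 0; it < max_it; ++it) {
        std::vector<double> r(n);                      // r = b - A*x (double)
        double rnorm = 0.0;
        for (size_t i = 0; i < n; ++i) {
            double ax = 0.0;
            for (size_t j = 0; j < n; ++j) ax += A[i][j] * x[j];
            r[i] = b[i] - ax;
            rnorm += r[i] * r[i];
        }
        if (std::sqrt(rnorm) < tol) break;
        std::vector<double> c = correct_sp(A, r, 20);  // error correction in float
        for (size_t i = 0; i < n; ++i) x[i] += c[i];   // update in double
    }
    return x;
}
```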

    Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs

    Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms were and are being developed with the aim of accelerating this type of operation. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will perform best in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations of widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters, for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance. This work was supported by the European Union's Horizon 2020 Research and Innovation Program under Marie Sklodowska-Curie Grant 749516, and in part by the Spanish Juan de la Cierva Grant IJCI-2017-33511.
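
    As a rough illustration of the kind of measurement behind such an evaluation, the host-side sketch below asks cuDNN to time its applicable forward-convolution algorithms for one layer configuration and prints the ranking. The layer dimensions are made-up examples, error handling is omitted, and enabling CUDNN_TENSOR_OP_MATH merely permits Tensor Core use where the data type and algorithm allow it.

```cuda
// Sketch: benchmark cuDNN's forward-convolution algorithms for one
// illustrative layer (N=32, C=64, H=W=56, K=128, 3x3 filters, FP32 NCHW).
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t x, y;
    cudnnFilterDescriptor_t w;
    cudnnConvolutionDescriptor_t conv;
    cudnnCreateTensorDescriptor(&x);
    cudnnCreateTensorDescriptor(&y);
    cudnnCreateFilterDescriptor(&w);
    cudnnCreateConvolutionDescriptor(&conv);

    cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 32, 64, 56, 56);
    cudnnSetFilter4dDescriptor(w, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 128, 64, 3, 3);
    cudnnSetConvolution2dDescriptor(conv, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    cudnnSetConvolutionMathType(conv, CUDNN_TENSOR_OP_MATH);  // allow Tensor Cores

    int n, c, h, wd;
    cudnnGetConvolution2dForwardOutputDim(conv, x, w, &n, &c, &h, &wd);
    cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, wd);

    cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
    int returned = 0;
    // Times every applicable algorithm on internally allocated buffers.
    cudnnFindConvolutionForwardAlgorithm(handle, x, w, conv, y,
                                         CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                         &returned, perf);
    for (int i = 0; i < returned; ++i)
        if (perf[i].status == CUDNN_STATUS_SUCCESS)
            printf("algo %d: %.3f ms, workspace %zu bytes\n",
                   (int)perf[i].algo, perf[i].time, perf[i].memory);

    cudnnDestroyConvolutionDescriptor(conv);
    cudnnDestroyFilterDescriptor(w);
    cudnnDestroyTensorDescriptor(y);
    cudnnDestroyTensorDescriptor(x);
    cudnnDestroy(handle);
    return 0;
}
```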