Mixed-precision deep learning based on computational memory
Deep neural networks (DNNs) have revolutionized the field of artificial
intelligence and have achieved unprecedented success in cognitive tasks such as
image and speech recognition. Training of large DNNs, however, is
computationally intensive and this has motivated the search for novel computing
architectures targeting this application. A computational memory unit with
nanoscale resistive memory devices organized in crossbar arrays could store the
synaptic weights in their conductance states and perform the expensive weighted
summations in place in a non-von Neumann manner. However, updating the
conductance states in a reliable manner during the weight update process is a
fundamental challenge that limits the training accuracy of such an
implementation. Here, we propose a mixed-precision architecture that combines a
computational memory unit performing the weighted summations and imprecise
conductance updates with a digital processing unit that accumulates the weight
updates in high precision. A combined hardware/software training experiment of
a multilayer perceptron based on the proposed architecture using a phase-change
memory (PCM) array achieves 97.73% test accuracy on the task of classifying
handwritten digits (based on the MNIST dataset), within 0.6% of the software
baseline. The architecture is further evaluated using accurate behavioral
models of PCM on a wide class of networks, namely convolutional neural
networks, long short-term memory (LSTM) networks, and generative adversarial networks (GANs).
Accuracies comparable to those of floating-point implementations are achieved
without being constrained by the non-idealities associated with the PCM
devices. A system-level study demonstrates a 173x improvement in the energy
efficiency of the architecture when used for training a multilayer perceptron,
compared with a dedicated fully digital 32-bit implementation.
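To make the accumulate-and-transfer idea concrete, below is a minimal, hedged sketch of one plausible realization: each weight keeps a high-precision accumulator in the digital unit, and a (noisy) conductance update is issued only when the accumulated value exceeds the granularity of a single programming pulse. The names, the Gaussian write-noise model, and the pulse granularity epsilon are illustrative assumptions, not the authors' implementation.

```cuda
#include <cmath>
#include <random>

// One weight in the mixed-precision scheme: the high-precision accumulator chi
// lives in the digital unit; g is the conductance-encoded weight in memory.
struct MixedPrecisionWeight {
    double chi = 0.0;
    double g   = 0.0;
};

// Accumulate a gradient contribution in high precision; transfer it to the
// device only in whole pulses of size epsilon, each applied imprecisely.
void accumulate_and_transfer(MixedPrecisionWeight &w, double grad, double lr,
                             double epsilon, std::mt19937 &rng) {
    std::normal_distribution<double> write_noise(0.0, 0.1 * epsilon); // assumed PCM write noise
    w.chi -= lr * grad;                                  // high-precision accumulation
    int pulses = static_cast<int>(w.chi / epsilon);      // whole pulses the device can express
    for (int p = 0; p < std::abs(pulses); ++p)
        w.g += std::copysign(epsilon, w.chi) + write_noise(rng); // imprecise conductance update
    w.chi -= pulses * epsilon;                           // residual stays in the accumulator
}
```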
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the
"Tensor Core", that performs one matrix-multiply-and-accumulate operation on 4x4
matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta
microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflop/s in mixed precision. In this paper, we investigate
current approaches to programming NVIDIA Tensor Cores, their performance, and the
precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS
GEMM. After experimenting with the different approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflop/s in mixed precision on a Tesla V100
GPU, seven and three times the performance achievable in single and half
precision, respectively. A WMMA implementation of batched GEMM reaches a
performance of 4 Tflop/s. While the precision loss due to matrix multiplication
with half-precision inputs might be critical in many HPC applications, it can be
considerably reduced at the cost of increased computation. Our results indicate
that HPC applications using matrix multiplications can strongly benefit from
using NVIDIA Tensor Cores.

Comment: This paper has been accepted by the Eighth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES) 2018.
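As an illustration of the WMMA path, below is a minimal, hedged sketch of a kernel in which a single warp computes one 16x16 output tile; note that WMMA exposes Tensor Cores at warp-level 16x16x16 granularity rather than the hardware's per-cycle 4x4 step. The matrix layouts, the fixed tile size, and the launch configuration are simplifying assumptions.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B + 0 for a single 16x16 tile: fp16 inputs,
// fp32 accumulation. Launch with exactly one warp: wmma_tile<<<1, 32>>>(A, B, D);
__global__ void wmma_tile(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, A, 16);              // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag); // Tensor Core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```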
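The computation-for-precision trade-off mentioned above can be realized by residual splitting: round each fp32 operand to fp16, keep the rounding error as a second fp16 value, and recover most of the lost precision with extra half-precision products accumulated in fp32. A hedged sketch of the splitting step follows; the kernel and variable names are assumptions.

```cuda
#include <cuda_fp16.h>

// Split each fp32 value x into fp16 parts so that x ~= hi + lo. With
// A ~= A_hi + A_lo and B ~= B_hi + B_lo, the product is then recovered as
// A*B ~= A_hi*B_hi + A_hi*B_lo + A_lo*B_hi, i.e. three half-precision GEMMs
// accumulated in fp32, dropping only the tiny A_lo*B_lo term.
__global__ void split_fp32(const float *x, half *hi, half *lo, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        half h = __float2half(x[i]);                   // fp16 rounding of x
        hi[i] = h;
        lo[i] = __float2half(x[i] - __half2float(h));  // rounding error, kept in fp16
    }
}
```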