24,400 research outputs found
Flexible Multiple-Precision Fused Arithmetic Units for Efficient Deep Learning Computation
Deep Learning has achieved great success in recent years. In many fields of applications, such as computer vision, biomedical analysis, and natural language processing, deep learning can achieve a performance that is even better than human-level. However, behind this superior performance is the expensive hardware cost required to implement deep learning operations. Deep learning operations are both computation intensive and memory intensive. Many research works in the literature focused on improving the efficiency of deep learning operations. In this thesis, special focus is put on improving deep learning computation and several efficient arithmetic unit architectures are proposed and optimized for deep learning computation. The contents of this thesis can be divided into three parts: (1) the optimization of general-purpose arithmetic units for deep learning computation; (2) the design of deep learning specific arithmetic units; (3) the optimization of deep learning computation using 3D memory architecture.
Deep learning models are usually trained on graphics processing unit (GPU) and the computations are done with single-precision floating-point numbers. However, recent works proved that deep learning computation can be accomplished with low precision numbers. The half-precision numbers are becoming more and more popular in deep learning computation due to their lower hardware cost compared to the single-precision numbers. In conventional floating-point arithmetic units, single-precision and beyond are well supported to achieve a better precision. However, for deep learning computation, since the computations are intensive, low precision computation is desired to achieve better throughput. As the popularity of half-precision raises, half-precision operations are also need to be supported. Moreover, the deep learning computation contains many dot-product operations and therefore, the support of mixed-precision dot-product operations can be explored in a multiple-precision architecture. In this thesis, a multiple-precision fused multiply-add (FMA) architecture is proposed. It supports half/single/double/quadruple-precision FMA operations. In addition, it also supports 2-term mixed-precision dot-product operations. Compared to the conventional multiple-precision FMA architecture, the newly added half-precision support and mixed-precision dot-product only bring minor resource overhead. The proposed FMA can be used as general-purpose arithmetic unit. Due to the support of parallel half-precision computations and mixed-precision dot-product computations, it is especially suitable for deep learning computation.
For the design of deep learning specific computation unit, more optimizations can be performed. First, a fixed-point and floating-point merged multiply-accumulate (MAC) unit is proposed. As deep learning computation can be accomplished with low precision number formats, the support of high precision floating-point operations can be eliminated. In this design, the half-precision floating-point format is supported to provide a large dynamic range to handle small gradients for deep learning training. For deep learning inference, 8-bit fixed-point 2-term dot-product computation is supported. Second, a flexible multiple-precision MAC unit architecture is proposed. The proposed MAC unit supports both fixed-point operations and floating-point operations. For floating-point format, the proposed unit supports one 16-bit MAC operation or sum of two 8-bit multiplications plus a 16-bit addend. To make the proposed MAC unit more versatile, the bit-width of exponent and mantissa can be flexibly exchanged. By setting the bit-width of exponent to zero, the proposed MAC unit also supports fixed-point operations. For fixed-point format, the proposed unit supports one 16-bit MAC or sum of two 8-bit multiplications plus a 16-bit addend. Moreover, the proposed unit can be further divided to support sum of four 4-bit multiplications plus a 16-bit addend. At the lowest precision, the proposed MAC unit supports accumulating of eight 1-bit logic AND operations to enable the support of binary neural networks. Finally, a MAC architecture based on the posit format, a promising numerical format in deep learning computation, is proposed to facilitate the use of posit format in deep learning computation.
In addition to the above mention arithmetic units, an improved hybrid memory cube (HMC) architecture is proposed for weight-sharing deep neural network processing. By modifying the HMC instruction set and HMC logic layer, the major part of the deep learning computation can be accomplished inside memory. The proposed design reduces the memory bandwidth requirements and thus reduces the energy consumed by memory data transfer
Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs
Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks, limiting the productivity and experimentation of deep learning practitioners. As networks grow in size and complexity, training time can be reduced through low-precision data representations and computations. However, in doing so the final accuracy suffers due to the problem of vanishing gradients. Existing state-of-the-art methods combat this issue by means of a mixed-precision approach utilising two different precision levels, FP32 (32-bit floating-point) and FP16/FP8 (16-/8-bit floating-point), leveraging the hardware support of recent GPU architectures for FP16 operations to obtain performance gains. This work pushes the boundary of quantised training by employing a multilevel optimisation approach that utilises multiple precisions including low-precision fixed-point representations. The novel training strategy, MuPPET, combines the use of multiple number representation regimes together with a precision-switching mechanism that decides at run time the transition point between precision regimes. Overall, the proposed strategy tailors the training process to the hardware-level capabilities of the target hardware architecture and yields improvements in training time and energy efficiency compared to state-of-the-art approaches. Applying MuPPET on the training of AlexNet, ResNet18 and GoogLeNet on ImageNet (ILSVRC12) and targeting an NVIDIA Turing GPU, MuPPET achieves the same accuracy as standard full-precision training with training-time speedup of up to 1.84 and an average speedup of 1.58 across the networks
Integer Fine-tuning of Transformer-based Models
Transformer based models are used to achieve state-of-the-art performance on
various deep learning tasks. Since transformer-based models have large numbers
of parameters, fine-tuning them on downstream tasks is computationally
intensive and energy hungry. Automatic mixed-precision FP32/FP16 fine-tuning of
such models has been previously used to lower the compute resource
requirements. However, with the recent advances in the low-bit integer
back-propagation, it is possible to further reduce the computation and memory
foot-print. In this work, we explore a novel integer training method that uses
integer arithmetic for both forward propagation and gradient computation of
linear, convolutional, layer-norm, and embedding layers in transformer-based
models. Furthermore, we study the effect of various integer bit-widths to find
the minimum required bit-width for integer fine-tuning of transformer-based
models. We fine-tune BERT and ViT models on popular downstream tasks using
integer layers. We show that 16-bit integer models match the floating-point
baseline performance. Reducing the bit-width to 10, we observe 0.5 average
score drop. Finally, further reduction of the bit-width to 8 provides an
average score drop of 1.7 points
Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations
Although double-precision floating-point arithmetic currently dominates
high-performance computing, there is increasing interest in smaller and simpler
arithmetic types. The main reasons are potential improvements in energy
efficiency and memory footprint and bandwidth. However, simply switching to
lower-precision types typically results in increased numerical errors. We
investigate approaches to improving the accuracy of reduced-precision
fixed-point arithmetic types, using examples in an important domain for
numerical computation in neuroscience: the solution of Ordinary Differential
Equations (ODEs). The Izhikevich neuron model is used to demonstrate that
rounding has an important role in producing accurate spike timings from
explicit ODE solution algorithms. In particular, fixed-point arithmetic with
stochastic rounding consistently results in smaller errors compared to single
precision floating-point and fixed-point arithmetic with round-to-nearest
across a range of neuron behaviours and ODE solvers. A computationally much
cheaper alternative is also investigated, inspired by the concept of dither
that is a widely understood mechanism for providing resolution below the least
significant bit (LSB) in digital signal processing. These results will have
implications for the solution of ODEs in other subject areas, and should also
be directly relevant to the huge range of practical problems that are
represented by Partial Differential Equations (PDEs).Comment: Submitted to Philosophical Transactions of the Royal Society
- …