Mitigating Outlier Activations in Low-Precision Fine-Tuning of Language Models
Low-precision fine-tuning of language models has gained prominence as a
cost-effective and energy-efficient approach to deploying large-scale models in
various applications. However, this approach is susceptible to outlier values
in the activations. Such outliers can degrade fine-tuning performance in the
low-precision regime because they inflate the quantization scaling factor,
making smaller values harder to represent. This paper investigates techniques
for mitigating outlier activations in low-precision integer fine-tuning of
language models. We propose a novel approach that represents outlier activation
values as 8-bit integers instead of 16-bit floating-point (FP16) values.
Keeping outliers in integer form allows us to use operator tiling and thereby
avoid 16-bit integer matrix multiplication altogether. We provide theoretical
analysis and supporting experiments to demonstrate the effectiveness of our
approach in improving the robustness and performance of low-precision
fine-tuned language models.
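As a rough illustration of the tiling idea, the sketch below splits activation
columns into outlier and regular groups, quantizes each group to 8-bit integers
with its own scale, and accumulates the two int8 tiles in int32 before
rescaling; the outlier threshold, per-tile scales, and column-wise split are
assumptions made for illustration, not the paper's exact algorithm.

    import numpy as np

    def quantize_int8(x, scale):
        # Symmetric int8 quantization with a given scale factor.
        return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

    def tiled_outlier_matmul(x, w, outlier_thresh=6.0):
        # Multiply outlier and regular activation columns as separate int8
        # tiles and accumulate the rescaled partial products in FP32.
        outlier_cols = np.any(np.abs(x) > outlier_thresh, axis=0)
        out = np.zeros((x.shape[0], w.shape[1]), dtype=np.float32)
        for cols in (~outlier_cols, outlier_cols):
            if not cols.any():
                continue
            x_tile, w_tile = x[:, cols], w[cols, :]
            sx = max(np.abs(x_tile).max() / 127.0, 1e-8)  # per-tile activation scale
            sw = max(np.abs(w_tile).max() / 127.0, 1e-8)  # per-tile weight scale
            xq = quantize_int8(x_tile, sx).astype(np.int32)
            wq = quantize_int8(w_tile, sw).astype(np.int32)
            out += (xq @ wq).astype(np.float32) * (sx * sw)  # int32 accumulation
        return out

    x = np.random.randn(4, 8).astype(np.float32)
    x[:, 2] *= 20.0                      # one activation channel with outliers
    w = np.random.randn(8, 3).astype(np.float32)
    print(np.max(np.abs(tiled_outlier_matmul(x, w) - x @ w)))  # small error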
Is Integer Arithmetic Enough for Deep Learning Training?
The ever-increasing computational complexity of deep learning models makes
their training and deployment difficult on various cloud and edge platforms.
Replacing floating-point arithmetic with low-bit integer arithmetic is a
promising approach to save energy, memory footprint, and latency of deep
learning models. As such, quantization has attracted the attention of
researchers in recent years. However, building a fully functional integer
training pipeline, including the forward pass, back-propagation, and stochastic
gradient descent, using integer numbers has not been studied in detail. Our
empirical and
mathematical results reveal that integer arithmetic is enough to train deep
learning models. Unlike recent proposals, instead of quantization, we directly
switch the number representation of computations. Our novel training method
forms a fully integer training pipeline that does not change the trajectory of
the loss and accuracy compared to floating-point, nor does it need any special
hyper-parameter tuning, distribution adjustment, or gradient clipping. Our
experimental results show that our proposed method is effective in a wide
variety of tasks such as classification (including vision transformers), object
detection, and semantic segmentation.
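For intuition about integer-only training arithmetic, the minimal sketch below
performs an SGD weight update entirely with integer multiplies and shifts; the
fixed-point fractional widths and the learning-rate encoding are illustrative
assumptions, and the sketch is far simpler than the paper's full integer
pipeline, which switches the number representation of all computations.

    import numpy as np

    W_FRAC_BITS = 16            # weight fractional bits (assumption)
    G_FRAC_BITS = 16            # gradient fractional bits (assumption)
    LR_NUM, LR_SHIFT = 13, 17   # learning rate ~ 13 / 2**17, roughly 1e-4

    def int_sgd_step(w_int, g_int):
        # w <- w - lr * g, using only integer multiplication and shifts.
        update = (g_int.astype(np.int64) * LR_NUM) >> (
            G_FRAC_BITS + LR_SHIFT - W_FRAC_BITS)
        return (w_int.astype(np.int64) - update).astype(np.int32)

    w = np.array([1 << (W_FRAC_BITS - 1)], dtype=np.int32)  # 0.5 in fixed point
    g = np.array([1 << G_FRAC_BITS], dtype=np.int32)        # 1.0 in fixed point
    w = int_sgd_step(w, g)
    print(w / 2.0**W_FRAC_BITS)   # ~0.4999, i.e. 0.5 minus roughly 1e-4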
ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference
Edge training of Deep Neural Networks (DNNs) is a desirable goal for
continuous learning; however, it is hindered by the enormous computational
power required by training. Hardware approximate multipliers have shown their
effectiveness for gaining resource-efficiency in DNN inference accelerators;
however, training with approximate multipliers is largely unexplored. To build
resource efficient accelerators with approximate multipliers supporting DNN
training, a thorough evaluation of training convergence and accuracy for
different DNN architectures and different approximate multipliers is needed.
This paper presents ApproxTrain, an open-source framework that allows fast
evaluation of DNN training and inference using simulated approximate
multipliers. ApproxTrain is as user-friendly as TensorFlow (TF) and requires
only a high-level description of a DNN architecture along with C/C++ functional
models of the approximate multiplier. We improve the speed of the simulation at
the multiplier level by using a novel LUT-based approximate floating-point (FP)
multiplier simulator on GPU (AMSim). ApproxTrain leverages CUDA and efficiently
integrates AMSim into the TensorFlow library in order to overcome the absence
of native hardware approximate multipliers in commercial GPUs. We use
ApproxTrain to evaluate the convergence and accuracy of DNN training with
approximate multipliers for small and large datasets (including ImageNet) using
LeNet and ResNet architectures. The evaluations demonstrate similar
convergence behavior and negligible change in test accuracy compared to FP32
and bfloat16 multipliers. Compared to CPU-based approximate multiplier
simulations in training and inference, the GPU-accelerated ApproxTrain is more
than 2500x faster. Even though the original TensorFlow builds on highly
optimized closed-source cuDNN/cuBLAS libraries with native hardware
multipliers, it is only 8x faster than ApproxTrain.
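The toy sketch below shows one way a LUT-based approximate float32 multiplier
can be simulated in software, indexing a small table with the top mantissa bits
of both operands and reassembling the sign and exponent; the index width, the
rounding, and the omission of zeros, denormals, and NaNs are simplifying
assumptions, so this is not a reproduction of ApproxTrain's AMSim.

    import numpy as np

    K = 6                                   # mantissa index bits (assumption)
    frac = 1.0 + np.arange(1 << K) / float(1 << K)
    LUT = np.outer(frac, frac)              # approximate significand products

    def approx_fp32_mul(a, b):
        # Element-wise LUT-based approximate float32 multiply (normals only).
        ab, bb = a.view(np.uint32), b.view(np.uint32)
        sign = (ab ^ bb) & 0x80000000
        exp = ((ab >> 23) & 0xFF).astype(np.int32) \
            + ((bb >> 23) & 0xFF).astype(np.int32) - 127
        ia = (ab >> (23 - K)) & ((1 << K) - 1)    # top K mantissa bits of a
        ib = (bb >> (23 - K)) & ((1 << K) - 1)    # top K mantissa bits of b
        sig = LUT[ia, ib]                         # significand product, in [1, 4)
        carry = sig >= 2.0
        sig = np.where(carry, sig / 2.0, sig)
        exp = exp + carry.astype(np.int32)
        mant = np.round((sig - 1.0) * (1 << 23)).astype(np.uint32) & 0x7FFFFF
        return (sign | (exp.astype(np.uint32) << 23) | mant).view(np.float32)

    a = np.array([1.5, -2.75], dtype=np.float32)
    b = np.array([2.0, 0.3125], dtype=np.float32)
    print(approx_fp32_mul(a, b), a * b)   # approximate vs. exact products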
Training with Mixed-Precision Floating-Point Assignments
When training deep neural networks, keeping all tensors in high precision
(e.g., 32-bit or even 16-bit floats) is often wasteful. However, keeping all
tensors in low precision (e.g., 8-bit floats) can lead to unacceptable accuracy
loss. Hence, it is important to use a precision assignment -- a mapping from
all tensors (arising in training) to precision levels (high or low) -- that
keeps most of the tensors in low precision and leads to sufficiently accurate
models. We provide a technique that explores this memory-accuracy tradeoff by
generating precision assignments for convolutional neural networks that (i) use
less memory and (ii) lead to more accurate convolutional networks at the same
time, compared to the precision assignments considered by prior work in
low-precision floating-point training. We evaluate our technique on image
classification tasks by training convolutional networks on CIFAR-10, CIFAR-100,
and ImageNet. Our method typically provides > 2x memory reduction over a
baseline precision assignment while preserving training accuracy, and gives
further reductions by trading off accuracy. Compared to other baselines which
sometimes cause training to diverge, our method provides similar or better
memory reduction while avoiding divergence.
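A precision assignment of this kind can be pictured as a plain mapping from
tensor names to precision levels, with every tensor cast accordingly before
use; the sketch below emulates the low level with float16 because NumPy has no
native 8-bit float, and the tensor names and default policy are purely
illustrative assumptions.

    import numpy as np

    PRECISION = {"high": np.float32, "low": np.float16}  # float16 stands in for FP8

    assignment = {
        "conv1.weight": "low",
        "conv1.activation": "low",
        "bn1.stats": "high",     # e.g. keep normalization statistics in high precision
        "fc.weight": "low",
    }

    def cast_for_training(name, tensor):
        # Cast a tensor to the precision level its assignment prescribes.
        level = assignment.get(name, "high")  # unassigned tensors stay high precision
        return tensor.astype(PRECISION[level])

    w = np.random.randn(64, 3, 3, 3).astype(np.float32)
    w_low = cast_for_training("conv1.weight", w)
    print(w_low.dtype, w.nbytes / w_low.nbytes)   # float16, 2x memory reduction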
On the convergence of stochastic gradient descent in low-precision number formats
Deep learning models are dominating almost all artificial intelligence tasks such as vision, text, and speech processing. Stochastic Gradient Descent (SGD) is the main tool for training such models, where the computations are usually performed in single-precision floating-point number format. The convergence of single-precision SGD normally aligns with the theoretical results for real numbers, since the numerical error is negligible. However, the numerical error grows when the computations are performed in low-precision number formats. This provides compelling reasons to study SGD convergence adapted to low-precision computations. We present both deterministic and stochastic analyses of the SGD algorithm, obtaining bounds that show the effect of the number format. Such bounds can provide guidelines on how SGD convergence is affected when constraints make high-precision computation impractical.
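As an illustration of the kind of result such an analysis yields (a sketch of
the typical form, not the paper's exact statement), a convex, L-smooth
objective optimized by SGD with step size \eta and gradient-noise variance
\sigma^2 picks up an additive error-floor term from the per-update rounding
error \epsilon_{fmt} introduced by the number format:

    % Illustrative low-precision SGD bound; constants and assumptions are
    % placeholders, not those derived in the paper.
    \[
      \mathbb{E}\!\left[ f(\bar{x}_K) - f(x^\star) \right]
      \;\le\;
      \frac{\lVert x_0 - x^\star \rVert^2}{2\eta K}
      \;+\; \frac{\eta \sigma^2}{2}
      \;+\; \mathcal{O}(\epsilon_{\mathrm{fmt}}),
    \]
    % so the usual 1/K decay is preserved while the attainable error floor is
    % set by the precision of the format, not by the step size alone.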
Number Systems for Deep Neural Network Architectures: A Survey
Deep neural networks (DNNs) have become an enabling component for a myriad of
artificial intelligence applications. DNNs have shown performance that
sometimes surpasses that of humans in areas such as self-driving and health
applications. Because of their computational complexity, deploying DNNs on
resource-constrained devices still faces many challenges related to energy
efficiency, latency, and cost. To this end, several research
directions are being pursued by both academia and industry to accelerate and
efficiently implement DNNs. One important direction is determining the
appropriate data representation for the massive amount of data involved in DNN
processing. Using conventional number systems has been found to be sub-optimal
for DNNs. Alternatively, a great body of research focuses on exploring suitable
number systems. This article aims to provide a comprehensive survey and
discussion about alternative number systems for more efficient representations
of DNN data. Various number systems (conventional/unconventional) exploited for
DNNs are discussed. The impact of these number systems on the performance and
hardware design of DNNs is considered. In addition, this paper highlights the
challenges associated with each number system and various solutions that are
proposed for addressing them. The reader will be able to understand the
importance of an efficient number system for DNNs, learn about the widely used
number systems for DNNs, understand the trade-offs between various number
systems, and consider various design aspects that affect the impact of number
systems on DNN performance. In addition, the recent trends and related research
opportunities will be highlighted.
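As a toy example of why the number system matters for DNN data, the sketch
below quantizes a few weights with a 4-bit uniform fixed-point format and with
a 4-bit logarithmic (power-of-two) format; the bit widths and ranges are
arbitrary assumptions chosen only to show how small-magnitude values fare
differently under the two representations.

    import numpy as np

    w = np.array([0.9, 0.11, 0.012, 0.004], dtype=np.float32)

    # 4-bit uniform fixed point on [-1, 1): 16 levels with step 2/16.
    step = 2.0 / 16
    fixed = np.clip(np.round(w / step), -8, 7) * step

    # 4-bit logarithmic: value = sign * 2**e with integer exponent e in [-8, 7].
    e = np.clip(np.round(np.log2(np.abs(w))), -8, 7)
    log2q = np.sign(w) * 2.0 ** e

    print(fixed)   # the two smallest weights collapse to 0.0
    print(log2q)   # every weight keeps a nearby power-of-two magnitude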
Distributed learning under resource constraints
The enormous amount of data encountered in control, signal processing, and machine learning applications presents numerous computational challenges. In many settings, data is distributed across numerous devices and needs to be processed, stored, and communicated, naturally leading to distributed optimization problems. The communication networks connecting the devices are often characterized by unidirectional and time-varying communication links, and the devices themselves face resource constraints. It is therefore important to design learning algorithms that achieve the required accuracy and convergence properties while satisfying the computation and communication constraints. This dissertation first focuses on distributed convex optimization tasks deployed over time-varying networks with communication constraints. For this setting, we develop provably convergent communication-efficient algorithms that rely on sparsification to reduce the communication cost; the efficacy of the developed framework is demonstrated on computer vision and natural language processing tasks. We then turn our attention to more general, challenging decentralized non-convex problems over directed time-varying networks with a stochastic first-order oracle and local computation constraints, for which we develop a provably fast-converging algorithm that achieves highly accurate performance. The final part of the dissertation addresses the practical scenario of distributed learning in which clients are heterogeneous and operate under constraints on local computation power, memory footprint, and communication bandwidth, and proposes the Federated Quantized Self-Supervised Learning (Fed-QSSL) algorithm, an effective framework for federated learning under bitwidth constraints and data heterogeneity. We theoretically analyze the impact of low-bit training on the convergence and robustness of federated learning, and experimentally demonstrate that Fed-QSSL achieves more robust and personalized performance than competing methods.
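As a small illustration of sparsification for communication efficiency (a
generic top-k scheme, not the dissertation's specific algorithm), a device
could transmit only the largest-magnitude gradient entries together with their
indices and let the receiver rebuild a dense vector:

    import numpy as np

    def topk_sparsify(grad, k):
        # Keep only the k largest-magnitude entries; transmit indices + values.
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        return idx, grad[idx]

    def desparsify(idx, vals, shape):
        # Rebuild a dense gradient from the transmitted sparse message.
        g = np.zeros(shape, dtype=vals.dtype)
        g[idx] = vals
        return g

    grad = np.random.randn(10_000).astype(np.float32)
    idx, vals = topk_sparsify(grad, k=100)        # ~1% of the entries survive
    g_hat = desparsify(idx, vals, grad.shape)
    print(idx.nbytes + vals.nbytes, grad.nbytes)  # transmitted vs. full payload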