Mitigating Outlier Activations in Low-Precision Fine-Tuning of Language Models
Low-precision fine-tuning of language models has gained prominence as a
cost-effective and energy-efficient approach to deploying large-scale models in
various applications. However, this approach is susceptible to outlier values
in the activations. Such outliers can degrade fine-tuning performance in the
low-precision regime because they inflate the quantization scaling factor,
making smaller values harder to represent. This paper investigates techniques
for mitigating outlier activations in low-precision integer fine-tuning of
language models. We propose a novel approach that represents outlier activation
values as 8-bit integers instead of 16-bit floating-point (FP16) values.
Keeping outliers in integer form allows us to use operator tiling and thereby
avoid 16-bit integer matrix multiplication altogether. We provide theoretical
analysis and supporting experiments to demonstrate the effectiveness of our
approach in improving the robustness and performance of low-precision
fine-tuned language models.
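As a rough illustration of the tiling idea, the sketch below splits activation
columns into outlier and regular groups, quantizes each group to 8-bit integers
with its own scale, and accumulates the two int8 tiles in int32 before
rescaling; the outlier threshold, per-tile scales, and column-wise split are
assumptions made for illustration, not the paper's exact algorithm.

    import numpy as np

    def quantize_int8(x, scale):
        # Symmetric int8 quantization with a given scale factor.
        return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

    def tiled_outlier_matmul(x, w, outlier_thresh=6.0):
        # Multiply outlier and regular activation columns as separate int8
        # tiles and accumulate the rescaled partial products in FP32.
        outlier_cols = np.any(np.abs(x) > outlier_thresh, axis=0)
        out = np.zeros((x.shape[0], w.shape[1]), dtype=np.float32)
        for cols in (~outlier_cols, outlier_cols):
            if not cols.any():
                continue
            x_tile, w_tile = x[:, cols], w[cols, :]
            sx = max(np.abs(x_tile).max() / 127.0, 1e-8)  # per-tile activation scale
            sw = max(np.abs(w_tile).max() / 127.0, 1e-8)  # per-tile weight scale
            xq = quantize_int8(x_tile, sx).astype(np.int32)
            wq = quantize_int8(w_tile, sw).astype(np.int32)
            out += (xq @ wq).astype(np.float32) * (sx * sw)  # int32 accumulation
        return out

    x = np.random.randn(4, 8).astype(np.float32)
    x[:, 2] *= 20.0                      # one activation channel with outliers
    w = np.random.randn(8, 3).astype(np.float32)
    print(np.max(np.abs(tiled_outlier_matmul(x, w) - x @ w)))  # small error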
Is Integer Arithmetic Enough for Deep Learning Training?
The ever-increasing computational complexity of deep learning models makes
their training and deployment difficult on various cloud and edge platforms.
Replacing floating-point arithmetic with low-bit integer arithmetic is a
promising approach to save energy, memory footprint, and latency of deep
learning models. As such, quantization has attracted the attention of
researchers in recent years. However, building a fully functional integer
training pipeline, including the forward pass, back-propagation, and stochastic
gradient descent, using integer numbers has not been studied in detail. Our
empirical and
mathematical results reveal that integer arithmetic is enough to train deep
learning models. Unlike recent proposals, instead of quantization, we directly
switch the number representation of computations. Our novel training method
forms a fully integer training pipeline that does not change the trajectory of
the loss and accuracy compared to floating-point, nor does it need any special
hyper-parameter tuning, distribution adjustment, or gradient clipping. Our
experimental results show that our proposed method is effective in a wide
variety of tasks such as classification (including vision transformers), object
detection, and semantic segmentation.
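For intuition about integer-only training arithmetic, the minimal sketch below
performs an SGD weight update entirely with integer multiplies and shifts; the
fixed-point fractional widths and the learning-rate encoding are illustrative
assumptions, and the sketch is far simpler than the paper's full integer
pipeline, which switches the number representation of all computations.

    import numpy as np

    W_FRAC_BITS = 16            # weight fractional bits (assumption)
    G_FRAC_BITS = 16            # gradient fractional bits (assumption)
    LR_NUM, LR_SHIFT = 13, 17   # learning rate ~ 13 / 2**17, roughly 1e-4

    def int_sgd_step(w_int, g_int):
        # w <- w - lr * g, using only integer multiplication and shifts.
        update = (g_int.astype(np.int64) * LR_NUM) >> (
            G_FRAC_BITS + LR_SHIFT - W_FRAC_BITS)
        return (w_int.astype(np.int64) - update).astype(np.int32)

    w = np.array([1 << (W_FRAC_BITS - 1)], dtype=np.int32)  # 0.5 in fixed point
    g = np.array([1 << G_FRAC_BITS], dtype=np.int32)        # 1.0 in fixed point
    w = int_sgd_step(w, g)
    print(w / 2.0**W_FRAC_BITS)   # ~0.4999, i.e. 0.5 minus roughly 1e-4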
ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference
Edge training of Deep Neural Networks (DNNs) is a desirable goal for
continuous learning; however, it is hindered by the enormous computational
power required by training. Hardware approximate multipliers have shown their
effectiveness for gaining resource-efficiency in DNN inference accelerators;
however, training with approximate multipliers is largely unexplored. To build
resource efficient accelerators with approximate multipliers supporting DNN
training, a thorough evaluation of training convergence and accuracy for
different DNN architectures and different approximate multipliers is needed.
This paper presents ApproxTrain, an open-source framework that allows fast
evaluation of DNN training and inference using simulated approximate
multipliers. ApproxTrain is as user-friendly as TensorFlow (TF) and requires
only a high-level description of a DNN architecture along with C/C++ functional
models of the approximate multiplier. We improve the speed of the simulation at
the multiplier level by using a novel LUT-based approximate floating-point (FP)
multiplier simulator on GPU (AMSim). ApproxTrain leverages CUDA and efficiently
integrates AMSim into the TensorFlow library in order to overcome the absence
of native hardware approximate multipliers in commercial GPUs. We use
ApproxTrain to evaluate the convergence and accuracy of DNN training with
approximate multipliers for small and large datasets (including ImageNet) using
LeNet and ResNet architectures. The evaluations demonstrate similar
convergence behavior and negligible change in test accuracy compared to FP32
and bfloat16 multipliers. Compared to CPU-based approximate multiplier
simulations in training and inference, the GPU-accelerated ApproxTrain is more
than 2500x faster. Even though the original TensorFlow builds on highly
optimized closed-source cuDNN/cuBLAS libraries with native hardware
multipliers, it is only 8x faster than ApproxTrain.
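The toy sketch below shows one way a LUT-based approximate float32 multiplier
can be simulated in software, indexing a small table with the top mantissa bits
of both operands and reassembling the sign and exponent; the index width, the
rounding, and the omission of zeros, denormals, and NaNs are simplifying
assumptions, so this is not a reproduction of ApproxTrain's AMSim.

    import numpy as np

    K = 6                                   # mantissa index bits (assumption)
    frac = 1.0 + np.arange(1 << K) / float(1 << K)
    LUT = np.outer(frac, frac)              # approximate significand products

    def approx_fp32_mul(a, b):
        # Element-wise LUT-based approximate float32 multiply (normals only).
        ab, bb = a.view(np.uint32), b.view(np.uint32)
        sign = (ab ^ bb) & 0x80000000
        exp = ((ab >> 23) & 0xFF).astype(np.int32) \
            + ((bb >> 23) & 0xFF).astype(np.int32) - 127
        ia = (ab >> (23 - K)) & ((1 << K) - 1)    # top K mantissa bits of a
        ib = (bb >> (23 - K)) & ((1 << K) - 1)    # top K mantissa bits of b
        sig = LUT[ia, ib]                         # significand product, in [1, 4)
        carry = sig >= 2.0
        sig = np.where(carry, sig / 2.0, sig)
        exp = exp + carry.astype(np.int32)
        mant = np.round((sig - 1.0) * (1 << 23)).astype(np.uint32) & 0x7FFFFF
        return (sign | (exp.astype(np.uint32) << 23) | mant).view(np.float32)

    a = np.array([1.5, -2.75], dtype=np.float32)
    b = np.array([2.0, 0.3125], dtype=np.float32)
    print(approx_fp32_mul(a, b), a * b)   # approximate vs. exact products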
Training with Mixed-Precision Floating-Point Assignments
When training deep neural networks, keeping all tensors in high precision
(e.g., 32-bit or even 16-bit floats) is often wasteful. However, keeping all
tensors in low precision (e.g., 8-bit floats) can lead to unacceptable accuracy
loss. Hence, it is important to use a precision assignment -- a mapping from
all tensors (arising in training) to precision levels (high or low) -- that
keeps most of the tensors in low precision and leads to sufficiently accurate
models. We provide a technique that explores this memory-accuracy tradeoff by
generating precision assignments for convolutional neural networks that (i) use
less memory and (ii) lead to more accurate convolutional networks at the same
time, compared to the precision assignments considered by prior work in
low-precision floating-point training. We evaluate our technique on image
classification tasks by training convolutional networks on CIFAR-10, CIFAR-100,
and ImageNet. Our method typically provides > 2x memory reduction over a
baseline precision assignment while preserving training accuracy, and gives
further reductions by trading off accuracy. Compared to other baselines which
sometimes cause training to diverge, our method provides similar or better
memory reduction while avoiding divergence.
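A precision assignment of this kind can be pictured as a plain mapping from
tensor names to precision levels, with every tensor cast accordingly before
use; the sketch below emulates the low level with float16 because NumPy has no
native 8-bit float, and the tensor names and default policy are purely
illustrative assumptions.

    import numpy as np

    PRECISION = {"high": np.float32, "low": np.float16}  # float16 stands in for FP8

    assignment = {
        "conv1.weight": "low",
        "conv1.activation": "low",
        "bn1.stats": "high",     # e.g. keep normalization statistics in high precision
        "fc.weight": "low",
    }

    def cast_for_training(name, tensor):
        # Cast a tensor to the precision level its assignment prescribes.
        level = assignment.get(name, "high")  # unassigned tensors stay high precision
        return tensor.astype(PRECISION[level])

    w = np.random.randn(64, 3, 3, 3).astype(np.float32)
    w_low = cast_for_training("conv1.weight", w)
    print(w_low.dtype, w.nbytes / w_low.nbytes)   # float16, 2x memory reduction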
On the convergence of stochastic gradient descent in low-precision number formats
Deep learning models are dominating almost all artificial intelligence tasks such as vision, text, and speech processing. Stochastic Gradient Descent (SGD) is the main tool for training such models, where the computations are usually performed in single-precision floating-point number format. The convergence of single-precision SGD normally aligns with the theoretical results for real numbers, since the numerical error is negligible. However, the numerical error grows when the computations are performed in low-precision number formats. This provides compelling reasons to study SGD convergence adapted to low-precision computations. We present both deterministic and stochastic analyses of the SGD algorithm, obtaining bounds that show the effect of the number format. Such bounds can provide guidelines on how SGD convergence is affected when constraints make high-precision computation impractical.
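As an illustration of the kind of result such an analysis yields (a sketch of
the typical form, not the paper's exact statement), a convex, L-smooth
objective optimized by SGD with step size \eta and gradient-noise variance
\sigma^2 picks up an additive error-floor term from the per-update rounding
error \epsilon_{fmt} introduced by the number format:

    % Illustrative low-precision SGD bound; constants and assumptions are
    % placeholders, not those derived in the paper.
    \[
      \mathbb{E}\!\left[ f(\bar{x}_K) - f(x^\star) \right]
      \;\le\;
      \frac{\lVert x_0 - x^\star \rVert^2}{2\eta K}
      \;+\; \frac{\eta \sigma^2}{2}
      \;+\; \mathcal{O}(\epsilon_{\mathrm{fmt}}),
    \]
    % so the usual 1/K decay is preserved while the attainable error floor is
    % set by the precision of the format, not by the step size alone.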
Number Systems for Deep Neural Network Architectures: A Survey
Deep neural networks (DNNs) have become an enabling component for a myriad of
artificial intelligence applications. DNNs have shown performance that
sometimes surpasses that of humans in areas such as self-driving and health
applications. Because of their computational complexity, deploying DNNs on
resource-constrained devices still faces many challenges related to energy
efficiency, latency, and cost. To this end, several research
directions are being pursued by both academia and industry to accelerate and
efficiently implement DNNs. One important direction is determining the
appropriate data representation for the massive amount of data involved in DNN
processing. Using conventional number systems has been found to be sub-optimal
for DNNs. Alternatively, a great body of research focuses on exploring suitable
number systems. This article aims to provide a comprehensive survey and
discussion about alternative number systems for more efficient representations
of DNN data. Various number systems (conventional/unconventional) exploited for
DNNs are discussed. The impact of these number systems on the performance and
hardware design of DNNs is considered. In addition, this paper highlights the
challenges associated with each number system and various solutions that are
proposed for addressing them. The reader will be able to understand the
importance of an efficient number system for DNNs, learn about the widely used
number systems for DNNs, understand the trade-offs between various number
systems, and consider various design aspects that affect the impact of number
systems on DNN performance. In addition, the recent trends and related research
opportunities will be highlighted.
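As a toy example of why the number system matters for DNN data, the sketch
below quantizes a few weights with a 4-bit uniform fixed-point format and with
a 4-bit logarithmic (power-of-two) format; the bit widths and ranges are
arbitrary assumptions chosen only to show how small-magnitude values fare
differently under the two representations.

    import numpy as np

    w = np.array([0.9, 0.11, 0.012, 0.004], dtype=np.float32)

    # 4-bit uniform fixed point on [-1, 1): 16 levels with step 2/16.
    step = 2.0 / 16
    fixed = np.clip(np.round(w / step), -8, 7) * step

    # 4-bit logarithmic: value = sign * 2**e with integer exponent e in [-8, 7].
    e = np.clip(np.round(np.log2(np.abs(w))), -8, 7)
    log2q = np.sign(w) * 2.0 ** e

    print(fixed)   # the two smallest weights collapse to 0.0
    print(log2q)   # every weight keeps a nearby power-of-two magnitude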
Distributed learning under resource constraints
The enormous amount of data encountered in control, signal processing, and machine learning applications presents numerous computational challenges. In many settings, data is distributed across numerous devices and needs to be processed, stored, and communicated, naturally leading to distributed optimization problems. The communication networks connecting the devices are often characterized by unidirectional and time-varying communication links, and the devices themselves face resource constraints. It is therefore important to design learning algorithms that achieve the required accuracy and convergence properties while satisfying the computation and communication constraints. This dissertation first focuses on distributed convex optimization tasks deployed over time-varying networks with communication constraints. For this setting, we develop provably convergent communication-efficient algorithms that rely on sparsification to reduce the communication cost; the efficacy of the developed framework is demonstrated on computer vision and natural language processing tasks. We then turn our attention to more general, challenging decentralized non-convex problems over directed time-varying networks with a stochastic first-order oracle and local computation constraints, for which we develop a provably fast-converging algorithm that achieves highly accurate performance. The final part of the dissertation addresses the practical scenario of distributed learning in which clients are heterogeneous and operate under constraints on local computation power, memory footprint, and communication bandwidth, and proposes the Federated Quantized Self-Supervised Learning (Fed-QSSL) algorithm, an effective framework for federated learning under bitwidth constraints and data heterogeneity. We theoretically analyze the impact of low-bit training on the convergence and robustness of federated learning, and experimentally demonstrate that Fed-QSSL achieves more robust and personalized performance than competing methods.
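As a small illustration of sparsification for communication efficiency (a
generic top-k scheme, not the dissertation's specific algorithm), a device
could transmit only the largest-magnitude gradient entries together with their
indices and let the receiver rebuild a dense vector:

    import numpy as np

    def topk_sparsify(grad, k):
        # Keep only the k largest-magnitude entries; transmit indices + values.
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        return idx, grad[idx]

    def desparsify(idx, vals, shape):
        # Rebuild a dense gradient from the transmitted sparse message.
        g = np.zeros(shape, dtype=vals.dtype)
        g[idx] = vals
        return g

    grad = np.random.randn(10_000).astype(np.float32)
    idx, vals = topk_sparsify(grad, k=100)        # ~1% of the entries survive
    g_hat = desparsify(idx, vals, grad.shape)
    print(idx.nbytes + vals.nbytes, grad.nbytes)  # transmitted vs. full payload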