3 research outputs found
Is Integer Arithmetic Enough for Deep Learning Training?
The ever-increasing computational complexity of deep learning models makes
their training and deployment difficult on various cloud and edge platforms.
Replacing floating-point arithmetic with low-bit integer arithmetic is a
promising approach to reduce the energy consumption, memory footprint, and latency of deep
learning models. As such, quantization has attracted the attention of
researchers in recent years. However, using integer numbers to form a fully
functional integer training pipeline, including the forward pass, back-propagation,
and stochastic gradient descent, has not been studied in detail. Our empirical and
mathematical results reveal that integer arithmetic is enough to train deep
learning models. Unlike recent proposals that rely on quantization, we directly
switch the number representation of computations. Our novel training method
forms a fully integer training pipeline that does not change the trajectory of
the loss and accuracy compared to floating-point, nor does it need any special
hyper-parameter tuning, distribution adjustment, or gradient clipping. Our
experimental results show that our proposed method is effective in a wide
variety of tasks such as classification (including vision transformers), object
detection, and semantic segmentation.
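To make the idea of an integer training pipeline concrete, here is a minimal sketch of a linear layer whose forward and backward matrix products run on integer tensors in a fixed-point format. The scale choice, dtype, and all names are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch (illustrative only): fixed-point integer matmul in both the
# forward pass and the gradient computation of a linear layer.
import torch

FRAC_BITS = 8           # assumed number of fixed-point fractional bits
SCALE = 2 ** FRAC_BITS

def to_int(x):
    """Round a float tensor into int64 fixed point."""
    return torch.round(x * SCALE).to(torch.int64)

def to_float(x_int, extra_scale=1):
    """Map an integer fixed-point tensor back to float."""
    return x_int.to(torch.float32) / (SCALE * extra_scale)

class IntLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        x_i, w_i = to_int(x), to_int(w)
        ctx.save_for_backward(x_i, w_i)
        # integer matmul; the product carries two factors of SCALE
        y_i = x_i @ w_i.t()
        return to_float(y_i, extra_scale=SCALE)

    @staticmethod
    def backward(ctx, grad_out):
        x_i, w_i = ctx.saved_tensors
        g_i = to_int(grad_out)
        # gradients are also computed with integer matmuls
        grad_x = to_float(g_i @ w_i, extra_scale=SCALE)
        grad_w = to_float(g_i.t() @ x_i, extra_scale=SCALE)
        return grad_x, grad_w

# usage (CPU):
x = torch.randn(4, 16)
w = torch.randn(8, 16, requires_grad=True)
y = IntLinear.apply(x, w)
y.sum().backward()
print(w.grad.shape)  # torch.Size([8, 16])
```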
Efficient Fine-Tuning of Compressed Language Models with Learners
Fine-tuning BERT-based models is resource-intensive in memory, computation,
and time. While many prior works aim to improve inference efficiency via
compression techniques, e.g., pruning, these works do not explicitly address
the computational challenges of training on downstream tasks. We introduce
Learner modules and priming, novel methods for fine-tuning that exploit the
overparameterization of pre-trained language models to gain benefits in
convergence speed and resource utilization. Learner modules navigate the double
bind of 1) training efficiently by fine-tuning a subset of parameters, and 2)
training effectively by ensuring quick convergence and high metric scores. Our
results on DistilBERT demonstrate that learners perform on par with or surpass
the baselines. Learners train 7x fewer parameters than state-of-the-art methods
on GLUE. On CoLA, learners fine-tune 20% faster and have significantly lower
resource utilization.
Comment: 8 pages, 9 figures, 2 tables; presented at the ICML 2022 workshop on Hardware-Aware Efficient Training (HAET 2022).
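The abstract does not specify the Learner module architecture, but the general pattern of training only a small subset of parameters can be sketched as follows. The adapter-style bottleneck, sizes, and names below are assumptions for illustration, not the paper's design:

```python
# Sketch of parameter-efficient fine-tuning: freeze a backbone and train only
# small added modules plus the task head (assumed design, not the paper's).
import torch
import torch.nn as nn

class Learner(nn.Module):
    """A small trainable bottleneck added next to a frozen layer (assumed)."""
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h):
        # residual update produced by the small trainable module
        return h + self.up(torch.relu(self.down(h)))

class FrozenBackboneWithLearners(nn.Module):
    def __init__(self, dim=768, depth=4, num_labels=2):
        super().__init__()
        # stand-in for a pre-trained encoder; its weights stay frozen
        self.backbone = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.learners = nn.ModuleList(Learner(dim) for _ in range(depth))
        self.head = nn.Linear(dim, num_labels)
        for p in self.backbone.parameters():
            p.requires_grad = False  # only learners and the head are fine-tuned

    def forward(self, x):
        for layer, learner in zip(self.backbone, self.learners):
            x = learner(torch.relu(layer(x)))
        return self.head(x)

model = FrozenBackboneWithLearners()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
# trainable parameters are a small fraction of the full model
print(sum(p.numel() for p in trainable), "of",
      sum(p.numel() for p in model.parameters()))
```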
Integer Fine-tuning of Transformer-based Models
Transformer-based models are used to achieve state-of-the-art performance on
various deep learning tasks. Since transformer-based models have large numbers
of parameters, fine-tuning them on downstream tasks is computationally
intensive and energy-hungry. Automatic mixed-precision FP32/FP16 fine-tuning of
such models has previously been used to lower the compute resource
requirements. However, with recent advances in low-bit integer
back-propagation, it is possible to further reduce the computation and memory
footprint. In this work, we explore a novel integer training method that uses
integer arithmetic for both forward propagation and gradient computation of
linear, convolutional, layer-norm, and embedding layers in transformer-based
models. Furthermore, we study the effect of various integer bit-widths to find
the minimum required bit-width for integer fine-tuning of transformer-based
models. We fine-tune BERT and ViT models on popular downstream tasks using
integer layers. We show that 16-bit integer models match the floating-point
baseline performance. Reducing the bit-width to 10, we observe a 0.5-point drop
in average score. Finally, further reducing the bit-width to 8 results in an
average score drop of 1.7 points.
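The kind of bit-width sweep this abstract describes can be illustrated with a small sketch: quantize a weight tensor to b-bit signed integers with a per-tensor scale and observe how reconstruction error grows as b shrinks. The symmetric quantizer and the error metric below are assumptions made for illustration, not the paper's method:

```python
# Sketch of a bit-width sweep: symmetric per-tensor quantization to b-bit
# signed integers, comparing reconstruction error at 16, 10, and 8 bits.
import torch

def quantize_symmetric(x, bits):
    """Map a float tensor to b-bit signed integers with a per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q.to(torch.int32), scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(768, 768)  # stand-in for a transformer weight matrix
for bits in (16, 10, 8):
    q, s = quantize_symmetric(w, bits)
    err = (dequantize(q, s) - w).abs().mean().item()
    print(f"{bits}-bit: mean abs reconstruction error {err:.6f}")
```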