Mitigating Outlier Activations in Low-Precision Fine-Tuning of Language Models
Low-precision fine-tuning of language models has gained prominence as a
cost-effective and energy-efficient approach to deploying large-scale models in
various applications. This approach, however, is susceptible to outlier values
in the activations: outliers inflate the quantization scaling factor, making
smaller values harder to represent and thereby degrading fine-tuning
performance in the low-precision regime. This paper investigates techniques for
mitigating outlier activations in low-precision integer fine-tuning of language
models. Our novel approach represents the outlier activation values as 8-bit
integers instead of floating-point (FP16) values. Keeping the outliers in
integer form allows us to apply operator tiling and thus avoid performing
16-bit integer matrix multiplication. We provide theoretical analysis and
supporting experiments to demonstrate the effectiveness of our approach in
improving the robustness and performance of low-precision fine-tuned language
models.
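
To make the scaling-factor problem concrete: in symmetric per-tensor int8 quantization the scale is set by the largest absolute value, so a single outlier stretches the representable range and collapses the bulk of the tensor onto a few integer levels. The sketch below only illustrates that failure mode; it is not the paper's method, and `quantize_int8`/`dequantize` are hypothetical helper names.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: scale set by the max magnitude."""
    scale = np.abs(x).max() / 127.0          # one outlier dominates this scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=1024).astype(np.float32)

# Without an outlier, small values survive the round trip reasonably well.
q, s = quantize_int8(acts)
print("mean abs error, no outlier: ", np.abs(dequantize(q, s) - acts).mean())

# A single large outlier inflates the scale ~30-40x, so most of the tensor
# is rounded onto a handful of integer levels and small values are lost.
acts_out = acts.copy()
acts_out[0] = 120.0
q, s = quantize_int8(acts_out)
print("mean abs error, with outlier:", np.abs(dequantize(q, s) - acts_out).mean())
```

The paper's remedy keeps even the outliers in int8 and relies on operator tiling so that no 16-bit integer matrix multiplication is needed; the sketch above merely motivates why a single shared scale breaks down.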
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
Transformer models have been widely adopted across domains in recent years,
and large language models in particular have advanced the field of AI
significantly. As these networks have grown, their capabilities have increased
tremendously, but at the cost of a significant increase in required compute.
Quantization is one of the most effective ways to reduce the computational time
and memory consumption of neural networks. Many studies have shown, however,
that modern transformer models tend to learn strong outliers in their
activations, making them difficult to quantize. To retain acceptable
performance, these outliers force activations into higher bitwidths or require
different numeric formats, extra fine-tuning, or other workarounds. We show
that strong outliers are related to very specific behavior of attention heads
that try to learn a "no-op" or merely a partial update of the residual. To
achieve the exact zeros needed in the attention matrix for a no-update, the
input to the softmax is pushed larger and larger during training, causing
outliers in other parts of the network. Based on these observations, we propose
two simple, independent modifications to the attention mechanism: clipped
softmax and gated attention. We empirically show that models pre-trained with
our methods learn significantly smaller outliers while maintaining, and
sometimes even improving, floating-point task performance. This enables full
INT8 quantization of the activations without any additional effort. We
demonstrate the effectiveness of our methods on both language models (BERT,
OPT) and vision transformers.
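
For concreteness, below is a minimal PyTorch sketch of the first modification, clipped softmax. The stretch-and-clip form clip((ζ − γ)·softmax(x) + γ, 0, 1) with γ < 0 follows the paper's description, but the specific γ value here is illustrative rather than the paper's tuned hyperparameter.

```python
import torch
import torch.nn.functional as F

def clipped_softmax(logits: torch.Tensor, zeta: float = 1.0,
                    gamma: float = -0.03, dim: int = -1) -> torch.Tensor:
    """Clipped softmax: stretch the softmax output to [gamma, zeta], then clip
    back to [0, 1]. With gamma < 0, exact zeros become reachable with finite
    logits, so a head can ignore tokens without pushing the softmax input to
    the huge magnitudes that cause activation outliers."""
    probs = F.softmax(logits, dim=dim)
    return torch.clamp((zeta - gamma) * probs + gamma, min=0.0, max=1.0)

# Moderate logits already produce exact zeros -- no extreme values needed.
logits = torch.tensor([4.0, 0.0, 0.0, 0.0])
print(F.softmax(logits, dim=-1))   # small but nonzero tail probabilities
print(clipped_softmax(logits))     # tail clipped to exactly 0.0
```

The second modification, gated attention, is independent of this one: each head's output is multiplied by a learned sigmoid gate, so a head can perform a small or null residual update without driving the softmax input to extreme magnitudes.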
- …