LNCS
Quantization converts neural networks into low-bit fixed-point computations which can be carried out by efficient integer-only hardware, and is standard practice for the deployment of neural networks on real-time embedded devices. However, like their real-numbered counterparts, quantized networks are not immune to malicious misclassification caused by adversarial attacks. We investigate how quantization affects a network’s robustness to adversarial attacks, which is a formal verification question. We show that neither robustness nor non-robustness is monotonic in the number of bits used for the representation, and that neither is preserved by quantization from a real-numbered network. For this reason, we introduce a verification method for quantized neural networks which, using SMT solving over bit-vectors, accounts for their exact, bit-precise semantics. We built a tool and analyzed the effect of quantization on a classifier for the MNIST dataset. We demonstrate that, compared to our method, existing methods for the analysis of real-numbered networks often derive false conclusions about their quantizations, both when determining robustness and when detecting attacks, and that existing methods for quantized networks often miss attacks. Furthermore, we applied our method beyond robustness, showing how the number of bits in quantization enlarges the gender bias of a predictor for students’ grades.
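The abstract's point that behaviour depends on the exact bit-level semantics can be illustrated with a small sketch. The abstract does not give the rounding and saturation rules of the networks it analyzes, so the ones used here (round to nearest, saturate on overflow, re-quantize after every operation) are assumptions, and the function names are hypothetical.

```python
def quantize_fixed_point(x, total_bits, frac_bits):
    """Map x to a signed fixed-point value with saturation (assumed semantics)."""
    scale = 1 << frac_bits                      # number of steps per unit
    lo = -(1 << (total_bits - 1))               # smallest representable integer
    hi = (1 << (total_bits - 1)) - 1            # largest representable integer
    q = max(lo, min(hi, round(x * scale)))      # round to nearest, then saturate
    return q / scale


def quantized_neuron(weights, bias, inputs, total_bits, frac_bits):
    """One ReLU neuron in which every intermediate value is re-quantized."""
    def q(v):
        return quantize_fixed_point(v, total_bits, frac_bits)

    acc = q(bias)
    for w, x in zip(weights, inputs):
        acc = q(acc + q(q(w) * q(x)))
    return max(0.0, acc)


# Same real-valued neuron, two bit widths: the outputs differ.
wide = quantized_neuron([0.5, -0.25], 0.1, [1.0, 2.0], total_bits=8, frac_bits=4)
narrow = quantized_neuron([0.5, -0.25], 0.1, [1.0, 2.0], total_bits=4, frac_bits=2)
```

In this toy case the 8-bit neuron is active and the 4-bit neuron is not, because the input 2.0 saturates in the narrower format. Effects of this kind are why an analysis of the real-valued network need not carry over to its quantization.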
Understanding Chat Messages for Sticker Recommendation in Messaging Apps
Stickers are popularly used in messaging apps such as Hike to visually
express a nuanced range of thoughts and utterances to convey exaggerated
emotions. However, discovering the right sticker from a large and ever
expanding pool of stickers while chatting can be cumbersome. In this paper, we
describe a system for recommending stickers in real time as the user is typing
based on the context of the conversation. We decompose the sticker
recommendation (SR) problem into two steps. First, we predict the message that
the user is likely to send in the chat. Second, we substitute the predicted
message with an appropriate sticker. The majority of Hike's messages are in the
form of text which is transliterated from users' native language to the Roman
script. This leads to numerous orthographic variations of the same message and
makes accurate message prediction challenging. To address this issue, we learn
dense representations of chat messages employing character level convolution
network in an unsupervised manner. We use them to cluster the messages that
have the same meaning. In the subsequent steps, we predict the message cluster
instead of the message. Our approach does not depend on human labelled data
(except for validation), leading to a fully automatic updating and tuning
pipeline for the underlying models. We also propose a novel hybrid message
prediction model, which can run with low latency on low-end phones that have
severe computational limitations. Our described system has been deployed for
more than months and is being used by millions of users along with hundreds
of thousands of expressive stickers.
BitDelta: Your Fine-Tune May Only Be Worth One Bit
Large Language Models (LLMs) are typically trained in two phases:
pre-training on large internet-scale datasets, and fine-tuning for downstream
tasks. Given the higher computational demand of pre-training, it's intuitive to
assume that fine-tuning adds less new information to the model, and is thus
more compressible. We explore this assumption by decomposing the weights of
fine-tuned models into their pre-trained components and an additional delta. We
introduce a simple method, BitDelta, which successfully quantizes this delta
down to 1 bit without compromising performance. This interesting finding not
only highlights the potential redundancy of information added during
fine-tuning, but also has significant implications for the multi-tenant serving
and multi-tenant storage of fine-tuned models. By enabling the use of a single
high-precision base model accompanied by multiple 1-bit deltas, BitDelta
dramatically reduces GPU memory requirements by more than 10x, which can also
be translated to enhanced generation latency in multi-tenant settings. We
validate BitDelta through experiments across Llama-2 and Mistral model
families, and on models up to 70B parameters, showcasing minimal performance
degradation over all tested settings.
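As a rough illustration of the decomposition described above, the following sketch splits fine-tuned weights into base weights plus a delta and stores only the sign of each delta entry together with one shared scale. The abstract states only that the delta is quantized to 1 bit; using the mean absolute delta as the scale is an assumption here, and both function names are hypothetical.

```python
def bitdelta_compress(base, fine):
    """Return 1-bit signs and a single scale for the delta fine - base."""
    delta = [f - b for f, b in zip(fine, base)]
    # Assumed choice of scale: the mean magnitude of the delta entries.
    scale = sum(abs(d) for d in delta) / len(delta)
    signs = [1 if d >= 0 else -1 for d in delta]
    return signs, scale


def bitdelta_reconstruct(base, signs, scale):
    """Approximate the fine-tuned weights from the shared base and a 1-bit delta."""
    return [b + scale * s for b, s in zip(base, signs)]


base = [0.0, 1.0, -1.0, 2.0]
fine = [0.5, 0.5, -0.5, 2.5]
signs, scale = bitdelta_compress(base, fine)
approx = bitdelta_reconstruct(base, signs, scale)
```

In a multi-tenant setting the `base` list would be stored once, and each fine-tuned model would contribute only its `signs` and `scale`.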
Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers
Quantization scale and bit-width are the most important parameters when
considering how to quantize a neural network. Prior work focuses on optimizing
quantization scales in a global manner through gradient methods (gradient
descent and Hessian analysis). Yet, when applying perturbations to quantization
scales, we observe a very jagged, highly non-smooth test loss landscape. In
fact, small perturbations in quantization scale can greatly affect accuracy,
yielding an accuracy boost in 4-bit quantized vision transformers
(ViTs). In this regime, gradient methods break down, since they cannot reliably
reach local minima. In our work, dubbed Evol-Q, we use evolutionary search to
effectively traverse the non-smooth landscape. Additionally, we propose using
an infoNCE loss, which not only helps combat overfitting on the small
calibration dataset ( images) but also makes traversing such a highly
non-smooth surface easier. Evol-Q improves the top-1 accuracy of a fully
quantized ViT-Base by , , and for -bit, -bit,
and -bit weight quantization levels. Extensive experiments on a variety of
CNN and ViT architectures further demonstrate its robustness in extreme
quantization scenarios. Our code is available at
https://github.com/enyac-group/evol-q
Comment: arXiv admin note: text overlap with arXiv:2211.0964
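The idea of searching a non-smooth landscape by perturbation rather than by gradients can be sketched as a simple perturb-and-select loop over a single quantization scale. This is not the Evol-Q implementation: the loss is a stand-in (mean squared quantization error instead of the infoNCE loss named in the abstract), and all names and default parameters are hypothetical.

```python
import random


def quantize(values, scale, bits):
    """Uniform symmetric quantization of a list of floats."""
    qmax = (1 << (bits - 1)) - 1
    qmin = -(1 << (bits - 1))
    return [scale * max(qmin, min(qmax, round(v / scale))) for v in values]


def mse(values, scale, bits):
    """Stand-in loss: mean squared error between values and their quantization."""
    q = quantize(values, scale, bits)
    return sum((a - b) ** 2 for a, b in zip(values, q)) / len(values)


def evolve_scale(values, init_scale, bits, generations=50, children=8,
                 sigma=0.02, seed=0):
    """Keep the best scale seen so far; propose small random perturbations of it."""
    rng = random.Random(seed)
    best_scale, best_loss = init_scale, mse(values, init_scale, bits)
    for _ in range(generations):
        for _ in range(children):
            cand = best_scale * (1.0 + rng.uniform(-sigma, sigma))
            loss = mse(values, cand, bits)
            if loss < best_loss:          # elitist selection: never accept a worse scale
                best_scale, best_loss = cand, loss
    return best_scale, best_loss


weights = [0.11, -0.52, 0.93, -0.27, 0.64]
start_loss = mse(weights, 0.2, 4)
scale, loss = evolve_scale(weights, init_scale=0.2, bits=4)
```

Because rounding makes the loss piecewise constant in places and discontinuous in others, the loop relies only on loss evaluations and never on derivatives.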
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
Large Language Models (LLMs) from the GPT family have become extremely
popular, leading to a race towards reducing their inference costs to allow for
efficient local computation. Yet, the vast majority of existing work focuses on
weight-only quantization, which can reduce runtime costs in the memory-bound
one-token-at-a-time generative setting, but does not address them in
compute-bound scenarios, such as batched inference or prompt processing. In
this paper, we address the general quantization problem, where both weights and
activations should be quantized. We show, for the first time, that the majority
of inference computations for large generative models such as LLaMA, OPT, and
Falcon can be performed with both weights and activations being cast to 4 bits,
in a way that leads to practical speedups, while at the same time maintaining
good accuracy. We achieve this via a hybrid quantization strategy called QUIK,
which compresses most of the weights and activations to 4-bit, while keeping
some outlier weights and activations in higher-precision. The key feature of
our scheme is that it is designed with computational efficiency in mind: we
provide GPU kernels matching the QUIK format with highly-efficient layer-wise
runtimes, which lead to practical end-to-end throughput improvements of up to
3.4x relative to FP16 execution. We provide detailed studies for models from
the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate
inference using quantization plus 2:4 sparsity. Code is available at:
https://github.com/IST-DASLab/QUIK.
Comment: 16 page
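The hybrid strategy described above, with most values in 4 bits and a few outliers kept in higher precision, can be sketched on a flat list of weights. This is only an illustration of the general idea: how QUIK actually selects outliers and lays them out for its GPU kernels is not specified in the abstract, so the magnitude-based selection and the function name below are assumptions.

```python
def outlier_aware_quantize(weights, num_outliers, bits=4):
    """Quantize all but the largest-magnitude weights; keep those untouched."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outliers = set(order[:num_outliers])        # assumed rule: largest magnitudes
    inliers = [w for i, w in enumerate(weights) if i not in outliers]

    qmax = (1 << (bits - 1)) - 1
    max_abs = max((abs(w) for w in inliers), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0

    out = []
    for i, w in enumerate(weights):
        if i in outliers:
            out.append(w)                       # kept in full precision
        else:
            q = max(-qmax - 1, min(qmax, round(w / scale)))
            out.append(q * scale)
    return out, scale


weights = [0.1, -0.2, 0.7, 12.0]
no_outliers, _ = outlier_aware_quantize(weights, num_outliers=0)
one_outlier, _ = outlier_aware_quantize(weights, num_outliers=1)
```

Without outlier handling, the single large value 12.0 stretches the 4-bit grid so far that the small weights round to zero. Keeping it aside lets the remaining weights use a much finer grid.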
FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization
Post-training quantization (PTQ) has been gaining popularity for the
deployment of deep neural networks on resource-limited devices since unlike
quantization-aware training, neither a full training dataset nor end-to-end
training is required at all. As PTQ schemes based on reconstructing each layer
or block output turn out to be effective to enhance quantized model
performance, recent works have developed algorithms to devise and learn a new
weight-rounding scheme so as to better reconstruct each layer or block output.
In this work, we propose a simple yet effective new weight-rounding mechanism
for PTQ, coined FlexRound, based on element-wise division instead of typical
element-wise addition such that FlexRound enables jointly learning a common
quantization grid size as well as a different scale for each pre-trained
weight. Thanks to the reciprocal rule of derivatives induced by element-wise
division, FlexRound is inherently able to exploit pre-trained weights when
updating their corresponding scales, and thus, flexibly quantize pre-trained
weights depending on their magnitudes. We empirically validate the efficacy of
FlexRound on a wide range of models and tasks. To the best of our knowledge,
our work is the first to carry out comprehensive experiments on not only image
classification and natural language understanding but also natural language
generation, assuming a per-tensor uniform PTQ setting. Moreover, we
demonstrate, for the first time, that large language models can be efficiently
quantized, with only a negligible impact on performance compared to
half-precision baselines, achieved by reconstructing the output in a
block-by-block manner.
Comment: Accepted to ICML 202
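A forward-pass sketch of rounding based on element-wise division follows. It assumes the form round(w / (s1 * s2)) * s1, with s1 a common grid size and s2 a separate scale for each weight, which is one reading of the abstract and not a statement of the paper's full formulation. The learning of the scales is omitted, and the function name is hypothetical.

```python
def division_based_round(weights, grid_size, elem_scales):
    """Divide each weight by grid_size times its own scale, round, rescale by grid_size."""
    return [grid_size * round(w / (grid_size * s))
            for w, s in zip(weights, elem_scales)]


weights = [0.26, -0.74]
plain = division_based_round(weights, grid_size=0.5, elem_scales=[1.0, 1.0])
flexed = division_based_round(weights, grid_size=0.5, elem_scales=[1.1, 0.9])
```

With all per-weight scales equal to 1 this reduces to ordinary rounding to the nearest grid point, giving `[0.5, -0.5]`. Adjusting the per-weight scales moves individual weights to a different grid point, here `[0.0, -1.0]`, which is the flexibility that learning those scales would exploit.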