QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
Large Language Models (LLMs) from the GPT family have become extremely
popular, leading to a race towards reducing their inference costs to allow for
efficient local computation. Yet, the vast majority of existing work focuses on
weight-only quantization, which can reduce runtime costs in the memory-bound
one-token-at-a-time generative setting, but does not address them in
compute-bound scenarios, such as batched inference or prompt processing. In
this paper, we address the general quantization problem, where both weights and
activations should be quantized. We show, for the first time, that the majority
of inference computations for large generative models such as LLaMA, OPT, and
Falcon can be performed with both weights and activations being cast to 4 bits,
in a way that leads to practical speedups, while at the same time maintaining
good accuracy. We achieve this via a hybrid quantization strategy called QUIK,
which compresses most of the weights and activations to 4 bits, while keeping
some outlier weights and activations in higher precision. The key feature of
our scheme is that it is designed with computational efficiency in mind: we
provide GPU kernels matching the QUIK format with highly efficient layer-wise
runtimes, which lead to practical end-to-end throughput improvements of up to
3.4x relative to FP16 execution. We provide detailed studies for models from
the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate
inference using quantization plus 2:4 sparsity. Code is available at:
https://github.com/IST-DASLab/QUIK
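To make the hybrid scheme concrete, here is a minimal numpy sketch of the weight-side outlier split; QUIK applies the analogous treatment to activations, and the fixed outlier-column count, the magnitude-based selection rule, and the symmetric per-column 4-bit quantizer here are illustrative assumptions, not the authors' GPU kernels:

```python
# Minimal sketch of outlier-aware 4-bit quantization in the spirit of QUIK
# (illustrative only). Columns of w with the largest magnitudes stay in full
# precision; the rest are quantized to 4-bit symmetric codes per column.
import numpy as np

def quantize_int4(w):
    """Symmetric per-column 4-bit quantization: int codes in [-8, 7] plus scales."""
    scale = np.abs(w).max(axis=0) / 7.0 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def quik_style_matmul(x, w, num_outlier_cols=8):
    """y = x @ w with most columns of w in int4 and a few outlier columns in float."""
    col_mag = np.abs(w).max(axis=0)
    outlier = np.zeros(w.shape[1], dtype=bool)
    outlier[np.argsort(col_mag)[-num_outlier_cols:]] = True  # largest-magnitude columns

    q, scale = quantize_int4(w[:, ~outlier])                 # bulk path: 4-bit
    y = np.empty((x.shape[0], w.shape[1]), dtype=np.float32)
    y[:, ~outlier] = (x @ q.astype(np.float32)) * scale      # dequantize after matmul
    y[:, outlier] = x @ w[:, outlier]                        # outlier path: full precision
    return y

x = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(64, 128).astype(np.float32)
print(np.abs(quik_style_matmul(x, w) - x @ w).max())  # small residual error
```

The point of the split is that the dense 4-bit path covers almost all of the multiply-accumulates, so the few full-precision outlier columns cost little while protecting accuracy.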
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
The rising popularity of intelligent mobile devices and the daunting
computational cost of deep learning-based models call for efficient and
accurate on-device inference schemes. We propose a quantization scheme that
allows inference to be carried out using integer-only arithmetic, which can be
implemented more efficiently than floating point inference on commonly
available integer-only hardware. We also co-design a training procedure to
preserve end-to-end model accuracy post quantization. As a result, the proposed
quantization scheme improves the tradeoff between accuracy and on-device
latency. The improvements are significant even on MobileNets, a model family
known for run-time efficiency, and are demonstrated in ImageNet classification
and COCO detection on popular CPUs.
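The integer-only scheme this paper describes maps a real value r to an integer code q through an affine rule r = S(q - Z), so a matrix multiply can run entirely on the integer codes and be rescaled once at the end. Below is a minimal numpy sketch of that rule; note the paper performs the final rescaling with fixed-point integer arithmetic, whereas this sketch uses a float multiplier for readability:

```python
# Minimal sketch of affine quantization r = S * (q - Z) with an
# integer-accumulated matmul (illustrative, not the paper's implementation).
import numpy as np

def affine_params(r_min, r_max, num_bits=8):
    """Pick scale S and zero-point Z so [r_min, r_max] maps onto [0, 2^b - 1].
    The range must contain 0 so that real zero is exactly representable."""
    qmax = 2 ** num_bits - 1
    scale = (r_max - r_min) / qmax
    zero_point = int(round(-r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, num_bits=8):
    q = np.round(r / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def int_only_matmul(qx, qw, sx, zx, sw, zw, sy, zy):
    """Integer-only y = x @ w: accumulate in int32, requantize once to uint8."""
    acc = (qx.astype(np.int32) - zx) @ (qw.astype(np.int32) - zw)
    # On integer-only hardware, sx*sw/sy is a precomputed fixed-point
    # multiplier; a float is used here purely for readability.
    q = np.round(acc * (sx * sw / sy)) + zy
    return np.clip(q, 0, 255).astype(np.uint8)

x = np.random.rand(2, 8).astype(np.float32)
w = np.random.rand(8, 4).astype(np.float32)
sx, zx = affine_params(0.0, 1.0)
sw, zw = affine_params(0.0, 1.0)
y_ref = x @ w
sy, zy = affine_params(0.0, float(y_ref.max()))
qy = int_only_matmul(quantize(x, sx, zx), quantize(w, sw, zw), sx, zx, sw, zw, sy, zy)
print(np.abs((qy.astype(np.float32) - zy) * sy - y_ref).max())  # close to float result
```

Because only integer codes flow through the accumulation, the scheme matches integer-only hardware; the co-designed training the abstract mentions simulates this quantization in the forward pass so the trained weights stay accurate under it.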
A Quantization-Friendly Separable Convolution for MobileNets
As deep learning (DL) is rapidly pushed toward edge computing, researchers
have invented various ways to make inference computation more efficient on
mobile/IoT devices, such as network pruning and parameter compression.
Quantization, one of the key approaches, can effectively offload the GPU and
makes it possible to deploy DL on a fixed-point pipeline. Unfortunately, not
all existing network designs are friendly to quantization. For example, while
the popular lightweight MobileNetV1 successfully reduces parameter size and
computation latency through separable convolutions, our experiments show that
its quantized models suffer a large accuracy gap relative to their
floating-point counterparts. To resolve this, we analyze the root cause of the
quantization loss and propose a quantization-friendly separable convolution
architecture. Evaluated on the ImageNet2012 image classification task, our
modified MobileNetV1 model achieves an 8-bit inference top-1 accuracy of
68.03%, almost closing the gap to the floating-point pipeline.
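To illustrate one way such a gap can arise (a simplified toy, not this paper's exact analysis): depthwise weights often have very different ranges across channels, so a single per-tensor 8-bit scale leaves the small-range channels with almost no resolution. The channel-range spread below is synthetic:

```python
# Simplified illustration: depthwise-style weights whose per-channel ranges
# span three orders of magnitude lose precision under one per-tensor 8-bit
# scale, while per-channel scales preserve it.
import numpy as np

rng = np.random.default_rng(0)
# 32 depthwise channels of 3x3 weights with widely varying magnitudes.
w = rng.standard_normal((32, 3, 3)) * np.logspace(-2, 1, 32)[:, None, None]

def per_channel_rel_error(w, scale):
    """Relative RMS quantization error of each channel under the given scale(s)."""
    q = np.clip(np.round(w / scale), -128, 127)
    err = q * scale - w
    return np.sqrt((err ** 2).mean(axis=(1, 2)) / (w ** 2).mean(axis=(1, 2)))

per_tensor = np.abs(w).max() / 127.0                            # one scale for the layer
per_channel = np.abs(w).max(axis=(1, 2), keepdims=True) / 127.0  # one scale per channel

print(per_channel_rel_error(w, per_tensor).max())   # near 1.0: small channels wiped out
print(per_channel_rel_error(w, per_channel).max())  # small: each channel keeps resolution
```

A quantization-friendly architecture in this sense is one that avoids producing such extreme per-channel range disparities in the first place, so that a simple fixed-point pipeline quantizes it well.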