Vision Transformers (ViTs) have achieved state-of-the-art performance in
various computer vision applications. These models, however, have considerable
storage and computational overheads, making their deployment and efficient
inference on edge devices challenging. Quantization is a promising approach to
reducing model complexity; unfortunately, existing efforts to quantize ViTs
adopt simulated quantization (aka fake quantization), which retains
floating-point arithmetic during inference and thus contributes little to
model acceleration.
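For contrast, here is a minimal sketch of what simulated (fake) quantization does; the symmetric per-tensor scheme, the `fake_quant` helper, and the scale value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def fake_quant(x, scale, bits=8):
    """Simulated (fake) quantization: snap values to the INT8 grid but
    immediately dequantize, so every downstream op still runs in FP."""
    qmax = (1 << (bits - 1)) - 1                   # 127 for INT8
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                               # back to floating point

x = np.random.randn(4).astype(np.float32)
print(fake_quant(x, scale=0.05))   # hypothetical scale, for illustration
```

Because the returned tensor is floating point, such schemes model quantization error for training and evaluation but leave the inference arithmetic itself unchanged.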
In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs,
which enables the entire computational graph of inference to be performed with
integer arithmetic and bit-shifting, without any floating-point operations.
In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only
pipeline with dyadic arithmetic, and non-linear operations (e.g., Softmax,
GELU, and LayerNorm) are approximated by the proposed lightweight integer-only
arithmetic methods.
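As an illustration of the dyadic pipeline, the sketch below rescales an INT32 accumulator to INT8 using only an integer multiply and a bit-shift; the helper names and the quantization scales are hypothetical, and the rounding details differ from the paper's implementation.

```python
import numpy as np

def to_dyadic(m, bits=15):
    """Approximate a positive FP rescale factor m as b / 2**c with integer
    b and c (a dyadic number); computed once offline, not at inference."""
    c = bits - int(np.floor(np.log2(m))) - 1
    b = int(round(m * (1 << c)))
    return b, c

def requantize(acc, b, c):
    """INT32 accumulator -> INT8 output via integer multiply and bit-shift."""
    out = (acc.astype(np.int64) * b) >> c          # dyadic rescaling
    return np.clip(out, -128, 127).astype(np.int8)

# hypothetical input/weight/output quantization scales
b, c = to_dyadic(0.02 * 0.01 / 0.05)
acc = np.array([1234, -5678, 20000], dtype=np.int32)  # INT32 MatMul results
print(requantize(acc, b, c))                          # [4 -23 79]
```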
In particular, I-ViT applies the proposed Shiftmax and ShiftGELU, which are
designed to use integer bit-shifting to approximate the corresponding
floating-point operations.
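To give a feel for the bit-shifting idea behind Shiftmax, the sketch below approximates exp(x) for non-positive fixed-point inputs via exp(x) = 2^(x·log2 e), with x·log2 e ≈ x + (x >> 1) − (x >> 4); treat it as a simplified Q16 illustration of the technique rather than the paper's exact kernel.

```python
import numpy as np

def shift_exp(x, frac_bits=16):
    """exp(x) for fixed-point x <= 0, using only integer adds and shifts.
    exp(x) = 2**(x*log2(e)); log2(e) ~ 1 + 1/2 - 1/16 = 1.4375."""
    one = 1 << frac_bits
    p = x + (x >> 1) - (x >> 4)       # p ~ x * log2(e)
    q = (-p) >> frac_bits             # integer part of -p
    r = p + (q << frac_bits)          # fractional remainder, in (-1, 0]
    return (one + (r >> 1)) >> q      # 2**r ~ 1 + r/2, then scale by 2**-q

def shift_softmax(scores, frac_bits=16):
    """Integer-only softmax sketch: shift-based exp + integer division."""
    s = np.asarray(scores, dtype=np.int64)
    s = s - s.max()                                   # keep inputs <= 0
    e = np.array([shift_exp(int(v), frac_bits) for v in s], dtype=np.int64)
    return (e << frac_bits) // e.sum()                # Q16 probabilities

x = (np.array([1.0, 2.0, 3.0]) * (1 << 16)).astype(np.int64)
print(shift_softmax(x) / 2.0**16)   # ~[0.09, 0.26, 0.65] vs FP softmax
```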
We evaluate I-ViT on various benchmark models, and the results show that
integer-only INT8 quantization achieves accuracy comparable to (or even higher
than) the full-precision (FP) baseline.
Furthermore, we utilize TVM for practical hardware deployment on the GPU's
integer arithmetic units, achieving a 3.72~4.11× inference speedup
compared to the FP model.