Weight oscillation is an undesirable side effect of quantization-aware
training, in which quantized weights frequently jump between two quantized
levels, resulting in training instability and a sub-optimal final model. We
discover that the learnable scaling factor, a widely used de facto
setting in quantization, aggravates weight oscillation. In this study, we
investigate the connection between the learnable scaling factor and quantized
weight oscillation and use ViT as a case study to illustrate our findings and
remedies. We also find that the interdependence between the quantized
weights of the query and key in a self-attention layer makes
ViT vulnerable to oscillation. We therefore propose three techniques
accordingly: statistical weight quantization (StatsQ) to improve
quantization robustness compared to the prevalent learnable-scale-based method;
confidence-guided annealing (CGA) that freezes the weights with
high confidence and calms the oscillating weights; and
query-key reparameterization (QKR) to resolve the
query-key intertwined oscillation and mitigate the resulting gradient
misestimation. Extensive experiments demonstrate that the proposed techniques
successfully abate weight oscillation and consistently yield substantial
accuracy improvements on ImageNet. Specifically, our 2-bit DeiT-T and DeiT-S
models outperform the previous state-of-the-art by 9.8% and 7.7%,
respectively. Code and models are available at: https://github.com/nbasyl/OFQ.

Comment: Proceedings of the 40th International Conference on Machine Learning,
Honolulu, Hawaii, USA. PMLR 202, 2023.
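
To make the contrast with a learnable scale concrete, below is a minimal PyTorch sketch of a statistics-derived weight quantizer in the spirit of StatsQ. The specific statistic (an LSQ-style mean-absolute-value scale), the function and variable names, and the straight-through estimator shown here are illustrative assumptions rather than the paper's exact formulation; see the repository linked above for the actual implementation.

```python
import torch

def statsq_weight_quant(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Symmetric uniform weight quantizer with a statistics-derived scale.

    Illustrative only: the scale below uses a common statistics-based
    heuristic (mean absolute weight, as in LSQ-style initialization); the
    exact statistic used by StatsQ may differ.
    """
    qmax = 2 ** (bits - 1) - 1      # e.g. 1 for 2-bit quantization
    qmin = -(2 ** (bits - 1))       # e.g. -2 for 2-bit quantization
    # The scale tracks the current weight distribution instead of being a
    # separately learned parameter, removing the scale/weight tug-of-war
    # that encourages oscillation.
    scale = 2.0 * w.abs().mean() / (qmax ** 0.5) if qmax > 0 else w.abs().mean()
    w_int = torch.clamp(torch.round(w / scale), qmin, qmax)
    w_q = w_int * scale
    # Straight-through estimator: quantized values in the forward pass,
    # identity gradient to the latent weights in the backward pass.
    return w + (w_q - w).detach()

# Example: quantize a (hypothetical) DeiT linear weight to 2 bits on the fly.
weight = torch.randn(384, 384, requires_grad=True)
w_q = statsq_weight_quant(weight, bits=2)
loss = w_q.pow(2).mean()
loss.backward()  # gradients reach `weight` through the straight-through path
```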