MRQ: Support Multiple Quantization Schemes through Model Re-Quantization
Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU,
DPU), deploying deep learning models on edge devices with fixed-point hardware
is still challenging due to complex model quantization and conversion. Existing
model quantization frameworks like TensorFlow QAT [1], TFLite PTQ [2], and
Qualcomm AIMET [3] support only a limited set of quantization schemes (e.g.,
only asymmetric per-tensor quantization in TF1.x QAT [4]). Accordingly, deep
learning models cannot be easily quantized for diverse fixed-point hardware,
mainly because each target has slightly different quantization requirements. In this paper, we
envision a new type of model quantization approach called MRQ (model
re-quantization), which takes existing quantized models and quickly transforms
the models to meet different quantization requirements (e.g., asymmetric ->
symmetric, non-power-of-2 scale -> power-of-2 scale). Re-quantization is much
simpler than quantizing from scratch because it avoids costly re-training and
provides support for multiple quantization schemes simultaneously. To minimize
re-quantization error, we developed a new set of re-quantization algorithms
including weight correction and rounding error folding. We have demonstrated
that a MobileNetV2 QAT model [7] can be quickly re-quantized into two different
quantization schemes (i.e., symmetric and symmetric+power-of-2 scale) with less
than 0.64 units of accuracy loss. We believe our work is the first to leverage
this concept of re-quantization for model quantization, and models obtained from
the re-quantization process have been successfully deployed on the NNA in Echo
Show devices.
Comment: 8 pages, 6 figures, 3 tables, TinyML Conferenc
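
To make the two scheme conversions named in the abstract concrete (asymmetric -> symmetric, non-power-of-2 scale -> power-of-2 scale), the following is a minimal sketch of a naive re-quantization baseline: dequantize the existing integer values, then quantize them again under the new constraints. It is not the paper's method; the weight correction and rounding error folding algorithms are not described in the abstract and are not reproduced here. The function names, the 8-bit width, and the choice to round the scale up to the next power of two are all assumptions made for illustration.

```python
import math

def dequantize(q, scale, zero_point):
    """Map integer values back to real values: r = scale * (q - zero_point)."""
    return [scale * (v - zero_point) for v in q]

def requantize_symmetric(q, scale, zero_point, power_of_two=False, bits=8):
    """Naive re-quantization to a symmetric scheme (zero point fixed at 0).

    Hypothetical helper for illustration only. If power_of_two is True, the
    new scale is rounded UP to the next power of two so that no value clips.
    """
    real = dequantize(q, scale, zero_point)
    qmax = 2 ** (bits - 1) - 1               # 127 for 8 bits
    max_abs = max(abs(r) for r in real) or 1.0  # avoid a zero scale
    new_scale = max_abs / qmax
    if power_of_two:
        new_scale = 2.0 ** math.ceil(math.log2(new_scale))
    new_q = [max(-qmax, min(qmax, round(r / new_scale))) for r in real]
    return new_q, new_scale

def max_requant_error(q, scale, zero_point, new_q, new_scale):
    """Largest absolute difference between the old and new dequantized values."""
    old = dequantize(q, scale, zero_point)
    new = dequantize(new_q, new_scale, 0)
    return max(abs(a - b) for a, b in zip(old, new))

# Example: an asymmetric uint8 tensor with scale 0.02 and zero point 128.
q = [0, 64, 128, 255]
sym_q, sym_scale = requantize_symmetric(q, 0.02, 128)
pot_q, pot_scale = requantize_symmetric(q, 0.02, 128, power_of_two=True)
sym_err = max_requant_error(q, 0.02, 128, sym_q, sym_scale)
pot_err = max_requant_error(q, 0.02, 128, pot_q, pot_scale)
```

In this baseline the per-value error is bounded by half of the new scale, so forcing a power-of-2 scale (which can only enlarge the scale here) tends to increase the error. That gap is the kind of re-quantization error the abstract says its algorithms are designed to minimize.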