A Quantization-Friendly Separable Convolution for MobileNets
As deep learning (DL) is being rapidly pushed to edge computing, researchers
invented various ways to make inference computation more efficient on
mobile/IoT devices, such as network pruning and parameter compression.
Quantization, one of the key approaches, can effectively offload the GPU and
make it possible to deploy DL on a fixed-point pipeline. Unfortunately, not all
existing network designs are friendly to quantization. For example, although the
popular lightweight MobileNetV1 successfully reduces parameter size and
computation latency with separable convolution, our experiment shows that its
quantized models have a large accuracy gap against its floating-point models. To
resolve this, we analyzed the root cause of quantization loss and proposed a
quantization-friendly separable convolution architecture. Evaluated on the
image classification task with the ImageNet2012 dataset, our modified MobileNetV1
model achieves an 8-bit inference top-1 accuracy of 68.03%, almost closing the
gap to the float pipeline.
Comment: Accepted at the 1st Workshop on Energy Efficient Machine Learning and
Cognitive Computing for Embedded Applications (EMC^2 2018)
HAQ: Hardware-Aware Automated Quantization with Mixed Precision
Model quantization is a widely used technique to compress and accelerate deep
neural network (DNN) inference. Emergent DNN hardware accelerators begin to
support mixed precision (1-8 bits) to further improve the computation
efficiency, which raises a great challenge to find the optimal bitwidth for
each layer: it requires domain experts to explore the vast design space trading
off among accuracy, latency, energy, and model size, which is both
time-consuming and sub-optimal. Conventional quantization algorithms ignore the
different hardware architectures and quantize all the layers in a uniform way.
In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ)
framework, which leverages reinforcement learning to automatically determine
the quantization policy, and we take the hardware accelerator's feedback in the
design loop. Rather than relying on proxy signals such as FLOPs and model size,
we employ a hardware simulator to generate direct feedback signals (latency and
energy) to the RL agent. Compared with conventional methods, our framework is
fully automated and can specialize the quantization policy for different neural
network architectures and hardware architectures. Our framework effectively
reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with
negligible loss of accuracy compared with the fixed bitwidth (8 bits)
quantization. Our framework reveals that the optimal policies on different
hardware architectures (i.e., edge and cloud architectures) under different
resource constraints (i.e., latency, energy and model size) are drastically
different. We interpreted the implication of different quantization policies,
which offer insights for both neural network architecture design and hardware
architecture design.
Comment: CVPR 2019. The first three authors contributed equally to this work.
Project page: https://hanlab.mit.edu/projects/haq
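To make the idea of a per-layer bitwidth policy concrete, the following is a minimal sketch and not the HAQ method itself: it contains no reinforcement learning agent and no hardware simulator. It only compares a uniform 8-bit policy with a hand-picked mixed-precision policy on made-up weights, using model size and quantization error as stand-in metrics. The layer names, weight values, policies, and helper functions are all hypothetical, and the symmetric quantizer here is assumed to handle 2 to 8 bits only.

```python
# Illustrative only: hypothetical layers, weights, and bitwidth policies.
# This is NOT the HAQ search; it just evaluates two fixed policies.

def quantize_symmetric(weights, num_bits):
    """Uniformly quantize to signed num_bits levels and map back to floats."""
    if num_bits < 2:
        raise ValueError("this sketch supports 2 or more bits only")
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    return [scale * max(-qmax, min(qmax, round(w / scale))) for w in weights]

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def model_size_bits(layers, policy):
    """Total weight storage in bits when each layer uses its own bitwidth."""
    return sum(len(w) * policy[name] for name, w in layers.items())

def policy_errors(layers, policy):
    """Per-layer mean squared quantization error under a policy."""
    return {
        name: mean_squared_error(w, quantize_symmetric(w, policy[name]))
        for name, w in layers.items()
    }

layers = {
    "conv1": [0.9, -0.7, 0.3, -0.1],
    "fc": [0.05, -0.02, 0.04, 0.01, -0.03, 0.02],
}
uniform_policy = {"conv1": 8, "fc": 8}
mixed_policy = {"conv1": 6, "fc": 4}

for label, policy in (("uniform", uniform_policy), ("mixed", mixed_policy)):
    print(label, model_size_bits(layers, policy), policy_errors(layers, policy))
```

In this toy setting the mixed policy stores fewer bits at the cost of a larger quantization error in the layers that were given fewer bits, which is the kind of trade-off a bitwidth policy has to balance.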
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
The rising popularity of intelligent mobile devices and the daunting
computational cost of deep learning-based models call for efficient and
accurate on-device inference schemes. We propose a quantization scheme that
allows inference to be carried out using integer-only arithmetic, which can be
implemented more efficiently than floating point inference on commonly
available integer-only hardware. We also co-design a training procedure to
preserve end-to-end model accuracy post quantization. As a result, the proposed
quantization scheme improves the tradeoff between accuracy and on-device
latency. The improvements are significant even on MobileNets, a model family
known for run-time efficiency, and are demonstrated in ImageNet classification
and COCO detection on popular CPUs.
Comment: 14 pages, 12 figures
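As a rough illustration of what quantizing real values to 8-bit integers involves, here is a minimal sketch of a generic uniform affine (asymmetric) quantizer. The abstract does not give the scheme's equations, so the mapping used here (real value approximately equal to scale times the integer minus a zero point) is an assumption about the general family of techniques, and the function names are hypothetical.

```python
# Illustrative only: a generic uniform affine 8-bit quantizer.
# Function names and the example weights are hypothetical.

def choose_qparams(values, num_bits=8):
    """Pick a scale and integer zero point covering the range of values."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so real zero is exactly representable.
    lo = min(0.0, min(values))
    hi = max(0.0, max(values))
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize(values, scale, zero_point, num_bits=8):
    """Map real values to unsigned integers, clamping to the valid range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    return [
        max(qmin, min(qmax, int(round(v / scale)) + zero_point))
        for v in values
    ]

def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate real values."""
    return [scale * (q - zero_point) for q in q_values]

weights = [-1.0, -0.25, 0.0, 0.5, 1.5]
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(scale, zero_point, q, restored)
```

For values inside the covered range, the round-trip error of this sketch stays within about half a quantization step, and real zero maps back to exactly zero because the zero point is an integer.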
Low-Power Computer Vision: Improve the Efficiency of Artificial Intelligence
Energy efficiency is critical for running computer vision on battery-powered systems, such as mobile phones or UAVs (unmanned aerial vehicles, or drones). This book collects the methods that have won the annual IEEE Low-Power Computer Vision Challenges since 2015. The winners share their solutions and provide insight on how to improve the efficiency of machine learning systems
- …