Hardware-Centric AutoML for Mixed-Precision Quantization by Wang, Kuan et al.
International Journal of Computer Vision
https://doi.org/10.1007/s11263-020-01339-6
Hardware-Centric AutoML for Mixed-Precision Quantization
Kuan Wang∗ · Zhijian Liu∗ · Yujun Lin∗ · Ji Lin · Song Han
Abstract Model quantization is a widely used technique to
compress and accelerate deep neural network (DNN) infer-
ence. Emergent DNN hardware accelerators begin to sup-
port mixed precision (1-8 bits) to further improve the com-
putation efficiency, which raises a great challenge to find the
optimal bitwidth for each layer: it requires domain experts to
explore the vast design space trading off accuracy, latency,
energy, and model size, which is both time-consuming and
usually sub-optimal. There are plenty of specialized hard-
ware accelerators for neural networks, but little research has
been done to design specialized neural networks optimized
for a particular hardware accelerator. The latter is demand-
ing given the much longer design cycle of silicon than neural
nets. Conventional quantization algorithms ignore the dif-
ferent hardware architectures and quantize all the layers in
a uniform way. In this paper, we introduce the Hardware-
Aware Automated Quantization (HAQ) framework that au-
tomatically determines the quantization policy, and we take
the hardware accelerator’s feedback in the design loop. Rather
than relying on proxy signals such as FLOPs and model
Kuan Wang∗
Massachusetts Institute of Technology
E-mail: kuanwang@mit.edu
Zhijian Liu∗
Massachusetts Institute of Technology
E-mail: zhijian@mit.edu
Yujun Lin∗
Massachusetts Institute of Technology
E-mail: yujunlin@mit.edu
Ji Lin
Massachusetts Institute of Technology
E-mail: jilin@mit.edu
Song Han
Massachusetts Institute of Technology
E-mail: songhan@mit.edu
∗ indicates equal contributions.
[Figure 1 diagram: a DDPG agent (actor and critic) receives the state and the reward (latency) and outputs an action, a per-layer bitwidth pair (e.g., layer t-1: 6b/7b, layer t: 4b/6b, layer t+1: 5b/6b; 5 bit weight, 6 bit activation). The quantized CNN is mapped onto a bit-serial hardware accelerator built from processing elements (PEs) that process operands from Cycle 0 (MSB) to Cycle T (LSB).]
Fig. 1: Hardware-centric (right) automated ML (left) for
mixed-precision quantization (middle).
size, we employ a hardware simulator to generate the direct
feedback signals to the RL agent. Compared with conven-
tional methods, our framework is fully automated and can
provide specialized quantization policies for different neural
network and hardware architectures. The learned policy can
transfer very well between different neural net architectures.
Our framework effectively reduced the latency by 1.4-1.95×
and the energy consumption by 1.9× with negligible loss of
accuracy compared with the fixed bitwidth (8 bits) quanti-
zation. Our framework reveals that the optimal policies on
different hardware architectures (i.e., edge and cloud archi-
tectures) under different resource constraints (i.e., latency,
energy, and model size) are drastically different. We interpreted
the implications of different quantization policies, which
offer insights for both neural network architecture design
and hardware architecture design.
Keywords Model Quantization · Mixed-Precision ·
Automated ML · Hardware
1 Introduction
In many real-time machine learning applications (such as
robotics, autonomous driving, and mobile VR/AR), deep neu-
arXiv:2008.04878v1 [cs.CV] 11 Aug 2020
[Figure 2 plot: top-1 accuracy (%) versus latency (ms) for MobileNets with fixed 8-bit quantization and with our flexible-bit quantization; the legend also indicates model size (1MB, 2MB, 3MB).]
Fig. 2: We need different numbers of bits for different layers.
We quantize MobileNets (Howard et al 2017) to different
numbers of bits (both weights and activations), and the result
lies on a better Pareto curve (yellow) than fixed-bit
quantization (blue). This is because different layers have different
redundancy and have different operation intensities (opera-
tions per byte) on the hardware, which advocates for using
flexible bitwidths for different layers.
ral networks is strictly constrained by the latency, energy,
and model size. In order to improve the hardware efficiency,
many researchers have proposed to directly design efficient
models (Sandler et al 2018; Howard et al 2017; Cai et al
2019) or to quantize the weights and activations to low pre-
cision (Han et al 2016; Zhu et al 2017).
Conventional quantization methods use the same num-
ber of bits for all layers (Choi et al 2018; Jacob et al 2018),
but as different layers have different redundancy and be-
have differently on the hardware (computation bounded or
memory bounded), it is necessary to use flexible bitwidths
for different layers (as shown in Figure 2). This flexibility
was not supported by chip vendors until recently, when
hardware manufacturers started to implement this
feature: Apple released the A12 Bionic chip that supports
flexible bits for the neural network inference (Apple 2018);
NVIDIA recently introduced the Turing GPU architecture
that supports 1-bit, 4-bit, 8-bit and 16-bit arithmetic opera-
tions (Nvidia 2018); Imagination launched a flexible neural
network IP that supports per-layer bitwidth adjustment for
both weights and activations (Imagination 2018). Beyond
industry, academia has also recently worked on bit-level
flexible hardware design: BISMO (Umuroglu et al 2018)
proposed the bit-serial multiplier to support multiplications of
1 to 8 bits; BitFusion (Sharma et al 2018) supports multipli-
cations of 2, 4, 8 and 16 bits in a spatial manner.
A very important missing part is, however, how to de-
termine the bitwidth of both weights and activations for
each layer on different hardware accelerators. This is a
vast design space: with M different neural network models,
each with N layers, on H different hardware platforms, there
                          Inference latency on
                          HW1        HW2        HW3
Best Q. policy for HW1    16.29 ms   85.24 ms   117.44 ms
Best Q. policy for HW2    19.95 ms   64.29 ms   108.64 ms
Best Q. policy for HW3    19.94 ms   66.15 ms   99.68 ms
Table 1: Inference latency of MobileNet-V1 (Howard et al
2017) on three hardware architectures under different quan-
tization policies. The quantization policy that is optimized
for one hardware is not optimal for the other. This sug-
gests we need a specialized quantization solution for differ-
ent hardware architectures. (HW1: BitFusion (Sharma et al
2018), HW2: BISMO (Umuroglu et al 2018) edge accelera-
tor, HW3: BISMO cloud accelerator, batch = 16).
are in total $O(H \times M \times 8^{2N})$ possible solutions (here, we
assume that the bitwidth is between 1 and 8 for both weights and
activations). For a widely used ResNet-50 (He et al 2016)
model, the size of the search space is about $8^{100}$, which is
even larger than the number of particles in the universe.
Conventional methods require domain experts (with knowledge
of both machine learning and hardware architecture) to ex-
plore the huge design space smartly with rule-based heuris-
tics. For instance, we should retain more bits in the first layer
which extracts low level features and in the last layer which
computes the final outputs; also, we should use more bits
in the convolution layers than in the fully-connected layers
because empirically, the convolution layers are more sensi-
tive. As the neural network becomes deeper, the exploration
space increases exponentially, which makes it infeasible to
rely on hand-crafted strategies. Therefore, these rule-based
quantization policies are usually sub-optimal, and they can-
not generalize well from one model to another. In this pa-
per, we would like to automate this exploration process by a
learning-based framework.
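
The size of this search space can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `search_space_size` is hypothetical, and ResNet-50 is treated as having 50 quantizable layers, each with one weight bitwidth and one activation bitwidth chosen from 1 to 8.

```python
import math

def search_space_size(num_layers: int, min_bits: int = 1, max_bits: int = 8) -> int:
    """Number of per-layer (weight, activation) bitwidth assignments for one model."""
    choices = max_bits - min_bits + 1       # 8 candidate bitwidths
    return choices ** (2 * num_layers)      # one weight and one activation choice per layer

size = search_space_size(num_layers=50)     # assumption: 50 quantizable layers
print(f"log10 of the search space size: {math.log10(size):.1f}")
```

The result is on the order of $10^{90}$ for a single model on a single hardware platform, which is why exhaustive or purely manual exploration is impractical.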
Another challenge is how to measure the latency and
the energy consumption of a given model on the hardware.
A widely adopted approach (Howard et al 2017; Sandler
et al 2018) is to rely on some proxy signals (e.g., FLOPs,
number of memory references). However, as different hard-
ware behaves very differently, the performance of a model
on the hardware cannot always be accurately reflected by
these proxy signals. Therefore, it is of great importance to
directly involve the hardware architecture into the loop. Also,
as demonstrated in Table 1, the quantization solution opti-
mized on one hardware might not be optimal on the other,
which raises the demand for specialized policies for differ-
ent hardware architectures.
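
The claim drawn from Table 1 can be verified directly from its entries. The following sketch copies the latencies from Table 1 and, for each hardware, looks up which policy gives the lowest latency; the dictionary layout and the helper name `best_policy_per_hardware` are illustrative choices, not part of the framework.

```python
# Latencies in ms copied from Table 1: rows are policies, columns are hardware.
latency_ms = {
    "policy for HW1": {"HW1": 16.29, "HW2": 85.24, "HW3": 117.44},
    "policy for HW2": {"HW1": 19.95, "HW2": 64.29, "HW3": 108.64},
    "policy for HW3": {"HW1": 19.94, "HW2": 66.15, "HW3": 99.68},
}

def best_policy_per_hardware(table):
    """Return, for each hardware, the policy with the lowest latency on it."""
    hardware = list(next(iter(table.values())).keys())
    return {hw: min(table, key=lambda policy: table[policy][hw]) for hw in hardware}

best = best_policy_per_hardware(latency_ms)
for hw, policy in best.items():
    print(hw, "->", policy)
```

In every column the lowest latency belongs to the policy that was optimized for that hardware, which is the observation that motivates specialized policies.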
To this end, we propose the Hardware-Aware Automated
Quantization (HAQ) framework that leverages reinforcement
learning to automatically predict the quantization policy given
the hardware’s feedback. The RL agent decides the bitwidth
of a given neural network in a layer-wise manner. For each
layer, the agent receives the layer configuration and statis-
tics as observation, and it then outputs the action which is
the bitwidth of weights and activations. We then leverage the
hardware accelerator as the environment to obtain the direct
feedback from hardware to guide the RL agent to satisfy the
resource constraints. After all layers are quantized, we fine-
tune the quantized model for one more epoch, and feed the
validation accuracy after short-term retraining as the reward
signal to our RL agent. During the exploration, we lever-
age the deep deterministic policy gradient (DDPG) (Lilli-
crap et al 2016) to supervise our RL agent. We studied the
quantization policy on multiple hardware architectures: both
cloud and edge neural network accelerators, with spatial or
temporal multi-precision design.
The contribution of this paper has four aspects:
1. Automation: We propose an automated framework for
quantization, which does not require domain experts and
rule-based heuristics. It frees the human labor from ex-
ploring the vast search space of choosing bitwidths.
2. Hardware-Aware: Our framework integrates the hard-
ware architecture into the loop so that it can directly re-
duce the latency, energy and storage on the target hard-
ware instead of relying on some proxy signals.
3. Specialization: For different hardware architectures, our
framework can offer a specialized quantization policy
that’s exactly tailored for the hardware architecture.
4. Design Insights: We interpreted the different quantization
policies learned for different hardware architectures.
Taking both computation and memory access into ac-
count, the interpretation offers insights on both neural
network architecture and hardware architecture design.
2 Related Work
Quantization. There have been extensive explorations on
compressing and accelerating deep neural networks using
quantization. Han et al (2016) quantized the network weights
to reduce the model size by rule-based strategies: e.g., they
used human heuristics to determine the bitwidths for convo-
lution and fully-connected layers. Courbariaux et al (2016)
binarized the network weights into {−1,+1}; Rastegari et al
(2016) and Zhou et al (2018) binarized each convolution fil-
ter into {−w,+w}; Zhu et al (2017) mapped the network
weights into {−wN,0,+wP} using two bits; Zhou et al (2016)
used one bit for network weights and two bits for activa-
tions; Jacob et al (2018) made use of 8-bit integers for both
weights and activations. We refer the reader to the survey pa-
per by Krishnamoorthi (2018) for a more detailed overview.
These conventional quantization methods either simply as-
sign the same number of bits to all layers or require do-
main experts to determine the bitwidths for different layers,
while our framework automates this design process, and our
learning-based policy outperforms rule-based strategies.
Automated ML. Many researchers aimed to improve the
performance of deep neural networks by searching the net-
work architectures: Zoph and Le (2017) proposed the Neu-
ral Architecture Search (NAS) to explore and design the
transformable network building blocks, and their network
architecture outperforms several human designed networks;
Liu et al (2018) introduced the Progressive NAS to accel-
erate the architecture search by 5× using sequential model-
based optimization; Pham et al (2018) introduced the Effi-
cient NAS to speed up the exploration by 1000× using pa-
rameter sharing; Cai et al (2018) introduced the path-level
network transformation to search the tree-structured archi-
tecture space effectively. Motivated by these AutoML frame-
works, He et al (2018) leveraged the reinforcement learn-
ing to automatically prune the convolution channels. Our
framework further explores the automated quantization for
network weights and activations, and it takes the hardware
architectures into consideration.
Efficient Models. To facilitate the efficient deployment, re-
searchers designed hardware-friendly approaches to slim neu-
ral network models. For instance, the coarse-grained chan-
nel pruning methods (He et al 2017; Liu et al 2017) prune
away the entire channel of convolution kernels to achieve
speedup. Recently, researchers have explicitly optimized for
various aspects of hardware properties, including the infer-
ence latency and energy: Yang et al (2016) proposed the
energy-aware pruning to directly optimize the energy con-
sumption of neural networks; Yang et al (2018) reduced the
inference time of neural networks on the mobile devices
through a lookup table. Nevertheless, these methods are still
rule-based and mostly focus on pruning. Our framework
automates the quantization process by taking hardware-specific
metrics as direct rewards in a learning-based method.
3 Approach
An overview of our proposed framework is shown in Figure 3.
We model the quantization task as a reinforcement learning
problem. We use an actor-critic model with a DDPG agent
to produce the action: the bitwidths for each layer. We collect
hardware counters, together with accuracy, as direct rewards to
search for the optimal quantization policy for each layer. We have
three hardware environments that cover edge and cloud, spatial
and temporal architectures for multi-precision accelerators.
The details of the RL formulation are described below.
[Figure 3 diagram: the DDPG agent (actor and critic) receives the state and the reward and outputs the action (e.g., 3 bit weight, 5 bit activation). The resulting policy quantizes the model layer by layer (e.g., Layer 3: 3bit / 5bit, Layer 4: 6bit / 7bit, Layer 5: 4bit / 6bit, Layer 6: 5bit / 6bit). The quantized model is mapped onto the hardware accelerators (panels: BitFusion (On the Edge), BISMO (On the Edge), BISMO (On the Cloud)), whose direct feedback is returned to the agent.]
Fig. 3: An overview of our Hardware-Aware Automated Quantization (HAQ) framework. We leverage the reinforcement
learning to automatically search over the huge quantization design space with hardware in the loop. The agent proposes an
optimal bitwidth allocation policy given the amount of computation resources (i.e., latency, power, and model size). Our RL
agent integrates the hardware accelerator into the exploration loop so that it can obtain the direct feedback from the hardware,
instead of relying on indirect proxy signals.
3.1 Observation (State Space)
Our agent processes the neural network in a layer-wise manner. For each layer, our agent takes two steps: one for weights, and one for activations. In this paper, we introduce a ten-dimensional feature vector $O_k$ as our observation:
If the $k$th layer is a convolution layer, the state $O_k$ is
$$O_k = (k, c_{\mathrm{in}}, c_{\mathrm{out}}, s_{\mathrm{kernel}}, s_{\mathrm{stride}}, s_{\mathrm{feat}}, n_{\mathrm{params}}, i_{\mathrm{dw}}, i_{\mathrm{w/a}}, a_{k-1}), \tag{1}$$
where $k$ is the layer index, $c_{\mathrm{in}}$ is #input channels, $c_{\mathrm{out}}$ is #output channels, $s_{\mathrm{kernel}}$ is kernel size, $s_{\mathrm{stride}}$ is the stride, $s_{\mathrm{feat}}$ is the input feature map size, $n_{\mathrm{params}}$ is #parameters, $i_{\mathrm{dw}}$ is a binary indicator for depthwise convolution, $i_{\mathrm{w/a}}$ is a binary indicator for weight/activation, and $a_{k-1}$ is the action from the last time step.
If the $k$th layer is a fully-connected layer, the state $O_k$ is
$$O_k = (k, h_{\mathrm{in}}, h_{\mathrm{out}}, 1, 0, s_{\mathrm{feat}}, n_{\mathrm{params}}, 0, i_{\mathrm{w/a}}, a_{k-1}), \tag{2}$$
where $k$ is the layer index, $h_{\mathrm{in}}$ is #input hidden units, $h_{\mathrm{out}}$ is #output hidden units, $s_{\mathrm{feat}}$ is the size of input feature vector, $n_{\mathrm{params}}$ is #parameters, $i_{\mathrm{w/a}}$ is a binary indicator for weight/activation, and $a_{k-1}$ is the action from the last step.
For each dimension in the observation vector $O_k$, we normalize it into $[0, 1]$ so that all dimensions are on the same scale.
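
As an illustration of Eqs. (1) and (2), the sketch below builds the ten-dimensional observation for a few layers and normalizes each dimension. Several details are assumptions made for the example: the helper names, the layer shapes, the convention that $i_{\mathrm{w/a}} = 1$ marks the activation step, and the use of min-max scaling across layers (the text only states that each dimension is normalized into $[0, 1]$).

```python
def conv_observation(k, c_in, c_out, kernel, stride, feat, n_params,
                     is_depthwise, is_activation, prev_action):
    """Eq. (1): raw ten-dimensional observation for a convolution layer."""
    return [k, c_in, c_out, kernel, stride, feat, n_params,
            float(is_depthwise), float(is_activation), prev_action]

def fc_observation(k, h_in, h_out, feat, n_params, is_activation, prev_action):
    """Eq. (2): kernel size 1, stride 0, and depthwise indicator 0."""
    return [k, h_in, h_out, 1, 0, feat, n_params,
            0.0, float(is_activation), prev_action]

def normalize(observations):
    """Min-max scale every dimension into [0, 1] across all observations (assumption)."""
    columns = list(zip(*observations))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(obs, lows, highs)]
            for obs in observations]

# Hypothetical layer shapes, used only to exercise the helpers.
raw = [
    conv_observation(0, 3, 32, 3, 2, 224, 864, False, False, 0.0),
    conv_observation(1, 32, 32, 3, 1, 112, 288, True, False, 0.8),
    fc_observation(2, 1024, 1000, 1024, 1024000, False, 0.5),
]
states = normalize(raw)
```

Each entry of `states` is what the actor network would receive as input for one step.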
3.2 Action Space
We use a continuous action space to determine the bitwidth. We do not use a discrete action space because it loses the relative order: e.g., 2-bit quantization is more aggressive than 4-bit and even more than 8-bit. At the $k$th time step, we take the continuous action $a_k$ (which is in the range of $[0, 1]$), and round it into the discrete bitwidth value $b_k$:
$$b_k = \mathrm{round}(b_{\min} - 0.5 + a_k \times (b_{\max} - b_{\min} + 1)), \tag{3}$$
where $b_{\min}$ and $b_{\max}$ denote the min and max bitwidth (in our experiments, we set $b_{\min}$ to 2 and $b_{\max}$ to 8).
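
A minimal sketch of Eq. (3) follows. The offset of 0.5 gives every bitwidth an interval of equal width in $[0, 1]$. The final clamp is an assumption added for the example: at the endpoints $a_k = 0$ and $a_k = 1$ the argument of the rounding lands exactly on a half, so the result depends on the rounding convention, and the clamp keeps it inside $[b_{\min}, b_{\max}]$ either way.

```python
def action_to_bitwidth(action: float, b_min: int = 2, b_max: int = 8) -> int:
    """Map a continuous action in [0, 1] to a discrete bitwidth, following Eq. (3)."""
    raw = round(b_min - 0.5 + action * (b_max - b_min + 1))
    # Assumption: clamp so that the endpoints never leave [b_min, b_max].
    return max(b_min, min(b_max, raw))

for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(a, "->", action_to_bitwidth(a), "bits")
```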
Resource Constraints. In real-world applications, we have
limited computation budgets (i.e., latency, energy, and model
size). We would like to find the quantization policy with the
best performance given the constraint.
We encourage our agent to meet the computation budget
by limiting the action space. After our RL agent gives ac-
tions {ak} to all layers, we measure the amount of resources
that will be used by the quantized model. The feedback is
directly obtained from the hardware accelerator, which we
will discuss in Section 3.3. If the current policy exceeds
our resource budget (on latency, energy or model size), we
will sequentially decrease the bitwidth of each layer until
the constraint is finally satisfied.
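
The constraint step can be sketched as a loop that lowers one layer at a time and re-measures the cost. The function `enforce_budget`, the order in which layers are visited, and the toy cost function (model size as parameters times bits) are assumptions for illustration; in the framework the cost would come from the hardware simulator.

```python
from typing import Callable, List

def enforce_budget(bitwidths: List[int],
                   cost_fn: Callable[[List[int]], float],
                   budget: float,
                   b_min: int = 2) -> List[int]:
    """Decrease bitwidths layer by layer until cost_fn(bits) <= budget."""
    bits = list(bitwidths)
    while cost_fn(bits) > budget:
        changed = False
        for i in range(len(bits)):
            if bits[i] > b_min:
                bits[i] -= 1
                changed = True
                if cost_fn(bits) <= budget:
                    return bits
        if not changed:
            break  # every layer is already at b_min; the budget cannot be met
    return bits

# Toy cost: model size in bits for three layers with hypothetical parameter counts.
params = [100, 200, 300]
model_size = lambda bits: sum(p * b for p, b in zip(params, bits))
constrained = enforce_budget([8, 8, 8], model_size, budget=4000)
```

Because the constraint is enforced on the action itself, the reward does not need a separate penalty term for resource usage.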
3.3 Direct Feedback from Hardware Accelerators
An intuitive choice of feedback for our RL agent would be FLOPs
or the model size. However, as these proxy signals are indirect,
they do not equal the actual performance (i.e., latency, energy
consumption) on the hardware. Cache locality, the number of
kernel calls, and memory bandwidth all matter. Proxy feedback
cannot model these hardware characteristics, and so cannot find
the specialized strategies (see Table 1).
Instead, we use direct latency and energy feedback from
hardware accelerators to optimize the performance. In simu-
lators, the latency is approximated as the sum of the compu-
tation time, the stall caused by the memory access and some
other overheads:
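$$T = T_{\text{computation}} + T_{\text{stall}} + T_{\text{overhead}}, \tag{4}$$
and the energy consumption is modeled as the sum of the
logic circuits and memory:
$$E = E_{\text{memory access per bit}} \times S_{\text{total memory access size}} + P_{\text{dynamic}} \times T_{\text{execution}}. \tag{5}$$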
The direct feedback from the hardware simulator is very
crucial as it enables our RL agent to determine the bitwidth
allocation policy from the subtle differences between differ-
ent layers: e.g., the vanilla convolution has more data reuse
and better locality, while the depthwise convolution (Chol-
let 2017) has less reuse and worse locality, which makes it
memory bounded.
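
Eqs. (4) and (5) translate into a small cost model. The sketch below is not the simulator used in the experiments: the `LayerProfile` container, the units, and the numeric values are hypothetical, and $T_{\text{execution}}$ is assumed to equal the latency $T$ of Eq. (4).

```python
from dataclasses import dataclass

@dataclass
class LayerProfile:
    t_computation: float        # seconds spent computing
    t_stall: float              # seconds stalled on memory access
    t_overhead: float           # seconds of other overheads
    memory_access_bits: float   # total memory traffic in bits

def latency(p: LayerProfile) -> float:
    """Eq. (4): computation time plus memory stall plus overhead."""
    return p.t_computation + p.t_stall + p.t_overhead

def energy(p: LayerProfile, energy_per_bit: float, p_dynamic: float) -> float:
    """Eq. (5): memory access energy plus dynamic power times execution time."""
    return energy_per_bit * p.memory_access_bits + p_dynamic * latency(p)

# Hypothetical numbers for one layer.
layer = LayerProfile(t_computation=1e-3, t_stall=5e-4, t_overhead=1e-4,
                     memory_access_bits=8e6)
t = latency(layer)
e = energy(layer, energy_per_bit=2e-12, p_dynamic=0.5)
```

Lowering the bitwidth of a layer reduces `memory_access_bits` and, on bit-flexible hardware, the computation time, which is how the agent's actions reach the reward.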
3.4 Quantization
We linearly quantize the weights and activations of each
layer using the action $a_k$ given by our RL agent, as a linearly
quantized model only needs fixed-point arithmetic units, which
are more efficient to implement in hardware than k-means
quantization.
Specifically, for each weight value $w$ in the $k$th layer, we first truncate it into the range of $[-c, c]$, and we then quantize it linearly into $a_k$ bits:
$$\mathrm{quantize}(w, a_k, c) = \mathrm{round}(\mathrm{clamp}(w, c)/s) \times s, \tag{6}$$
where $\mathrm{clamp}(\cdot, x)$ truncates the values into $[-x, x]$, and the scaling factor $s$ is defined as
$$s = c / (2^{a_k - 1} - 1). \tag{7}$$
In this paper, we choose the value of $c$ by finding the optimal value $x$ that minimizes the KL-divergence between the original weight distribution $W_k$ and the quantized weight distribution $\mathrm{quantize}(W_k, a_k, x)$:
$$c = \arg\min_{x} D_{\mathrm{KL}}(W_k \,\|\, \mathrm{quantize}(W_k, a_k, x)), \tag{8}$$
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the KL-divergence that characterizes the
distance between two distributions. As for activations, we
quantize the values similarly except that we truncate them
into the range of [0,c], not [−c,c] since the activation values
(which are the outputs of the ReLU layers) are non-negative.
This calibration based on KL-divergence enables us to make
use of the pretrained models rather than training the models
from scratch so that it can reduce the training time signifi-
cantly. As for the overhead, we only use 64 images to cali-
brate once at the beginning, which is negligible compared to
the whole training.
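
The quantizer of Eqs. (6) and (7) and the calibration of Eq. (8) can be sketched in plain Python. The text does not specify how the KL-divergence is estimated, so the histogram-based estimate, the grid of candidate clipping values, and the smoothing constant `eps` are all assumptions of this sketch; the synthetic Gaussian weights stand in for a pretrained layer.

```python
import math
import random

def linear_quantize(w, bits, c):
    """Eq. (6)-(7): clamp w to [-c, c], then snap it to a grid of step s."""
    s = c / (2 ** (bits - 1) - 1)           # requires bits >= 2
    clamped = max(-c, min(c, w))
    return round(clamped / s) * s

def histogram(values, limit, n_bins):
    """Counts over n_bins equal-width bins covering [-limit, limit]."""
    counts = [0] * n_bins
    width = 2 * limit / n_bins
    for v in values:
        idx = min(int((v + limit) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-8):
    """D_KL(P || Q) between two histograms; eps keeps empty Q bins finite."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    kl = 0.0
    for p, q in zip(p_counts, q_counts):
        if p > 0:
            p_prob = p / p_total
            q_prob = max(q / q_total, eps)
            kl += p_prob * math.log(p_prob / q_prob)
    return kl

def calibrate_clip(weights, bits, n_candidates=40, n_bins=64):
    """Eq. (8): pick the candidate c with the smallest estimated KL-divergence."""
    w_max = max(abs(w) for w in weights)
    reference = histogram(weights, w_max, n_bins)
    best_c, best_kl = w_max, float("inf")
    for i in range(1, n_candidates + 1):
        c = w_max * i / n_candidates
        quantized = [linear_quantize(w, bits, c) for w in weights]
        kl = kl_divergence(reference, histogram(quantized, w_max, n_bins))
        if kl < best_kl:
            best_c, best_kl = c, kl
    return best_c

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic stand-in
c = calibrate_clip(weights, bits=4)
```

For activations, the same procedure would apply with the clamp range changed to $[0, c]$, as described above.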
3.5 Reward Signal
After quantization, we retrain the quantized model for one
more epoch to recover the performance. As we impose the
resource constraints by limiting the action space, we define
our reward function R to be only related to the accuracy:
$$R = \lambda \times (\mathrm{accuracy}_{\mathrm{quant}} - \mathrm{accuracy}_{\mathrm{origin}}), \tag{9}$$
where $\mathrm{accuracy}_{\mathrm{origin}}$ is the top-1 classification accuracy of the full-precision model on the training set, $\mathrm{accuracy}_{\mathrm{quant}}$ is the top-1 classification accuracy of the quantized model after finetuning, and $\lambda$ is a scaling factor which is set to 0.1 in our experiments.
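
Eq. (9) is a one-line computation. In the sketch below the accuracies are hypothetical values expressed as fractions; whether accuracy is measured as a fraction or a percentage is not stated here and only rescales the reward.

```python
def reward(acc_quant: float, acc_origin: float, lam: float = 0.1) -> float:
    """Eq. (9): scaled accuracy gap between the quantized and original model."""
    return lam * (acc_quant - acc_origin)

# Hypothetical accuracies: the quantized model loses two points of top-1 accuracy.
r = reward(acc_quant=0.68, acc_origin=0.70)
```

The reward is negative whenever quantization costs accuracy and approaches zero as the quantized model recovers the original accuracy.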
3.6 Agent
In our environment, one step means that our agent makes an
action to decide the number of bits assigned to the weights
or activations of a specific layer, while one episode is com-
posed of multiple steps, where our RL agent makes actions
to all layers. As for our RL agent, we leverage the deep de-
terministic policy gradient (DDPG) (Lillicrap et al 2016),
which is an off-policy actor-critic algorithm for continuous
control problem. We apply a variant form of the Bellman’s
Equation, where each transition in an episode is defined as
$$T_k = (O_k, a_k, R, O_{k+1}). \tag{10}$$
During exploration, the Q-function is computed as
$$\hat{Q}_k = R_k - B + \gamma \times Q(O_{k+1}, w(O_{k+1}) \mid \theta^{Q}), \tag{11}$$
and the gradient signal can be approximated using
$$L = \frac{1}{N_s} \sum_{k=1}^{N_s} \left(\hat{Q}_k - Q(O_k, a_k \mid \theta^{Q})\right)^2, \tag{12}$$
where $N_s$ denotes the number of steps in this episode, and the baseline $B$ is defined as an exponential moving average of all previous rewards in order to reduce the variance of the gradient estimation. The discount factor $\gamma$ is set to 1 since we assume that the action made for each layer should contribute
equally to the final result. Moreover, as the number of steps
is always finite (bounded by the number of layers), the sum
of the rewards will not explode.
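
The critic target of Eq. (11) and the loss of Eq. (12) can be written without any deep learning library once the critic outputs are given as plain numbers. This is a sketch under stated assumptions: the function names are hypothetical, the critic values are made-up inputs, and the moving-average decay of 0.95 for the baseline $B$ is not taken from the text.

```python
def critic_targets(rewards, next_q_values, baseline, gamma=1.0):
    """Eq. (11): target = R_k - B + gamma * Q(O_{k+1}, w(O_{k+1}))."""
    return [r - baseline + gamma * q_next
            for r, q_next in zip(rewards, next_q_values)]

def critic_loss(targets, q_values):
    """Eq. (12): mean squared error between targets and critic predictions."""
    return sum((t - q) ** 2 for t, q in zip(targets, q_values)) / len(targets)

def update_baseline(baseline, episode_reward, decay=0.95):
    """Exponential moving average of past rewards (decay value is an assumption)."""
    return decay * baseline + (1 - decay) * episode_reward

# Made-up numbers for a two-step episode.
targets = critic_targets(rewards=[1.0, 1.0], next_q_values=[0.5, 0.0], baseline=0.2)
loss = critic_loss(targets, q_values=[1.0, 1.0])
new_baseline = update_baseline(0.2, episode_reward=1.0)
```

Subtracting the baseline centers the targets around recent performance, so an episode is reinforced only to the extent that it beats the running average.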
3.7 Implementation Details
In this section, we present some implementation details about
the RL agent, exploration and finetuning quantized models.
Agent. The DDPG agent consists of an actor network and
a critic network. Both follow the same network architecture:
each network has 3 fully-connected layers with the hidden
size of [400,300]. For the actor network, the input is the state
vector, and the output action is normalized to [0,1] by the
sigmoid function; while for the critic network, the input is
a vector concatenated by state and its corresponding action
produced by actor.
Exploration. Optimization of the DDPG agent is carried
out using ADAM (Kingma and Ba 2015) with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We use a fixed learning rate of $10^{-4}$ for the actor network and $10^{-3}$ for the critic network. During exploration, we employ the following stochastic process of the noise:
$$w'(O_k) \sim N_{\mathrm{trunc}}(w(O_k \mid \theta^{w}_{k}), \sigma^2, 0, 1), \tag{13}$$
where $N_{\mathrm{trunc}}(\mu, \sigma^2, a, b)$ is the truncated normal distribution, and $w$ is the model weights. The noise $\sigma$ is initialized as 0.5, and after each episode, the noise is decayed exponentially with a decay rate of 0.99.
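
The exploration noise of Eq. (13) can be sketched with rejection sampling from a normal distribution. The helper names are hypothetical, rejection sampling is one of several ways to draw from a truncated normal, and the function takes the standard deviation `sigma` as its argument, whereas Eq. (13) is written with $\sigma^2$.

```python
import random

def truncated_normal(mean, sigma, low=0.0, high=1.0, rng=random):
    """Draw from a normal(mean, sigma) restricted to [low, high] by rejection."""
    if sigma <= 0:
        return max(low, min(high, mean))
    while True:
        x = rng.gauss(mean, sigma)
        if low <= x <= high:
            return x

def sigma_after(episodes, sigma0=0.5, decay=0.99):
    """Noise level after a number of episodes of exponential decay."""
    return sigma0 * decay ** episodes

random.seed(0)
actor_output = 0.6   # hypothetical action proposed by the actor
noisy_actions = [truncated_normal(actor_output, sigma_after(ep)) for ep in range(300)]
```

Early episodes therefore explore almost the whole range of bitwidths, while later episodes stay close to the actor's own proposal.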
Finetuning. During exploration, we finetune the quantized model for one epoch to help recover the performance (using SGD with a fixed learning rate of $10^{-3}$ and momentum of 0.9). We randomly select 100 categories from ImageNet (Deng et al 2009) to accelerate the model finetuning during exploration. After exploration, we quantize the model with our best policy and finetune it on the full dataset.
4 Experiments
We conduct extensive experiments to demonstrate the con-
sistent effectiveness of our framework for multiple objec-
tives: latency, energy, model size, and accuracy.
Datasets and Models. Our experiments are performed on
the ImageNet (Deng et al 2009) dataset. As our focus is on
more efficient models, we extensively study the quantiza-
tion of MobileNet-V1 (Howard et al 2017) and MobileNet-
V2 (Sandler et al 2018). Both MobileNets are inspired by the depthwise separable convolutions (Chollet 2017) and replace the regular convolutions with the pointwise and depthwise convolutions: MobileNet-V1 stacks multiple “depthwise – pointwise” blocks repeatedly, while MobileNet-V2 uses the “pointwise – depthwise – pointwise” blocks as its basic building primitives.

        Hardware    Batch   PE Array   AXI port   Block RAM
Edge    Zynq-7020   1       8×8        4×64b      140×36Kb
Cloud   VU9P        16      16×16      4×256b     2160×36Kb
Table 2: Configurations of edge and cloud accelerators.
4.1 Latency-Constrained Quantization
We first evaluate our framework under latency constraints on
two representative hardware architectures: spatial and tem-
poral architectures for multi-precision CNN:
Temporal Architecture. Bit-Serial Matrix Multiplication
Overlay (BISMO) proposed by Umuroglu et al (2018) is
a classic temporal design of neural network accelerator on
FPGA. It introduces bit-serial multipliers which are fed with
one-bit digits from 256 weights and corresponding activa-
tions in parallel at one time and accumulates their partial
products by shifting over time.
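
The bit-serial principle can be illustrated in software. The sketch below is a simplified model, not the BISMO implementation: it handles unsigned integers only, packs one bit of every operand into a bit plane, and combines the planes with AND, popcount, and shift. It iterates from the least significant bit, which gives the same sum as an MSB-first schedule; the function names and the cycle counter are assumptions for illustration.

```python
def bit_plane(values, bit):
    """Pack bit `bit` of every value into one integer (element i -> mask bit i)."""
    mask = 0
    for i, v in enumerate(values):
        mask |= ((v >> bit) & 1) << i
    return mask

def bit_serial_dot(weights, activations, w_bits, a_bits):
    """Dot product of unsigned integers, one pair of bit planes per cycle."""
    total = 0
    cycles = 0
    for i in range(w_bits):
        w_plane = bit_plane(weights, i)
        for j in range(a_bits):
            a_plane = bit_plane(activations, j)
            partial = bin(w_plane & a_plane).count("1")  # AND, then popcount
            total += partial << (i + j)                  # shift by bit significance
            cycles += 1
    return total, cycles

weights = [3, 5, 7, 1]        # fit in 3 bits
activations = [2, 4, 6, 15]   # fit in 4 bits
result, cycles = bit_serial_dot(weights, activations, w_bits=3, a_bits=4)
```

In this simplified model the number of cycles is the product of the two bitwidths, which gives an intuition for why lowering either the weight or the activation bitwidth shortens latency on a temporal architecture.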
Spatial Architecture. BitFusion architecture proposed by
Sharma et al (2018) is a state-of-the-art spatial ASIC design
for neural network accelerator. It employs a 2D systolic ar-
ray of Fusion Units which spatially sum the shifted partial
products of two-bit elements from weights and activations.
4.1.1 Quantization policy for BISMO Architecture
Inference of neural networks on edge devices and on cloud
servers can be quite different, since the tasks on the cloud
servers are more intensive and the edge devices are usually
limited to low computation resources and memory bandwidth.
We use Xilinx Zynq-7020 FPGA (Xilinx 2018b) as
our edge device and Xilinx VU9P (Xilinx 2018a) as our
cloud device. Table 2 shows our experiment configurations
on these two platforms along with their available resources.
For comparison, we adopt PACT (Choi et al 2018) as our
baseline, which uses the same number of bits for all layers
except the first layer, which extracts the low-level features:
there, they use 8 bits for both weights and activations, as
it has fewer parameters and is very sensitive to errors. We
follow a similar setup as PACT: we quantize the weights and
activations of the first and last layer to 8 bits and explore the
bitwidth allocation policy for all the other layers.
Under the same latency, HAQ consistently achieved bet-
ter accuracy than the baseline on both the cloud and the edge
(Table 3). With similar accuracy, HAQ can reduce the la-
tency by 1.4× to 1.95× compared with the baseline.
Edge Accelerator
                      MobileNet-V1                    MobileNet-V2
          Bitwidths   Acc.-1   Acc.-5   Latency       Acc.-1   Acc.-5   Latency
PACT      4 bits      62.44    84.19    45.45 ms      61.39    83.72    52.15 ms
Ours      flexible    67.40    87.90    45.51 ms      66.99    87.33    52.12 ms
PACT      5 bits      67.00    87.65    57.75 ms      68.84    88.58    66.94 ms
Ours      flexible    70.58    89.77    57.70 ms      70.90    89.91    66.92 ms
PACT      6 bits      70.46    89.59    70.67 ms      71.25    90.00    82.49 ms
Ours      flexible    71.20    90.19    70.35 ms      71.89    90.36    82.34 ms
Original  8 bits      70.82    89.85    96.20 ms      71.81    90.25    115.84 ms

Cloud Accelerator
                      MobileNet-V1                    MobileNet-V2
          Bitwidths   Acc.-1   Acc.-5   Latency       Acc.-1   Acc.-5   Latency
PACT      4 bits      62.44    84.19    57.49 ms      61.39    83.72    74.46 ms
Ours      flexible    65.33    86.60    57.40 ms      67.01    87.46    73.97 ms
PACT      5 bits      67.00    87.65    77.52 ms      68.84    88.58    99.43 ms
Ours      flexible    69.97    89.37    77.49 ms      69.45    88.94    99.07 ms
PACT      6 bits      70.46    89.59    99.86 ms      71.25    90.00    127.07 ms
Ours      flexible    71.20    90.08    99.66 ms      71.85    90.24    127.03 ms
Original  8 bits      70.82    89.85    151.09 ms     71.81    90.25    189.82 ms
Table 3: Latency-constrained quantization on BISMO (edge and cloud accelerator) on ImageNet. Our framework can reduce
the latency by 1.4× to 1.95× with negligible loss of accuracy compared with the fixed bitwidth (8 bits) quantization.
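
The latency reductions relative to 8-bit quantization can be recomputed from the entries of Table 3. The sketch below copies the latencies of the original 8-bit models and of our flexible policies at the two larger latency budgets (those matching PACT 5 bits and PACT 6 bits). Which rows the quoted range is based on is an assumption of this example; the budget labels are illustrative names.

```python
# Latencies in ms copied from Table 3 (BISMO); keys are (accelerator, model).
original_8bit = {
    ("edge", "MobileNet-V1"): 96.20, ("edge", "MobileNet-V2"): 115.84,
    ("cloud", "MobileNet-V1"): 151.09, ("cloud", "MobileNet-V2"): 189.82,
}
ours = {
    "budget of PACT 5 bits": {
        ("edge", "MobileNet-V1"): 57.70, ("edge", "MobileNet-V2"): 66.92,
        ("cloud", "MobileNet-V1"): 77.49, ("cloud", "MobileNet-V2"): 99.07,
    },
    "budget of PACT 6 bits": {
        ("edge", "MobileNet-V1"): 70.35, ("edge", "MobileNet-V2"): 82.34,
        ("cloud", "MobileNet-V1"): 99.66, ("cloud", "MobileNet-V2"): 127.03,
    },
}

# Speedup = latency of the 8-bit model / latency of the flexible policy.
speedups = {
    (budget, target): original_8bit[target] / latency
    for budget, rows in ours.items()
    for target, latency in rows.items()
}
for key, value in sorted(speedups.items(), key=lambda item: item[1]):
    print(key, f"{value:.2f}x")
```

The resulting ratios run from roughly 1.4× up to 1.95×, in line with the range stated in the caption.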
[Figure 4 plot: number of weight bits and activation bits (#bit) versus layer index for pointwise and depthwise layers of MobileNet-V1, with one panel for the edge accelerator and one for the cloud accelerator. Annotations: "depthwise: fewer bits, pointwise: more bits" (edge); "depthwise: more bits, pointwise: fewer bits" (cloud).]
Fig. 4: Quantization policy under latency constraints for MobileNet-V1 on BISMO (57.7 ms for the edge accelerator and 77.5 ms for the cloud accelerator). On the edge accelerator, our agent allocates fewer activation bits to the depthwise convolutions, which reflects that the depthwise convolutions are memory-bound and that the activations dominate the memory access. On the cloud accelerator, our agent allocates more bits to the depthwise convolutions and fewer bits to the pointwise convolutions: because the cloud device has more memory bandwidth and higher parallelism, the network appears to be computation-bound.
Method    Weights   Activations  Acc.-1  Acc.-5  Latency
PACT      4 bits    4 bits       62.44   84.19    7.86 ms
Ours      flexible  flexible     67.45   87.85    7.86 ms
PACT      6 bits    4 bits       67.51   87.84   11.10 ms
Ours      flexible  flexible     70.40   89.69   11.09 ms
PACT      6 bits    6 bits       70.46   89.59   19.99 ms
Ours      flexible  flexible     70.90   89.95   19.98 ms
Original  8 bits    8 bits       70.82   89.85   20.08 ms
Table 4: Latency-constrained quantization on BitFusion (for MobileNet-V1 on ImageNet). Our HAQ framework can reduce the latency by nearly 2× with almost no loss of accuracy compared with the fixed-bitwidth (8 bits) quantization.

Interpreting the quantization policy. Our agent produces quite different quantization policies for the edge and cloud accelerators. For the activations, the depthwise convolution layers are assigned much lower bitwidths than the pointwise layers on the edge, while on the cloud device the bitwidths of the two types of layers are similar. For the weights, the bitwidths of the two types of layers are nearly the same on the edge, while on the cloud the depthwise convolution layers are assigned much higher bitwidths than the pointwise convolution layers.

We explain the difference in quantization policy between edge and cloud with the roofline model (Williams et al 2009). Many previous works use FLOPs or BitOPs as metrics to measure computation complexity. However, these metrics do not directly reflect the latency, since many other factors influence hardware performance, such as memory access cost and degree of parallelism (Sandler et al 2018; Liu et al 2017). Taking both computation and memory access into account, the roofline model assumes that applications that do not fit in on-chip caches are either computation-bound or memory bandwidth-bound, depending on their operation intensity. Operation intensity is measured as operations (MACs in
neural networks) per DRAM byte accessed. A lower operation intensity indicates that the model suffers more from memory access.

[Figure 5 plot: per-layer #weight bits and #activation bits for pointwise and depthwise layers in Edge and Cloud panels, with log # OPs per Byte per layer below; x-axis: layer index (2 to 50). Annotations mark the downsample layers and note that depthwise layers get fewer bits in the first few layers and more bits in the last few layers.]
Fig. 5: Quantization policy under latency constraints for MobileNet-V2 on BISMO (66.9 ms for the edge accelerator and 99.1 ms for the cloud accelerator). Similar to Figure 4, depthwise layers are assigned fewer bits on the edge accelerator, and pointwise layers are assigned fewer bits on the cloud accelerator.

Method    Weights   Activations  Acc.-1  Acc.-5  Energy
PACT      4 bits    4 bits       62.44   84.19   13.47 mJ
Ours      flexible  flexible     64.78   85.85   13.69 mJ
PACT      6 bits    4 bits       67.51   87.84   16.57 mJ
Ours      flexible  flexible     70.37   89.40   16.30 mJ
PACT      6 bits    6 bits       70.46   89.59   26.80 mJ
Ours      flexible  flexible     70.90   89.73   26.67 mJ
Original  8 bits    8 bits       70.82   89.95   31.03 mJ
Table 5: Energy-constrained quantization on BitFusion (for MobileNet-V1 on ImageNet). Our HAQ framework reduces the energy consumption by nearly 2× with nearly no loss of accuracy compared with the fixed-bitwidth quantization.
The bottom of Figure 4 shows the operation intensities (operations per byte) of the convolution layers in MobileNet-V1. Depthwise convolution is a memory-bound operation, and pointwise convolution is a computation-bound operation. Our experiments show that when running MobileNet-V1 on the edge device with a small batch size, its latency is dominated by the depthwise convolution layers. Since the feature maps account for a major proportion of the memory footprint of depthwise convolution layers, our agent gives the activations fewer bits. In contrast, when running MobileNet-V1 on the cloud with a large batch size, the two types of layers have nearly equal influence on the speed. Therefore, our agent tries
to reduce the bitwidth of both activations and weights. However, since the weights of the depthwise convolution layers take up a small proportion of the memory, our agent increases their bitwidth to preserve the network accuracy at low memory overhead. Figure 6 shows the roofline model before and after HAQ. HAQ allocates the bits for each layer more sensibly and pushes all the points toward the upper right corner, which is more efficient.

[Figure 6 plot: two roofline panels, one for depthwise layers and one for pointwise layers; x-axis: Ops/Byte, y-axis: GOps/s (higher is better); series: fixed 8-bit and ours. Annotations: HAQ improves the roofline performance of the depthwise layers and of the pointwise layers.]
Fig. 6: Roofline model of pointwise layers in MobileNet-V1 (fixed-bitwidth in blue and mixed-precision in red). Our mixed-precision framework improves the roofline performance by a large margin.
On the edge accelerator, which has much less memory bandwidth, our RL agent allocates fewer activation bits to the depthwise convolutions since the activations dominate the memory access. On the cloud accelerator, which has more memory bandwidth, our agent allocates more bits to the depthwise convolutions and fewer bits to the pointwise convolutions to prevent the network from being computation-bound.
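The roofline argument above can be made concrete with a small calculation. The sketch below is illustrative only: the layer shapes are hypothetical values chosen for the example, not measurements on BISMO, and the byte count assumes that weights, input activations, and output activations each cross DRAM exactly once. The helper names (`conv_layer_cost`, `operation_intensity`, `attainable_throughput`) are introduced here for illustration.

```python
def conv_layer_cost(c_in, c_out, kernel, h_out, w_out, w_bits, a_bits,
                    depthwise=False, stride=1):
    """Return (MACs, DRAM bytes) for one convolution layer.

    Assumption: weights, inputs, and outputs are each moved once.
    """
    if depthwise:
        # One kernel x kernel filter per channel, so c_out equals c_in.
        params = c_in * kernel * kernel
        c_out = c_in
    else:
        params = c_in * c_out * kernel * kernel
    macs = params * h_out * w_out
    h_in, w_in = h_out * stride, w_out * stride
    weight_bytes = params * w_bits / 8
    act_bytes = (c_in * h_in * w_in + c_out * h_out * w_out) * a_bits / 8
    return macs, weight_bytes + act_bytes


def operation_intensity(macs, dram_bytes):
    """Operations (MACs) per DRAM byte accessed."""
    return macs / dram_bytes


def attainable_throughput(intensity, peak_macs_per_s, bandwidth_bytes_per_s):
    """Roofline bound: limited by compute or by memory traffic."""
    return min(peak_macs_per_s, bandwidth_bytes_per_s * intensity)


# Hypothetical 256-channel layers on a 14 x 14 feature map.
dw_macs, dw_bytes = conv_layer_cost(256, 256, 3, 14, 14, 8, 8, depthwise=True)
pw_macs, pw_bytes = conv_layer_cost(256, 256, 1, 14, 14, 8, 8)
print("depthwise intensity:", operation_intensity(dw_macs, dw_bytes))
print("pointwise intensity:", operation_intensity(pw_macs, pw_bytes))
```

With 8-bit weights and activations, the depthwise layer in this example has an operation intensity of roughly 4.4 MACs per byte, against roughly 77 for the pointwise layer. Lowering `a_bits` for the depthwise layer shrinks its byte count and moves it to the right on the roofline, which is the effect the agent exploits on the edge accelerator.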
A similar phenomenon can be observed in Figure 5 for quantizing MobileNet-V2. Moreover, since the activation size gets smaller in the deeper layers, these layers are assigned more bits. Another interesting phenomenon in Figure 5 is that the downsample layer is assigned more activation bits than the adjacent layer. This is because downsample layers are more prone to losing information, so our agent learns to assign more bits to their activations to compensate.
4.1.2 Quantization policy for BitFusion Architecture
In order to demonstrate the effectiveness of our framework on different hardware architectures, we further compare our framework with PACT (Choi et al 2018) under latency constraints on the BitFusion (Sharma et al 2018) architecture. As shown in Table 4, our framework performs much better than the hand-crafted policy at the same latency. It also achieves almost no degradation of accuracy with nearly half of the latency of the original MobileNet-V1 model (from 20.08 to 11.09 ms). Our framework is therefore flexible and can be applied to different hardware platforms.
4.2 Energy-Constrained Quantization
We then evaluate our framework under energy constraints on the BitFusion (Sharma et al 2018) architecture. Similar to the latency-constrained experiments, we compare our framework with PACT (Choi et al 2018), which uses a fixed number of bits for both weights and activations. From Table 5, we can see that our framework outperforms the rule-based baseline: it achieves higher accuracy while consuming a similar amount of energy. In particular, our framework achieves almost no loss of accuracy with nearly half of the energy consumption of the original MobileNet-V1 model (from 31.03 to 16.30 mJ), which suggests that flexible bitwidths can indeed help reduce the energy consumption.
4.3 Model Size-Constrained Quantization
We further evaluate our HAQ framework under the model
size constraints. Following Han et al (2016), we employ the
k-means algorithm to quantize the values into k centroids
instead of using the linear quantization for compression.
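As a rough illustration of the k-means quantization mentioned above, the sketch below clusters a list of weights into 2^b centroids with plain Lloyd iterations and linearly spaced initial centroids. It is a simplified pure-Python stand-in, not the implementation used in our experiments; the function names `kmeans_quantize` and `dequantize`, the initialization, and the iteration count are assumptions made for the example.

```python
def kmeans_quantize(weights, bits, iters=20):
    """Cluster `weights` into 2**bits centroids (1-D Lloyd's algorithm).

    Returns (centroids, indices): weight i is represented by
    centroids[indices[i]], so only `bits` bits per weight plus the
    codebook need to be stored.
    """
    if bits < 1:
        raise ValueError("bits must be at least 1")
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    if lo == hi:  # degenerate case: every weight has the same value
        return [lo], [0] * len(weights)
    # Linearly spaced initial centroids over the weight range.
    centroids = [lo + (hi - lo) * j / (k - 1) for j in range(k)]

    def assign(cs):
        # Index of the nearest centroid for every weight.
        return [min(range(len(cs)), key=lambda j: abs(w - cs[j]))
                for w in weights]

    for _ in range(iters):
        indices = assign(centroids)
        for j in range(k):
            members = [w for w, i in zip(weights, indices) if i == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = sum(members) / len(members)
    return centroids, assign(centroids)


def dequantize(centroids, indices):
    """Reconstruct the quantized weights from the codebook."""
    return [centroids[i] for i in indices]


weights = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1, 2.0, 2.1]
centroids, indices = kmeans_quantize(weights, bits=2)
print(centroids, indices)
```

Unlike linear quantization, the centroids are not evenly spaced: they follow the weight distribution, which is why a small codebook can still represent clustered weights well.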
We compare our framework with Deep Compression (Han et al 2016) on MobileNets and ResNet-50. From Table 6, we can see that our framework performs much better than Deep Compression: it achieves higher accuracy with the same model size. For MobileNets, which are already compactly designed, our framework preserves the performance to some extent, while Deep Compression significantly degrades it, especially when the model size is very small. For instance, when Deep Compression quantizes the weights of MobileNet-V1 to 2 bits, the accuracy drops from 70.90 to 37.62, while our framework still achieves an accuracy of 57.14 at the same model size because it makes full use of the flexible bitwidths.
Discussions. In Figure 7, we visualize the bitwidth allocation strategy for MobileNet-V2. From this figure, we can observe that our framework assigns more bits to the weights in depthwise convolution layers than to those in pointwise convolution layers. Intuitively, this is because the number of parameters in the former is much smaller than in the latter. Comparing Figure 5 and Figure 7, the policies are drastically different under different optimization objectives (fewer bits for depthwise convolutions under latency optimization, more bits for depthwise convolutions under model size optimization). Our framework succeeds in learning to adjust its bitwidth policy under different constraints.
4.4 Accuracy-Guaranteed Quantization
Apart from the resource-constrained experiments, we also
evaluate our framework under the accuracy-guaranteed sce-
nario, that is to say, we aim to minimize the resource (i.e.,
latency and energy) we use while preserving the accuracy.
Instead of using the resource-constrained action space in Section 3.2, we define a new reward function $R$ that takes both the resource and the accuracy into consideration:

$$R = R_{\text{latency}} + R_{\text{energy}} + R_{\text{accuracy}}. \tag{14}$$

Here, the reward functions $R_{*}$ are defined to encourage each term to be as good as possible:

$$\begin{aligned}
R_{\text{latency}} &= \lambda_{\text{latency}} \times (\text{latency}_{\text{quant}} - \text{latency}_{\text{origin}}),\\
R_{\text{energy}} &= \lambda_{\text{energy}} \times (\text{energy}_{\text{quant}} - \text{energy}_{\text{origin}}),\\
R_{\text{accuracy}} &= \lambda_{\text{accuracy}} \times (\text{accuracy}_{\text{quant}} - \text{accuracy}_{\text{origin}}),
\end{aligned} \tag{15}$$

where $\lambda_{*}$ are scaling factors that encourage the RL agent to trade off between the computation resource and the accuracy. We set $\lambda_{\text{latency}}$ and $\lambda_{\text{energy}}$ to 1, and $\lambda_{\text{accuracy}}$ to 20 in our experiments to ensure that our RL agent will prioritize the accuracy over the computation resource.
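A direct transcription of Eqs. (14) and (15) is sketched below. The function name `haq_reward` and the argument names are hypothetical, and the units and any normalization of the inputs are not specified here, so they are left to the caller. The sign convention is taken literally from Eq. (15): with positive scaling factors, a quantized model that is faster than the original contributes a negative latency term, so the scaling factors are kept as caller-supplied parameters rather than hard-coded.

```python
def haq_reward(latency_quant, latency_origin,
               energy_quant, energy_origin,
               accuracy_quant, accuracy_origin,
               lam_latency=1.0, lam_energy=1.0, lam_accuracy=20.0):
    """Combined reward of Eq. (14), built from the terms of Eq. (15).

    Defaults follow the stated settings (1, 1, and 20); units of the
    inputs are an assumption left to the caller.
    """
    r_latency = lam_latency * (latency_quant - latency_origin)
    r_energy = lam_energy * (energy_quant - energy_origin)
    r_accuracy = lam_accuracy * (accuracy_quant - accuracy_origin)
    return r_latency + r_energy + r_accuracy


# Made-up numbers, only to show how the terms combine.
print(haq_reward(10.0, 20.0, 15.0, 30.0, 0.70, 0.71))
```

Because the accuracy factor is 20 times larger than the other two, a given change in accuracy moves the reward 20 times as much as the same numerical change in latency or energy.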
We choose to perform our experiments on a ten-category
subset of ImageNet as it is very challenging to preserve the
accuracy while reducing the computation resource. In Fig-
ure 8, we illustrate the exploration curves of our RL agents,
and we can observe that the exploration process can be di-
vided into three phases. In the first phase, our RL agent puts
its focus on the accuracy: it tries to preserve the accuracy
while completely ignoring the latency and the energy con-
sumption. In the second phase, the accuracy begins to be
more stable, and our RL agent starts to aggressively reduce the latency and the energy. In the third phase, our RL agent converges to the best policy it has found. We conjecture that this behavior arises because the scaling factor $\lambda_{\text{accuracy}}$ is much larger than the other two, which encourages our agent to first optimize the accuracy and, after its value has stabilized, to reduce the latency and energy to further improve the reward (see the reward curve in Figure 8).

                            MobileNet-V1                MobileNet-V2                ResNet-50
Method            Weights   Acc.-1  Acc.-5  Model Size  Acc.-1  Acc.-5  Model Size  Acc.-1  Acc.-5  Model Size
Han et al (2016)  2 bits    37.62   64.31   1.09 MB     58.07   81.24   0.96 MB     68.95   88.68   6.32 MB
Ours              flexible  57.14   81.87   1.09 MB     66.75   87.32   0.95 MB     70.63   89.93   6.30 MB
Han et al (2016)  3 bits    65.93   86.85   1.60 MB     68.00   87.96   1.38 MB     75.10   92.33   9.36 MB
Ours              flexible  67.66   88.21   1.58 MB     70.90   89.76   1.38 MB     75.30   92.45   9.22 MB
Han et al (2016)  4 bits    71.14   89.84   2.10 MB     71.24   89.93   1.79 MB     76.15   92.88   12.40 MB
Ours              flexible  71.74   90.36   2.07 MB     71.47   90.23   1.79 MB     76.14   92.89   12.14 MB
Original          32 bits   70.90   89.90   16.14 MB    71.87   90.32   13.37 MB    76.15   92.86   97.49 MB
Table 6: Model size-constrained quantization on ImageNet. Compared with Deep Compression (Han et al 2016), our framework achieves higher accuracy under similar model size (especially under high compression ratio).

[Figure 7 plot: per-layer #weight bits for pointwise and depthwise layers together with log #params per layer; x-axis: layer index. Annotation: more params, fewer bits.]
Fig. 7: Quantization policy under model size constraints for MobileNet-V2. Our RL agent allocates more bits to the depthwise convolutions, since depthwise convolutions have fewer parameters.
4.5 Integration with Architecture Search and Pruning
We integrate neural architecture search (Cai et al 2019) and automated channel pruning (He et al 2018) with HAQ to demonstrate that our method is orthogonal to other AutoML methods. In Figure 9, we observe significant improvements over the baselines, including ProxylessNAS (with 8-bit quantization), ProxylessNAS + AMC (with 8-bit quantization), MobileNetV2 (with 4-bit / 6-bit quantization), and MobileNetV2 + HAQ (with mixed-precision quantization).
5 Analysis
In this section, we first compare the sample efficiency of different optimization methods; then, we show the generalization and transfer learning ability of our framework; finally, we interpret the quantization policy given by HAQ.

Method     Search Time  Acc.-1  Acc.-5  Latency
ES         17 hours     65.73   86.81   45.45 ms
BO         74 hours     66.28   87.22   45.47 ms
RL (Ours)  17 hours     67.40   87.90   45.51 ms
ES         17 hours     69.11   88.80   57.73 ms
BO         74 hours     70.40   89.56   57.68 ms
RL (Ours)  17 hours     70.58   89.77   57.70 ms
Table 7: Comparison between different optimization methods. RL outperforms EA and BO and is 4× faster than BO in terms of the total search time.
5.1 Optimization Methods
We use reinforcement learning (RL) as our optimization method. We also compare with other optimizers, including Bayesian optimization (BO) and an evolutionary algorithm (EA). Similar to the configuration of RL, we model the outputs of BO and EA as the number of bits of different layers, and their objective as maximizing the validation accuracy of the quantized model.
[Figure 8 plot: reward, energy, latency, and accuracy versus number of episodes, divided into Phase I, Phase II, and Phase III; annotations: preserve accuracy, reduce latency, reduce energy.]
Fig. 8: Exploration curves of accuracy-guaranteed quantiza-
tion for MobileNet-V1. Our RL agent first tries to preserve
the accuracy while completely ignoring the latency and the
energy consumption; after the accuracy begins to be more
stable, it starts to aggressively reduce the latency and the
energy; it finally converges to the best policy it has found.
Fig. 9: Integrating NAS and AMC with HAQ further improves the accuracy-latency trade-off by a significant margin.
For a fair comparison, we execute 600 runs (samples) in total for each optimization method. The performance comparison is in Table 7. All experiments are conducted on the BISMO hardware with MobileNet-V1. We observe that RL performs better than EA and BO and is about 4× faster than BO in terms of the total search time. We believe this is because BO and EA do not make use of the state encoding, which might lead to worse sample efficiency.
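To make the baseline setup concrete, the sketch below shows one way an evolutionary search over per-layer bitwidths could be organized: a candidate is a list with one bitwidth per layer, and each generation keeps the best feasible candidates and mutates them. Everything here is a toy assumption: `toy_accuracy` is a hypothetical stand-in for validating a quantized model, a total bit budget replaces the latency constraint, and the population size and mutation rule are arbitrary. It is not the EA configuration behind Table 7.

```python
import random


def toy_accuracy(bits, sensitivity):
    # Hypothetical proxy: a layer loses more "accuracy" as its bitwidth
    # drops, scaled by a per-layer sensitivity. Higher is better.
    return -sum(s / (2 ** b) for b, s in zip(bits, sensitivity))


def evolutionary_search(sensitivity, budget, generations=30, population=20,
                        min_bits=2, max_bits=8, seed=0):
    """Search per-layer bitwidths whose sum stays within `budget`."""
    n = len(sensitivity)
    if budget < min_bits * n:
        raise ValueError("budget too small for the minimum bitwidth")
    rng = random.Random(seed)

    def mutate(bits):
        # Change one randomly chosen layer by +/- 1 bit, clipped to range.
        child = list(bits)
        i = rng.randrange(n)
        child[i] = min(max_bits, max(min_bits, child[i] + rng.choice((-1, 1))))
        return child

    # Start from the uniform policy that fits the budget.
    start = [max(min_bits, min(max_bits, budget // n))] * n
    parents = [start]
    for _ in range(generations):
        children = [mutate(rng.choice(parents)) for _ in range(population)]
        pool = [c for c in parents + children if sum(c) <= budget]
        pool.sort(key=lambda b: toy_accuracy(b, sensitivity), reverse=True)
        parents = pool[: max(1, population // 4)]  # elitist selection
    return parents[0]


sensitivity = [8.0, 1.0, 1.0, 1.0]  # made-up per-layer sensitivities
print(evolutionary_search(sensitivity, budget=16))
```

Note that each candidate is scored in isolation: nothing in this loop uses per-layer information such as channel counts or kernel sizes, which is the kind of state encoding the RL agent has access to.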
Method                   Bitwidth  Acc.-1  Acc.-5  Latency
PACT                     4 bits    61.39   83.72   52.15 ms
Ours (search for V2)     flexible  66.99   87.33   52.12 ms
Ours (transfer from V1)  flexible  65.80   86.60   52.06 ms
PACT                     5 bits    68.84   88.58   66.94 ms
Ours (search for V2)     flexible  70.90   89.91   66.92 ms
Ours (transfer from V1)  flexible  69.90   89.24   66.93 ms
Table 8: Comparisons between our agent’s transfer results (from MobileNet-V1 to MobileNet-V2), its direct search results on MobileNet-V2, and the fixed-bitwidth baseline (PACT). Our RL agent is able to generalize well to different network architectures: our quantization policy transferred from V1 to V2 performs better than the fixed-bitwidth baseline and is only slightly worse than the quantization policy directly searched for V2.
Constraint  Bitwidth  BitOPs   Latency   Acc.-1  Acc.-5
BitOPs      flexible  8.17 G   85.06 ms  70.29   89.52
Latency     flexible  8.01 G   66.92 ms  70.90   89.91
BitOPs      flexible  11.36 G  97.99 ms  71.41   90.12
Latency     flexible  11.17 G  82.34 ms  71.89   90.36
Table 9: BitOPs-constrained quantization on BISMO (MobileNet-V2 on ImageNet).
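For reference, a BitOPs count of the kind used as the constraint in Table 9 can be sketched as follows. The sketch assumes the common convention that a layer's BitOPs equal its MAC count multiplied by its weight bitwidth and its activation bitwidth; the function name `total_bitops` and the layer values are hypothetical and only illustrate the bookkeeping.

```python
def total_bitops(layers):
    """Sum MACs * weight_bits * activation_bits over all layers.

    `layers` is an iterable of (macs, weight_bits, activation_bits).
    """
    return sum(macs * w_bits * a_bits for macs, w_bits, a_bits in layers)


# Two made-up policies for the same single layer: the bitwidths of
# weights and activations are swapped.
policy_a = [(1_000_000, 4, 8)]
policy_b = [(1_000_000, 8, 4)]
print(total_bitops(policy_a), total_bitops(policy_b))
```

The two policies have identical BitOPs, yet on a memory-bound layer whose traffic is dominated by activations they move different amounts of data. A proxy of this form cannot tell them apart, which is one reason a BitOPs constraint and a measured latency constraint can lead to different policies.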
5.2 Generalization and Transfer Learning
Another merit of reinforcement learning is that its agent is able to generalize to different environments (i.e., network architectures). In order to evaluate the transfer ability of our framework, we first train our agent on MobileNet-V1 under the latency constraint, and we then directly evaluate it on MobileNet-V2 by feeding in the network architecture information of MobileNet-V2. In Table 8, we compare our agent’s transfer results (from V1 to V2) with its direct search results (for V2) and the fixed-bitwidth baseline (PACT). Our quantization policy transferred from V1 to V2 still performs better than the fixed-bitwidth baseline and is only slightly worse than the quantization policy directly searched for V2. This experiment validates that our RL agent generalizes well to different network architectures.
5.3 Importance of Hardware-Awareness
To evaluate the necessity of involving the hardware in the loop, we replace the BISMO accelerator with a theoretical BitOPs analysis, which calculates the latency by $\text{FLOPs/s} \times \text{Bit}_{\text{weight}} \times \text{Bit}_{\text{activation}}$ for each layer. The results are listed in Table 9, which shows that under similar BitOPs constraints,
[Plot (a): activation bits per layer (y-axis 4 to 8) versus layer index, comparing stride=2 and stride=1 layers.]
(a) The agent allocates more activation bits to stride=2 layers, which compensates for the information loss from downsampling.
[Plot (b): weight bits per layer (y-axis 4 to 8) versus layer index, comparing stride=2 and stride=1 layers.]
(b) The agent allocates more weight bits to stride=2 layers, which compensates for the information loss from downsampling.
[Plot (c): weight bits per layer (y-axis 4 to 8) versus layer index, comparing depth-wise and point-wise layers.]
(c) The agent allocates more weight bits to depthwise layers than pointwise layers, because depthwise layers have fewer weights.
Fig. 10: We change only one dimension of the state vector, and run the actor network again to observe how the action changes
across different layers.
BitOPs-constrained experiments yield worse latency than the
hardware-in-the-loop experiments, although they can achieve
better accuracy when the constraint is tight. The reason is
that BitOPs and hardware latency are not linearly correlated:
similar BitOPs may correspond to very different latencies
when a layer is memory-bottlenecked. The agent chooses to
give more bits to depthwise layers, which have fewer FLOPs
but more memory access (and therefore higher latency on
edge hardware).
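The gap between BitOPs and latency can be illustrated with a roofline-style estimate (Williams et al. 2009). The sketch below is a toy model, not the BISMO simulator: the peak throughput, the memory bandwidth, the two layer shapes, and all function names are hypothetical, and BitOPs is assumed here to be MACs times weight bits times activation bits.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    macs: int        # multiply-accumulate operations
    n_weights: int   # number of weight values
    n_acts: int      # input plus output activation values
    w_bits: int
    a_bits: int


# Hypothetical accelerator parameters (not measured on any real device).
PEAK_BITOPS_PER_S = 2e12       # bit-level operations per second
BANDWIDTH_BITS_PER_S = 16e9    # off-chip memory bandwidth in bits per second


def bitops(layer: Layer) -> int:
    """Proxy cost, assumed to be MACs * weight bits * activation bits."""
    return layer.macs * layer.w_bits * layer.a_bits


def memory_bits(layer: Layer) -> int:
    """Bits moved to and from memory for weights and activations."""
    return layer.n_weights * layer.w_bits + layer.n_acts * layer.a_bits


def latency_s(layer: Layer) -> float:
    """Roofline estimate: the slower of compute and memory traffic dominates."""
    compute = bitops(layer) / PEAK_BITOPS_PER_S
    memory = memory_bits(layer) / BANDWIDTH_BITS_PER_S
    return max(compute, memory)


# Illustrative shapes: 3x3 depthwise, 512 channels, 28x28 feature map.
depthwise = Layer("depthwise", macs=28 * 28 * 512 * 9, n_weights=512 * 9,
                  n_acts=2 * 28 * 28 * 512, w_bits=6, a_bits=6)
# Illustrative shapes: 1x1 pointwise, 72 -> 64 channels, 28x28 feature map.
pointwise = Layer("pointwise", macs=28 * 28 * 72 * 64, n_weights=72 * 64,
                  n_acts=28 * 28 * (72 + 64), w_bits=6, a_bits=6)

for layer in (depthwise, pointwise):
    print(layer.name, bitops(layer), f"{latency_s(layer) * 1e6:.1f} us")
```

With these made-up numbers both layers have identical BitOPs, yet the depthwise layer is limited by memory traffic and comes out several times slower, which is the kind of mismatch a BitOPs constraint cannot see.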
5.4 Performance on Large Model
We evaluate our framework on the larger ResNet-50 model
with the same search scheme and fine-tuning policy as in
Table 3, using the BISMO edge hardware simulator as the
hardware platform. Table 10 shows that the improvement of
HAQ is not remarkable on a large model such as ResNet-50.
The reason is that ResNet-50 is highly redundant: even PACT
can already quantize it to 4 bits without much accuracy loss.
[Figure 11 diagram: a layer's state vector (id, in, out, k, s, f, p, dw, w, a) is passed to the agent twice, once with s=1 and once with s=2. Only the stride embedding changes and the other dimensions are kept the same; the action changes from 5 bits to 6 bits (labeled "Increased").]
Fig. 11: Policy interpretation. From the model, we first select
several layers; then, we change (flip) only one dimension in
the state vector; finally, we run our RL agent’s actor network
again to see how that particular factor affects its decision.
Method    Weights   Activations  Acc.-1  Acc.-5  Latency
PACT      2 bits    4 bits       74.06   91.78    80.03 ms
Ours      flexible  flexible     74.42   91.92    80.38 ms
PACT      3 bits    3 bits       74.73   92.11    85.49 ms
Ours      flexible  flexible     74.98   92.37    84.97 ms
PACT      4 bits    4 bits       76.17   93.03   128.55 ms
Ours      flexible  flexible     76.22   93.15   129.55 ms
Original  8 bits    8 bits       76.64   93.26   446.96 ms
Table 10: Latency-constrained quantization on BISMO (for ResNet-50 on ImageNet).
Therefore, there is not much room for HAQ to further im-
prove it.
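To make "not much room" concrete, the short script below recomputes the margins directly from the Table 10 rows. It is only arithmetic on the reported numbers; the variable and function names are illustrative.

```python
# Rows copied from Table 10: (method, top-1 %, top-5 %, latency in ms),
# grouped as (PACT baseline, HAQ result) at a matched latency budget.
pairs = [
    (("PACT", 74.06, 91.78, 80.03), ("Ours", 74.42, 91.92, 80.38)),
    (("PACT", 74.73, 92.11, 85.49), ("Ours", 74.98, 92.37, 84.97)),
    (("PACT", 76.17, 93.03, 128.55), ("Ours", 76.22, 93.15, 129.55)),
]
ORIGINAL_LATENCY_MS = 446.96  # 8-bit original model


def compare(pact, ours):
    """Return (top-1 gain of HAQ over PACT, HAQ speedup over the 8-bit model)."""
    top1_gain = round(ours[1] - pact[1], 2)
    speedup = round(ORIGINAL_LATENCY_MS / ours[3], 2)
    return top1_gain, speedup


results = [compare(pact, ours) for pact, ours in pairs]
for (pact, ours), (gain, speedup) in zip(pairs, results):
    print(f"latency ~{ours[3]:.0f} ms: top-1 gain {gain:+.2f}, "
          f"speedup vs. 8-bit {speedup:.2f}x")
```

The top-1 gains over PACT come out to 0.36, 0.25, and 0.05 points, shrinking as the latency budget loosens.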
5.5 Policy Interpretation
In Section 4, we provided intuitive explanations of our agent’s
policies. In this section, we quantitatively interpret our agent’s
quantization policy. As illustrated in Figure 11, we first select
several layers from MobileNet-V2; then, for each layer, we
change (flip) only one dimension of the state vector (in the
example, changing the convolution stride from 1 to 2); finally,
we run a forward pass of our RL agent’s actor network again
to see how that particular factor (e.g., depthwise, downsample)
affects its decision.
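The probing procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the actor here is a random-weight MLP standing in for the trained DDPG actor, the layer states are random placeholders, the field names follow the labels in Figure 11, and the action-to-bitwidth mapping and the normalization of the stride value are assumed for the example.

```python
import numpy as np

# State dimensions as labeled in Figure 11.
FIELDS = ["id", "in", "out", "k", "s", "f", "p", "dw", "w", "a"]
B_MIN, B_MAX = 2, 8  # assumed bitwidth range for this sketch

rng = np.random.default_rng(0)
# Stand-in for a trained actor: a tiny MLP with random weights.
W1 = rng.normal(size=(len(FIELDS), 16))
W2 = rng.normal(size=(16, 1))


def actor(state: np.ndarray) -> float:
    """Return a continuous action in (0, 1) for one layer state."""
    hidden = np.tanh(state @ W1)
    logit = (hidden @ W2).item()
    return 1.0 / (1.0 + np.exp(-logit))


def to_bits(action: float) -> int:
    """Map a continuous action to an integer bitwidth in [B_MIN, B_MAX]."""
    bits = int(np.round(B_MIN - 0.5 + action * (B_MAX - B_MIN + 1)))
    return int(np.clip(bits, B_MIN, B_MAX))


def flip(state: np.ndarray, field: str, value: float) -> np.ndarray:
    """Copy the state and overwrite exactly one dimension."""
    flipped = state.copy()
    flipped[FIELDS.index(field)] = value
    return flipped


def probe(states: np.ndarray, field: str, old: float, new: float):
    """Bitwidths per layer with `field` forced to `old`, then to `new`."""
    before = [to_bits(actor(flip(s, field, old))) for s in states]
    after = [to_bits(actor(flip(s, field, new))) for s in states]
    return before, after


# Hypothetical normalized states for five layers.
states = rng.uniform(0.0, 1.0, size=(5, len(FIELDS)))
# Stride 1 vs. stride 2, divided by 2 (the normalization is an assumption).
before, after = probe(states, "s", old=0.5, new=1.0)
for layer, (b0, b1) in enumerate(zip(before, after)):
    print(f"layer {layer}: stride=1 -> {b0} bits, stride=2 -> {b1} bits")
```

Because the weights are random, the printed bitwidths carry no meaning; with a trained actor, the same loop yields the per-layer comparison that is plotted for each flipped factor.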
From Figure 10, we can clearly observe that some factors
affect the actions. For instance, if we change only the stride
embedding of each layer in MobileNet-V2, our agent allocates
more activation bits to downsample layers (stride=2), and this
effect is more pronounced in deep layers. As for weight bits,
our agent also allocates more bits to downsample layers.
Moreover, if we change only the depthwise/pointwise embedding,
pointwise layers are allocated fewer weight bits. A likely
reason is that pointwise layers are more computation-bound,
so fewer weight bits directly reduce the computation cost.
6 Conclusion
In this paper, we propose Hardware-Aware Automated Quantization
(HAQ), an automated framework for quantization that requires
neither domain experts nor rule-based heuristics. We provide a
learning-based method that searches for the quantization policy
with hardware feedback. Compared with indirect proxy signals,
our framework can offer a specialized quantization solution for
each hardware platform. Extensive experiments demonstrate that
our framework performs better than conventional rule-based
approaches under multiple objectives: latency, energy, and model
size. Our framework reveals that the optimal policies on different
hardware architectures are drastically different, and we interpret
the implications of those policies. We believe these insights will
inspire future software and hardware co-design for efficient
deployment of deep neural networks.
Acknowledgements
We thank NSF Career Award #1943349, MIT-IBM Watson
AI Lab, Samsung, SONY, Xilinx, TI and AWS for support-
ing this research.
References
Apple (2018) Apple describes 7nm A12 bionic
chips. URL http://www.eenewsanalog.com/news/apple-describes-7nm-a12-bionic-chip/page/0/1 2
Cai H, Yang J, Zhang W, Han S, Yu Y (2018) Path-Level Network
Transformation for Efficient Architecture Search. In: ICML 3
Cai H, Zhu L, Han S (2019) ProxylessNAS: Direct Neural Architecture
Search on Target Task and Hardware. In: ICLR 2, 10
Choi J, Wang Z, Venkataramani S, Chuang PIJ, Srinivasan V,
Gopalakrishnan K (2018) PACT: Parameterized Clipping Activa-
tion for Quantized Neural Networks. arXiv 2, 6, 9
Chollet F (2017) Xception - Deep Learning with Depthwise Separable
Convolutions. In: CVPR 5, 6
Courbariaux M, Hubara I, Soudry D, El-Yaniv R, Bengio Y (2016)
Binarized Neural Networks: Training Deep Neural Networks with
Weights and Activations Constrained to +1 or -1. arXiv 3
Deng J, Dong W, Socher R, Li LJ, Li K, Li FF (2009) ImageNet - A
large-scale hierarchical image database. In: CVPR 6
Han S (2017) Efficient Methods and Hardware for Deep Learning. PhD
thesis 10
Han S, Mao H, Dally W (2016) Deep Compression: Compressing Deep
Neural Networks with Pruning, Trained Quantization and Huff-
man Coding. In: ICLR 2, 3, 9, 10
He K, Zhang X, Ren S, Sun J (2016) Deep Residual Learning for Image
Recognition. In: CVPR 2
He Y, Zhang X, Sun J (2017) Channel pruning for accelerating very
deep neural networks. In: ICCV 3
He Y, Lin J, Liu Z, Wang H, Li LJ, Han S (2018) AMC: AutoML
for Model Compression and Acceleration on Mobile Devices. In:
ECCV 3, 10
Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand
T, Andreetto M, Adam H (2017) MobileNets: Efficient Convolu-
tional Neural Networks for Mobile Vision Applications. arXiv 2,
6
Imagination (2018) PowerVR neural network accelerator. URL
https://www.imgtec.com/vision-ai/powervr-series2nx/powervr-ax2145-nna/ 2
Jacob B, Kligys S, Chen B, Zhu M, Tang M, Howard AG, Adam H,
Kalenichenko D (2018) Quantization and Training of Neural Net-
works for Efficient Integer-Arithmetic-Only Inference. In: CVPR
2, 3
Kingma D, Ba J (2015) Adam - A Method for Stochastic Optimization.
In: ICLR 6
Krishnamoorthi R (2018) Quantizing deep convolutional networks for
efficient inference - A whitepaper. arXiv 3
Lillicrap T, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D,
Wierstra D (2016) Continuous control with deep reinforcement
learning. In: ICLR 3, 5
Liu C, Zoph B, Neumann M, Shlens J, Hua W, Li LJ, Fei-Fei L, Yuille
A, Huang J, Murphy K (2018) Progressive Neural Architecture
Search. In: ECCV 3
Liu Z, Li J, Shen Z, Huang G, Yan S, Zhang C (2017) Learning efficient
convolutional networks through network slimming. In: ICCV 3, 7
Nvidia (2018) Nvidia tensor cores. URL
https://www.nvidia.com/en-us/data-center/tensorcore/ 2
Pham H, Guan MY, Zoph B, Le QV, Dean J (2018) Efficient Neural
Architecture Search via Parameter Sharing. In: ICML 3
Rastegari M, Ordonez V, Redmon J, Farhadi A (2016) XNOR-Net -
ImageNet Classification Using Binary Convolutional Neural Net-
works. In: ECCV 3
Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC (2018) Mo-
bileNetV2: Inverted Residuals and Linear Bottlenecks. In: CVPR
2, 6, 7
Sharma H, Park J, Suda N, Lai L, Chau B, Chandra V, Esmaeilzadeh H
(2018) Bit fusion: Bit-level dynamically composable architecture
for accelerating deep neural network. In: ISCA 2, 6, 9
Umuroglu Y, Rasnayake L, Sjalander M (2018) Bismo: A scalable bit-
serial matrix multiplication overlay for reconfigurable computing.
In: FPL 2, 6
Williams S, Waterman A, Patterson D (2009) Roofline: an insightful
visual performance model for multicore architectures. Communi-
cations of the ACM 52(4):65–76 7
Xilinx (2018a) Ultrascale architecture and product data sheet: Overview. URL
https://www.xilinx.com/support/documentation/data_sheets/ds890-ultrascale-overview.pdf 6
Xilinx (2018b) Zynq-7000 soc data sheet: Overview. URL
https://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf 6
Yang TJ, Chen YH, Sze V (2016) Designing energy-efficient convolu-
tional neural networks using energy-aware pruning. arXiv 3
Yang TJ, Howard A, Chen B, Zhang X, Go A, Sandler M, Sze V, Adam
H (2018) Netadapt: Platform-aware neural network adaptation for
mobile applications. In: ECCV 3
Zhou A, Yao A, Wang K, Chen Y (2018) Explicit loss-error-aware
quantization for low-bit deep neural networks. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pp 9426–9435 3
Zhou S, Ni Z, Zhou X, Wen H, Wu Y, Zou Y (2016) DoReFa-Net -
Training Low Bitwidth Convolutional Neural Networks with Low
Bitwidth Gradients. arXiv 3
Zhu C, Han S, Mao H, Dally W (2017) Trained Ternary Quantization.
In: ICLR 2, 3
Zoph B, Le QV (2017) Neural Architecture Search with Reinforcement
Learning. In: ICLR 3
