6.7ms on Mobile with over 78% ImageNet Accuracy: Unified Network Pruning
  and Architecture Search for Beyond Real-Time Mobile Acceleration by Li, Zhengang et al.
6.7ms on Mobile with over 78% ImageNet Accuracy: Unified
Network Pruning and Architecture Search for Beyond
Real-Time Mobile Acceleration
Zhengang Li1, Geng Yuan1, Wei Niu2, Yanyu Li1, Pu Zhao1, Yuxuan Cai1, Xuan Shen1, Zheng Zhan1,
Zhenglun Kong1, Qing Jin1, Zhiyu Chen3, Sijia Liu4, Kaiyuan Yang3, Bin Ren2, Yanzhi Wang1, Xue Lin1
1Northeastern University, 2College of William and Mary,
3Rice University, 4MIT-IBM Watson AI Lab, IBM Research
{li.zhen, yuan.geng, yanz.wang, xue.lin}@northeastern.edu
Abstract—With the increasing demand to efficiently deploy
DNNs on mobile edge devices, it becomes much more important
to reduce unnecessary computation and increase the execution
speed. Prior methods towards this goal, including model com-
pression and network architecture search (NAS), are largely
performed independently, and do not fully consider compiler-
level optimizations which is a must-do for mobile acceleration. In
this work, we first propose (i) a general category of fine-grained
structured pruning applicable to various DNN layers, and (ii) a
comprehensive, compiler automatic code generation framework
supporting different DNNs and different pruning schemes, which
bridge the gap of model compression and NAS. We further
propose NPAS, a compiler-aware unified network pruning and
architecture search. To deal with large search space, we propose
a meta-modeling procedure based on reinforcement learning with
fast evaluation and Bayesian optimization, ensuring the total
number of training epochs comparable with representative NAS
frameworks. Our framework achieves 6.7ms, 5.9ms, and 3.9ms
ImageNet inference times with 78.2%, 75% (MobileNet-V3 level),
and 71% (MobileNet-V2 level) Top-1 accuracy respectively on
an off-the-shelf mobile phone, consistently outperforming prior
work.
I. INTRODUCTION
The growing popularity of mobile AI applications and the
demand for real-time Deep Neural Network (DNN) executions
raise significant challenges for DNN accelerations. However,
the ever-growing size of DNN models causes intensive com-
putation and memory cost, which impedes the deployment on
resource limited mobile devices.
DNN weight pruning [21], [27], [28], [54], [71] has been
proved as an effective model compression technique that can
remove redundant weights of the DNN models, thereby re-
ducing storage and computation costs simultaneously. Existing
work mainly focus on unstructured pruning scheme [21], [24],
[46] where arbitrary weight can be removed, and (coarse-
grained) structured pruning scheme [45], [50], [54], [82], [84],
[85] to eliminate whole filters/channels. The former results in
high accuracy but limited hardware parallelism (and accelera-
tion), while the latter is the opposite. Another active research
area is the Neural Architecture Search (NAS) [86], which
designs more efficient DNN architectures using automatic
searching algorithms. EfficientNet [69] and MobileNetV3 [30]
are representative lightweight networks obtained by using NAS
approaches. Recently, hardware-aware NAS [8], [33], [68],
[73] has been investigated targeting acceleration on actual
hardware platforms.
Different from the prior work on coarse-grained pruning
and NAS that find a smaller, yet regular, DNN structure,
recent work [16], [48], [58] propose to prune the weights in a
more fine-grained manner, e.g., assigning potentially different
patterns to kernels. Higher accuracy can be achieved as a result
of the intra-kernel flexibility, while high hardware parallelism
(and mobile inference acceleration) can be achieved with the
assist of compiler-level code generation techniques [58]. This
work reveals a new dimension of optimization: With the aid of
advanced compiler optimizations, it is possible to achieve high
accuracy and high acceleration simultaneously by injecting a
proper degree of fine granularity in weight pruning. Despite
the promising results, pattern-based pruning [48], [58] is only
applied to 3×3 convolutional (CONV) layers, which limits the
applicability.
As the first contribution, we propose a general category of
fine-grained structured pruning schemes that can be applied to
various DNN layers, i.e., block-punched pruning for CONV
layers with different kernel sizes, and block-based pruning
for FC layers. We develop a comprehensive, compiler-based
automatic code generation framework supporting the proposed
pruning schemes in a unified manner, supporting other types of
pruning schemes, and different schemes for different layers. We
show (i) the advantage of the proposed fine-grained structured
pruning in both accuracy and mobile acceleration, and (ii) the
superior end-to-end acceleration performance of our compiler
framework on both dense (before pruning) and sparse DNN
models.
While our compiler optimizations provide notable mobile
acceleration and support of various sparsity schemes, it in-
troduces a much larger model optimization space: Different
kernel sizes (1×1, 3×3, etc.) result in different acceleration
performances under compiler optimizations, so do different
sparsity schemes. Thus, it is desirable to perform a compiler
aware, joint network pruning and architecture search, deter-
mining the filter type and size, as well as pruning scheme and
rate, for each individual layer. The objective is to maximize






















mobile device. The DNN latency will be actually measured
on the target mobile device, thanks to the fast auto-tuning
capability of our compiler for efficient inference on different
mobile devices.
We develop the compiler-aware NPAS framework to fulfill
the above goal. It consists of three phases: (1) replacement of
mobile-unfriendly operations, (2) the core search process, and
(3) pruning algorithm search. The overall latency constraint
is satisfied through the synergic efforts of (i) incorporating
the overall DNN latency constraint into the automatic search
in Phase 2, and (ii) the effective search of pruning algo-
rithm and performing weight training/pruning accordingly.
As Phase 2 exhibits a larger search space than prior NAS
work, to perform efficient search, we propose a meta-modeling
procedure based on reinforcement learning (RL) with fast
evaluation and Bayesian optimization. This will ensure the
total number of training epochs comparable with representative
NAS frameworks.
Our key contributions include:
• We propose a general category of fine-grained structured
pruning applicable to various DNN layers, and a compre-
hensive, compiler code generation framework supporting
different pruning schemes. We bridge the gap between
model compression and NAS.
• We develop a compiler-aware framework of joint network
pruning and architecture search, maximizing accuracy
while satisfying inference latency constraint.
• We design a systematic search acceleration strategy, in-
tegrating pre-trained starting points, fast accuracy and
latency evaluations, and Bayesian optimization.
• Our NPAS framework achieves by far the best mobile
acceleration: 6.7ms, 5.9ms, and 3.9ms ImageNet infer-
ence times with 78.2%, 75%, and 71% Top-1 accuracy,
respectively, on an off-the-shelf mobile phone.
II. RELATED WORKS
A. Network Pruning
Existing weight pruning research can be categorized accord-
ing to pruning schemes and pruning algorithms.
Pruning Scheme: Previous weight pruning work can be cat-
egorized into multiple major groups according to the pruning
scheme: unstructured pruning [21], [24], [51], coarse-grained
structured pruning [17], [26], [29], [39], [46], [47], [71], [79],
[81], and pattern-based pruning [48], [49], [58].
Unstructured pruning (Fig. 1 (a) and (b)) removes weights
at arbitrary position. Though it can significantly decrease the
number of weights in DNN model as a fine-grained pruning
scheme, the resulted sparse and irregular weight matrix with
indices damages the parallel implementations and results in
limited acceleration on hardware platforms.
To overcome the limitation in unstructured, irregular weight
pruning, many work [17], [26], [29], [39], [44], [46], [47],
[71], [79], [81] studied the coarse-grained structured pruning at
the level of filters and channels as shown in Fig. 1 (c) and (d).
With the elimination of filters or channels, the pruned model
still maintains the network structure with high regularity which
can be parallelized on hardware. The downside of coarse-
grained structured pruning is the obvious accuracy degradation
by removing the whole filters/channels, which limits model
compression rate.
Fig. 1 (e) shows the pattern-based pruning [48], [49], [58]
as a representative fine-grained structured pruning scheme. It
assigns a pattern (from a predefined library) to each CONV
kernel, maintaining a fixed number of weights in each kernel.
As shown in the figure, each kernel reserves 4 non-zero
weights (on a pattern) out of the original 3×3 kernels. Besides
being assigned a pattern, a kernel can be completely removed
to achieve higher compression rate. Pattern-based pruning can
simultaneously achieve high accuracy (thanks to the structural
flexibility) and high inference acceleration with the aid of
compiler-based executable code generation. Note that com-
piler support [58] is necessary for pattern-based pruning to
deliver its promise on mobile acceleration. A limitation is that
pattern-based pruning is limited to 3×3 CONV layers in cur-
rent work: 5×5 or larger kernel size results in a large number
of pattern types, which incurs notable computation overheads
in compiler-generated executable codes. 1×1 CONV layers
and FC layers leave no space of designing different patterns
for a kernel.
Pruning Algorithm: Two main categories exist: heuris-
tic pruning algorithm [17], [21], [23], [47], [79] and
regularization-based pruning algorithm [26], [28], [29], [39],
[46], [71], [80], [81]. Heuristic pruning was firstly performed
in an iterative, magnitude-based manner on unstructured prun-
ing [23], and gets improved in later work [21]. Heuristic prun-
ing has also been incorporated into coarse-grained structured
pruning [17], [47], [79].
Regularization-based algorithm uses mathematics-oriented
method to deal with the pruning problem. Early work [29],
[71] incorporates `1 or `2 regularization in loss function to
solve filter/channel pruning problems. Later work [26] makes
the regularization penalty “softer” which allows the pruned
filters to be updated during the training procedure. In [39],
[81], an advanced optimization solution framework ADMM
(Alternating Direction Methods of Multipliers) is utilized to
achieve dynamic regularization penalty which significantly
reduces accuracy loss. In [28], Geometric Median is proposed
to conduct filter pruning.
B. Neural Architecture Search (NAS)
In general, NAS can be classified into the following cate-
gories by its searching strategy. Reinforcement Learning (RL)
methods [3], [7], [41], [59], [83], [86], [87] employ Recurrent
Neural Network (RNN) as predictor, with parameters updated
by the accuracy of child network validated over a proxy
dataset. Evolution methods [18], [42], [53], [61], [62], [74],
[74] develop a pipeline of parent initialization, population
updating, generation and elimination of offsprings.
One-shot NAS [4], [6], [12], [22], [78] trains a large one-
shot model containing all operations and shares the weight














2-D Weight Matrix Format4-D Weight Tensor Format
(a) (b)
ch 1 ch 2 ch n













ch 1 ch 2 ch n
CONV Layer FC Layer
2-D Weight Matrix Format
Pattern-based
pruning








































Fig. 1. Different weight pruning schemes for CONV and FC layers using 4D tensor and 2D matrix representation.
[8], [10], [13], [19], [43], [73], [76] propose a differentiable
algorithm distinct from prior discrete search, reducing search-
ing cost while still getting comparable results. Bayesian opti-
mization [5], [15], [34], [52], [63], [72] uses optimal transport
program to compute the distance of network architectures.
Some recent work realize the importance of hardware co-
design and incorporate the inference latency into NAS, which
is more accurate than the intuitive volume estimation like Mul-
tiply–Accumulate operations (MACs) [8], [68], [73]. MnasNet
[68] utilizes latency on mobile device as the reward to perform
RL search, where gradient-based NAS work FBNet [73] and
ProxylessNAS [8] add a latency term to the loss function.
However, none of these hardware-targeting work fully exploit
the potential of compiler optimizations or satisfy an overall
latency requirement, not to mention accounting for compiler-
supported sparse models. This motivates us to investigate
another dimension of model optimization, that is, compiler-
aware, latency-constrained, architecture and pruning co-search.
C. Compiler-assisted DNN Frameworks on Mobile
Recently, mobile-based, compiler-assisted DNN execution
frameworks [25], [32], [37], [38], [75], [77] have drawn broad
attention from both industry and academia. TensorFlow-Lite
(TFLite) [1], Alibaba Mobile Neural Network (MNN) [2],
and TVM [9] are representative state-of-the-art DNN inference
frameworks. Various optimization techniques, such as varied
computation graph optimizations and half-float support, have
been employed to accelerate the DNN inference on mobile
devices (mobile CPU and GPU). Recent work PatDNN [58]
and PCONV [48] employ a set of compiler-based optimiza-
tions to support specific pattern-based sparse DNN models
to accelerate the end-to-end inference on mobile devices.
However, the lack of support for different types of layers (e.g.,
1×1 CONV, 5×5 CONV, and FC) limits the versatility of such
framework.
III. PROPOSED FINE-GRAINED STRUCTURED PRUNING
Pattern-based pruning scheme [48], [49], [58], as mentioned
in Section II-A, reveals a new optimization dimension of fine-
grained structured pruning that can achieve high accuracy
and high inference acceleration simultaneously with the assist
of compiler optimizations. As pattern-based pruning is only
applicable to 3 × 3 CONV layers, we propose a general
category of fine-grained structured pruning scheme that can
be applied to various DNN layers: block-based pruning for
FC layers and block-punched pruning for CONV layers with
different kernel sizes.
Block-based Pruning: Fig. 1 (g) shows the block-based
pruning scheme in 2D weight matrix format for FC layers.
The entire weight matrix is divided into a number of equal-
sized blocks, then the entire column(s) and/or row(s) of
weights are pruned within each block. Compared to the coarse-
grained structured pruning, block-based pruning provides a
finer pruning granularity to better preserve the DNN model
accuracy. With an appropriate block size selected, the remain-
ing computation within a block can still be parallelized on
mobile device with the help of compiler. As a result, block-
based pruning can achieve comparable hardware (inference)
performance as coarse-grained structured pruning, under the
same overall pruning rate.
Block-punched Pruning: The CONV layers prefer the
tensor-based representation and computation rather than
matrix-based computation used for FC layers. Inspired by
block-based pruning, we develop block-punched pruning
scheme tailored for CONV layers, which can be accelerated
using the same compiler optimizations. As shown in Fig. 1 (f),
block-punched pruning requires pruning a group of weights
at the same location of all filters and all channels within a
block to leverage hardware parallelism from both memory
and computation perspectives. With effective compiler-level
executable code generation, high hardware parallelism (and
inference acceleration on mobile) can also be achieved.
Compiler Optimizations: We develop a comprehensive,
compiler-based automatic code generation framework sup-
porting the proposed (block-punched/block-based) pruning
schemes in a unified manner. It also supports other pruning
schemes such as unstructured, coarse-grained, pattern-based
pruning. In fact, unstructured and coarse-grained structured
pruning schemes are just special cases of block-punched
pruning, the former with block size 1× 1 and the latter with
block size of the whole weight tensor/matrix. A novel layer
fusion technique is developed, which is critical to the efficient
implementation of super-deep networks. Fast auto-tuning ca-

























Fig. 2. Accuracy vs. Latency with different block sizes on ImageNet using
ResNet-50 under uniform 6× pruning rate.
Sample Results and Block Size Determination:
Fig. 2 shows example results of the accuracy vs. latency
when applying block-punched pruning on ResNet-50 with dif-
ferent block sizes. A uniform pruning rate (i.e., 6×) and block
size are adopted through all layers. Under the same pruning
rate, unstructured pruning (i.e., 1×1 block size) preserves the
highest accuracy but has the worst performance in latency.
On the contrary, coarse-grained structured pruning (i.e., whole
weight matrix as a block) achieves the lowest latency but with
a severe accuracy degradation. The results of block-punched
pruning show high accuracy and high inference speed (low
latency) simultaneously.
The reason is that the maximum hardware parallelism is
limited by computation resources. Thus, even when dividing
the weights into blocks, each block’s remaining weights are
still sufficient to fulfill on-device hardware parallelism, espe-
cially on resource-limited mobile devices.
One reasonable block size determination strategy is to let the
number of channels contained in each block match the length
of the vector register (e.g., 4) on the target mobile CPU/GPU
to ensure high parallelism. Then determine the number of
filters to be contained (e.g., 8) by considering the given design
targets.
IV. MOTIVATION OF COMPILER-AWARE UNIFIED
OPTIMIZATION FRAMEWORK
Our compiler optimizations provide notable acceleration of
different filter types, and support for various sparsity schemes.
A key observation is that different filter types and sparsity
schemes have different acceleration performance under com-
piler optimizations (when computation (MACs) is the same).
The following are measured on mobile CPU (Qualcomm Kryo
485) of a Samsung Galaxy S10 phone.
Different Filter Types (Kernel Sizes): Fig. 3 (a) shows
the latency vs. computation (MACs) of a CONV layer with
different kernel sizes. We fix the input feature map to 56×56
and change the number of filters. Under the same computation,
3×3 kernels achieve the best performance, where the 1×1 ker-
nels are the second. Because 3×3 kernels can be accelerated
using Winograd algorithm, and makes it the most compiler-
friendly; while 1×1 kernels result in no input redundancy
in GEMM computation, which also relieves the burden on
compiler optimizations.
Different Pruning Schemes: Fig. 3 (b) shows the compu-
tation speedup vs. pruning rate of a 3×3 CONV layer with
different pruning schemes. We choose the input feature map
size of 56×56 and 256 input and output channels. We can
observe that, with compiler optimizations, fine-grained prun-
ing schemes (i.e., pattern-based and block-punched pruning)
consistently outperform the unstructured pruning and achieve
comparable acceleration compared to the coarse-grained struc-
tured pruning below 5× pruning. Since, under reasonable
pruning rate of fine-grained structured pruning schemes, the
remaining weights in each layer are still sufficient to fully
utilize hardware parallelism.
Impact of Number of Layers: The number of computation
layers is another critical factor that affects inference latency.
To show the impact, we make a narrower-but-deeper version
of ResNet-50 by doubling the number of layers, while keeping
computation MACs the same as the original ResNet-50. And
the inference speed of the narrower-but-deeper version is
1.22× slower than the original one using mobile GPU (44ms
vs. 36ms). The main reason is that a larger number of layers
introduce more intermediate results and hence more frequent
data access to the main memory. And the mobile CPU/GPU
cannot be fully utilized due to a large number of memory-
intensive layers.
Based on the above observations, it is desirable to perform
a compiler-aware network pruning and architecture search,
determining the filter type and size, as well as pruning scheme
and rate for each individual layer. The objective is to maximize
DNN accuracy satisfying an inference latency constraint when






























































50 100 150 200
(b)
Fig. 3. (a) Latency vs. Computation with different filter types, (b) speedup
vs. pruning rate with different pruning schemes.
V. PROPOSED UNIFIED NETWORK PRUNING AND
ARCHITECTURE SEARCH (NPAS) ALGORITHM
A. Overview of NPAS Framework
Fig. 4 shows the proposed NPAS framework. To take
advantage of recent NAS results and accelerate the NPAS
process, we start from a pre-trained DNN model, and go
through three phases as shown in the figure.
Phase 1: Replacement of Mobile-Unfriendly Operations:
Certain operators are inefficient to execute on mobile de-
vices (mobile CPU and GPU). For instance, certain activa-
tion functions, such as sigmoid, swish, require exponential
computation, and can become latency bottleneck on mobile
inference. These unfriendly operations will be replaced by
compiler-friendly alternatives such as hard-sigmoid and hard-
swish, with negligible effect on accuracy.
Phase 2: NPAS Scheme Search: This phase generates
and evaluates candidate NPAS schemes, defined by the col-
lection of per-layer filter types, per-layer pruning schemes
and rates, and finally chooses the best-suited one. As per-



































Fig. 4. Overview of the proposed NPAS framework.
2 exhibits a much larger search space than prior NAS, which
renders representative NAS algorithms like RL-based ones
ineffective. To accelerate such search, we present a meta-
modeling procedure based on RL with Bayesian Optimization
(BO), with details in Section V-B. A fast accuracy evaluation
method is developed, tailored to NPAS framework.
Moreover, we incorporate the overall DNN latency con-
straint effectively in the reward function of NPAS scheme
search, ensuring that such constraint can be satisfied at the
search outcome. The overall DNN latency is actually measured
on the target mobile CPU/GPU based on the candidate NPAS
scheme currently under evaluation. We rely on actual measure-
ment instead of per-layer latency modeling as many prior NAS
work. This is because our advanced compiler optimizations
incorporate a strong layer fusion beyond prior compiler work,
which is critical for efficient implementation of super-deep
networks, and will make per-layer latency modeling less
accurate.
Phase 3: Pruning Algorithm Search: The previous phase
has already determined the per-layer pruning schemes and
rates, so that the compiler-generated codes can satisfy the
overall latency constraint. The remaining task of this phase is
to search for the most desirable pruning algorithm to perform
actual pruning and train the remaining weights1. As the per-
layer pruning rates are already determined, the candidate
pruning algorithms to select from are limited to those with
pre-defined per-layer pruning rates, including magnitude-based
ones [20], [23], ADMM-based algorithm [39], [81], etc. As
an extension over prior work, we generalize these algorithms
to achieve different sparsity schemes with the help of group-
Lasso regularization [35], [71]. In Phase 3, we compare the
resulted DNN accuracy from the candidate pruning algorithms
in a few epochs, select the one with the highest accuracy, and
continue a best-effort algorithm execution to derive the final
DNN model and compiled codes.
B. Details of Phase 2: NPAS Scheme Search
TABLE I
NPAS SEARCH SPACE FOR EACH DNN LAYER
Filter
type
{1×1, 3×3, 3×3 DW & 1×1,
1×1 & 3×3 DW & 1×1, skipping} 1
Pruning
scheme
{Filter [85], Pattern-based [58],
Block-punched/block-based}
Pruning rate { 1×, 2×, 2.5×, 3×, 5×, 7×, 10× }
1 & denotes cascade connection.
1) Search Space of NPAS in Phase 2: Per-layer filter
types: As different filter types (kernel sizes) have different
acceleration performance under compiler optimizations, the
NPAS search space includes replacing the original filter type
with 1×1, 3×3, a cascade of 3×3 depth-wise (DW) and
1×1 convolutions, a cascade of 1×1 and 3×3 DW and 1×1
1The above process cannot be accomplished by the fast accuracy evaluation
in Phase 2 as we need to limit the number of training epochs.
convolutions, or directly skipping the entire layer. The first
two are most preferable with compiler optimizations (please
refer to Section IV), and the cascade connection is shown in
prior work [31], [64] to provide the same accuracy with less
computation.
Per-layer pruning schemes:
The NPAS agent can choose from filter (channel) pruning
[85], pattern-based pruning [58] and block-punched/based
pruning for each layer. As different layers may have different
compatible pruning schemes, we allow the NPAS the flexibility
to choose different pruning schemes for different layers. This
is well supported by our compiler code generation.
Per-layer pruning rate: We can choose from the list
{1×, 2×, 2.5×, 3×, 5×, 7×, 10×} (1× means no pruning).
2) Q-Learning Training Procedure: As per-layer pruning
scheme and rate is integrated in NPAS scheme search, the
search space is beyond that of conventional NAS. To en-
sure fast search, we employ the RL algorithm Q-learning
as the base technique, assisted by fast evaluation (Sec-
tion V-B3) and Bayesian optimization (BO) (Section V-B4)
for search speedup. The Q-learning algorithm consists of
an NPAS agent, states and a set of actions. For the state
of the i-th layer in a given DNN, it is defined as a tu-
ple of filter type, pruning scheme, and pruning rate i.e.,
{filter typei, pruning schemei, pruning ratei}, and
each can be selected from the corresponding search space. We
add the layer depth to the state space to constrict the action
space such that the state-action graph is directed and acyclic
(DAG). For action space, we allow transitions for a state with
layer depth i to a state with layer depth i + 1, ensuring that
there are no loops in the graph. This constraint ensures that the
state-action graph is always a DAG. When layer depth reaches
the maximum layer depth, the transition terminates. Based on
above-defined state s ∈ S and action a ∈ A, we adopt Q-
learning procedure [70] to update Q-values. We specify final
and intermediate rewards as follows:




where V is the validation accuracy of the model, h is the model
inference speed or latency (actually measured on a mobile
device), and H is the threshold for the latency requirement.
Generally, rT is high when the model satisfies the real-
time requirement (h < H) with high evaluation accuracy.
Otherwise the final reward is small, especially when the
latency requirement is violated. For the intermediate reward rt
which is usually ignored by setting it to zero [3] as it cannot
be explicitly measured, the reward shaping [57] is employed
as shown above to speed up the convergence. Setting rt = 0
could make the Q-value of sT much larger than others in the
early stage of training, leading to an early stop of searching
for the agent. We adopt the ε-greedy strategy [55] to choose
actions. In addition, as the exploration space is large, the
experience replay technique is adopted for faster convergence
[40].
3) Fast Evaluation Methods: We develop and adopt multi-
ple tailored acceleration strategies to facilitate fast evaluation
in NPAS scheme search.
Unidirectional Filter Type Replacement: The NPAS
scheme search needs to satisfy a pre-defined DNN latency
constraint. Thus, we follow the principle of not increasing
kernel size to search per-layer filter type, which can effectively
reduce search space. For example, we will no longer search
the filter type for 1×1 layers in the original model.
Weight Initialization for Filter Type Candidates: The
weights of the filter type candidate operators in each layer
can be pre-trained before NPAS scheme search (Phase 2) very
quickly using reconstruction error, which can make them act
similarly to the original operations. As a result, the accuracy
evaluation process can be significantly accelerated.
One-shot Pruning and Early Stopping for Fast Accu-
racy Evaluation: During the accuracy evaluation process, we
follow the pruning scheme and pruning rate (for a specific
layer) in a candidate NPAS scheme, and conduct a one-shot
pruning (on the target layer) based on weight magnitude. This
straightforward pruning will result in accuracy degradation.
But after a couple of epochs of retraining, it can distinguish
the relative accuracy of different NPAS schemes.
Overlapping Compiler Optimization and Accuracy Eval-
uation: We use compiler code generation and actual on-device
latency measurement because of (i) higher accuracy than per-
layer latency modeling due to layer fusion mechanism, and (ii)
the fast auto-tuning capability of compiler to different mobile
devices. Please note that the compiler code generation and
latency measurement do not need the absolute weight values.
Compiler code generation is much faster than DNN training
(even a single epoch), and can be performed in parallel with
accuracy evaluation (as accurate weight values are not needed).
As a result, it will not incur extra time consumption to NPAS.
4) Bayesian Predictor for Reducing Evaluations: As per-
forming evaluation on a large amount of sampled NPAS
schemes is time-consuming, we build a predictor with BO
[11], [36], [67]. The NPAS agent generates a pool of NPAS
schemes. We first use BO to select a small number of NPAS
schemes with potentially high rewards from the pool. Then
the selected NPAS schemes are evaluated to derive more
accurate rewards. We reduce the evaluation of NPAS schemes
with possibly weak performance, thereby reducing the overall
scheme evaluation effort.
We build a predictor combining Gaussian process (GP)
with a Weisfeiler-Lehman subtree (WL) graph kernel [56],
[66] to handle the graph-like NPAS schemes. The WL kernel
compares two directed graphs in iterations. In the m-th WL
iteration, it first obtains the histogram of graph features φm(s)






where kbase is a base kernel and
we employ dot product here. The iterative procedure stops











where wm contains the weights for each WL iteration m,
which is set to equal for all m following [66]. The Expected
Improvement [60] is employed as the acquisition function in
the work. Algorithm 1 provides a summary.
Algorithm 1 Q-learning with Bayesian Predictor Algorithm
Input: Observation data D, BO batch size B, BO acquisition
function α(·)
Output: The best NPAS scheme s
for steps do
Generate a pool of candidate NPAS schemes Sc;
Select {ŝi}Bi=1 = argmaxs∈Sc α(s|D);
Evaluate the scheme and obtain reward {ri}Bi=1 of {ŝi}Bi=1;
Update Q values based on Q-learning with reward;
D ← D ∪ ({si}Bi=1, {ri}Bi=1);
Update GP of BO with D;
end for
VI. RESULTS AND EVALUATION
A. Experimental Setup
In this section, we use the image classification task and
ImageNet dataset [14] to show the effectiveness of our frame-
work. All training processes use the SGD optimizer with a
momentum rate set to 0.9 and weight decay set to 0.0005 and
use the batch size of 2048 per node. The starting learning rate
is set to 0.001, and the cosine learning rate scheduler is used
if not specified in our paper. For Phase 1, we conduct a fast
fine-tuning with 5 training epochs after replacing the mobile-
unfriendly operations (only once for the entire NPAS process).
In Phase 2, 40 Nvidia Titan RTX GPUs are used to conduct
the fast accuracy evaluation for candidate NPAS schemes
concurrently. Since we start from a well-trained model, we
retrain 2 epochs for each candidate one-shot pruned model
for fast evaluation. For each candidate model, we measure
100 runs of inference on target mobile devices and use the
average value as end-to-end latency.
In Phase 3, we search the most desirable pruning algorithm
including magnitude-based algorithm, ADMM-based algo-
rithm [39], [81] and geometric median-based algorithm [28]
(only for filter pruning). We adopt 100 epochs for weight
pruning and 100 epochs on remaining weights fine-tuning with
knowledge distillation [65].
The overall GPU days are varied based on pre-trained
network and are reduced thanks to our fast evaluation and
BO. For example, using EfficientNet-B0 as starting point, the
overall searching time is 15 days, where Phase 1 only takes 5
epochs, and Phase 3 takes 1.5 days.
B. Evaluation Results
In Fig. 5 and 6, we compare our accuracy and latency
results with representative DNN inference acceleration frame-
work MNN, PyTorch Mobile, and TFLite. Four dense DNN
models are used for the comparisons, which are MobileNet-
V3, EfficientNet-B0, shrunk versions of EfficientNet-B0 to
70% original computation and 50% original computation.
using mobile CPU (Qualcomm Kryo 485) or mobile GPU
(Qualcomm Adreno 640). PyTorch Mobile does not support






































(12.3, 75.2) (22.0, 75.2)(18.0, 75.2)





































Mobile GPU Latency (ms)































(7.4, 71.5) (12.1, 71.5)
(7.9, 75.2)
(16.0, 75.2)












17   18   19 
Fig. 6. Accuracy vs. Latency comparison on mobile GPU.
mobile GPU, so no corresponding results. EfficientNet-B0 is
used as our pretrained model.
First, without incorporating NPAS, one can observe that
our compiler optimizations can effectively speed up the same
DNN inference, up to 46% and 141% (on MobileNet-V3),
compared with the currently best framework MNN on mobile
CPU and GPU, respectively. The red star shapes in the figures
represent the NPAS generated results under different latency
constraints. Our NPAS results consistently outperform the
representative DNN models, and achieve the Pareto optimality
in terms of accuracy and inference latency. With the highest
accuracy (78.2% Top-1) among the reference models, the end-
to-end inference time of NPAS solution (385M MACs) is only
11.8ms and 6.7ms on mobile CPU and GPU, respectively.
With MobileNet-V3 level accuracy (75% Top-1), our inference
time (201M MACs) is 9.8ms and 5.9ms, respectively. With
MobileNet-V2 level accuracy (71% Top-1), the inference time
of NPAS solution (147M MACs) is 6.9ms and 3.9ms, respec-
tively. To the best of our knowledge, this is never accomplished
by any existing NAS or weight pruning work. Our models and
codes will be released upon publication.
Table II shows the model details, with representative hand-
crafted and hardware-aware NAS models as references. One
TABLE II
COMPARISON RESULTS OF NPAS AND REPRESENTATIVE LIGHTWEIGHT NETWORKS.
A. / P. Search Params CONV MACs Accuracy (Top-1/5) Latency (CPU/GPU) Device
MobileNet-V1 [31] N./N. 4.2M 575M 70.6 / 89.5 - / - -
MobileNet-V2 [64] N./N. 3.4M 300M 72.0 / 91.0 - / - -
MobileNet-V3 [30] Y./N. 5.4M 227M 75.2 / 92.2 - / - -
NAS-Net-A [87] Y./N. 5.3M 564M 74.0 / 91.3 183ms / NA Google Pixel 1
AmoebaNet-A [62] Y./N. 5.1M 555M 74.5 / 92.0 190ms / NA Google Pixel 1
MnasNet-A1 [68] Y./N. 3.9M 312M 75.2 / 92.5 78ms / NA Google Pixel 1
ProxylessNas-R [8] Y./N. NA NA 74.6 / 92.2 78ms / NA Google Pixel 1
NPAS (ours) Y./N. 5.3M 385M 78.2 / 93.9 11.8ms / 6.7ms Galaxy S10
NPAS (ours) Y./Y. 3.5M 201M 75.0 / 92.0 9.8ms / 5.9ms Galaxy S10
NPAS (ours) Y./Y. 3.0M 147M 70.9 / 90.5 6.9ms / 3.9ms Galaxy S10
NPAS (ours) Y./Y. 2.8M 98M 68.3 / 89.4 5.6ms / 3.3ms Galaxy S10
can observe the computation (MACs) reduction under the same
accuracy compared with the prior references, thanks to the
joint network pruning and search. One can also observe the
huge gap in latency compared with these prior work, as neither
of compiler optimizations nor compiler-aware optimizations
are accounted for. In fact, this is common, as the fastest
MobileNet-V3 inference we have found in NAS reference is
51ms on mobile CPU, while we found its inference time rang-
ing from 18ms to 29ms using baseline compiler frameworks
(Fig. 5). This gap is the reason we believe that compiler
optimizations and awareness will contribute significantly to
DNN accelerations.
VII. CONCLUSION
In this work, we propose (i) a fine-grained structured
pruning applicable to various DNN layers, and (ii) a compiler
automatic code generation framework supporting different
DNNs and different pruning schemes, which bridge the gap
of model compression and NAS. We further propose NPAS,
a compiler-aware unified network pruning and architecture





[3] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. De-
signing neural network architectures using reinforcement learning. In
International Conference on Learning Representations (ICLR), 2017. 2,
6
[4] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan,
and Quoc Le. Understanding and simplifying one-shot architecture
search. In International Conference on Machine Learning, pages 550–
559, 2018. 2
[5] James Bergstra, Daniel Yamins, and David Cox. Making a science of
model search: Hyperparameter optimization in hundreds of dimensions
for vision architectures. In International conference on machine learn-
ing, pages 115–123, 2013. 3
[6] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston.
Smash: one-shot model architecture search through hypernetworks.
arXiv preprint arXiv:1708.05344, 2017. 2
[7] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang.
Efficient architecture search by network transformation. arXiv preprint
arXiv:1707.04873, 2017. 2
[8] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neu-
ral architecture search on target task and hardware. arXiv preprint
arXiv:1812.00332, 2018. 1, 3, 8
[9] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan,
Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze,
Carlos Guestrin, and Arvind Krishnamurthy. Tvm: An automated end-to-
end optimizing compiler for deep learning. In 13th USENIX Symposium
on Operating Systems Design and Implementation (OSDI 18), pages
578–594, 2018. 3
[10] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive darts: Bridging
the optimization gap for nas in the wild. International Journal of
Computer Vision, pages 1–18, 2020. 3
[11] Yutian Chen, Aja Huang, Ziyu Wang, Ioannis Antonoglou, Julian
Schrittwieser, David Silver, and Nando de Freitas. Bayesian optimization
in alphago. arXiv preprint arXiv:1812.06855, 2018. 6
[12] Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. Fairnas:
Rethinking evaluation fairness of weight sharing neural architecture
search. arXiv preprint arXiv:1907.01845, 2019. 2
[13] Xiangxiang Chu, Tianbao Zhou, Bo Zhang, and Jixiang Li. Fair darts:
Eliminating unfair advantages in differentiable architecture search. arXiv
preprint arXiv:1911.12126, 2019. 3
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-
Fei. Imagenet: A large-scale hierarchical image database. In Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference
on, pages 248–255. IEEE, 2009. 7
[15] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding
up automatic hyperparameter optimization of deep neural networks by
extrapolation of learning curves. In Twenty-Fourth International Joint
Conference on Artificial Intelligence, 2015. 3
[16] Peiyan Dong, Siyue Wang, Wei Niu, Chengming Zhang, Sheng Lin,
Zhengang Li, Yifan Gong, Bin Ren, Xue Lin, Yanzhi Wang, et al.
Rtmobile: Beyond real-time mobile acceleration of rnns for speech
recognition. arXiv preprint arXiv:2002.11474, 2020. 1
[17] Xuanyi Dong and Yi Yang. Network pruning via transformable archi-
tecture search. In Advances in Neural Information Processing Systems
(NeurIPS), pages 759–770, 2019. 2
[18] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient multi-
objective neural architecture search via lamarckian evolution. arXiv
preprint arXiv:1804.09081, 2018. 2
[19] Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and
Xinggang Wang. Densely connected search space for more flexible
neural architecture search. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 10628–10637, 2020.
3
[20] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis:
Finding sparse, trainable neural networks. ICLR, 2018. 5
[21] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery
for efficient dnns. In Advances in neural information processing systems
(NeurIPS), pages 1379–1387, 2016. 1, 2
[22] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu,
Yichen Wei, and Jian Sun. Single path one-shot neural architecture
search with uniform sampling. In European Conference on Computer
Vision, pages 544–560. Springer, 2020. 2
[23] Song Han, Huizi Mao, and William J. Dally. Deep compression:
Compressing deep neural networks with pruning, trained quantization
and huffman coding. In International Conference on Learning Repre-
sentations (ICLR), 2016. 2, 5
[24] Song Han, Jeff Pool, John Tran, and William Dally. Learning both
weights and connections for efficient neural network. In Advances
in neural information processing systems (NeurIPS), pages 1135–1143,
2015. 1, 2
[25] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal,
Alec Wolman, and Arvind Krishnamurthy. Mcdnn: An approximation-
based execution framework for deep stream processing under resource
constraints. In Proceedings of the 14th Annual International Conference
on Mobile Systems, Applications, and Services (MobiSys), pages 123–
136. ACM, 2016. 3
[26] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft
filter pruning for accelerating deep convolutional neural networks. In
International Joint Conference on Artificial Intelligence (IJCAI), 2018.
2
[27] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song
Han. Amc: Automl for model compression and acceleration on mobile
devices. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 784–800, 2018. 1
[28] Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter
pruning via geometric median for deep convolutional neural networks
acceleration. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 4340–4349, 2019. 1, 2, 7
[29] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerat-
ing very deep neural networks. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), pages 1389–1397, 2017. 2
[30] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo
Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay
Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the
IEEE International Conference on Computer Vision, pages 1314–1324,
2019. 1, 8
[31] Andrew Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam.
Mobilenets: Efficient convolutional neural networks for mobile vision
applications. arXiv:1704.04861, 2017. 6, 8
[32] Loc N Huynh, Youngki Lee, and Rajesh Krishna Balan. Deepmon:
Mobile gpu-based deep learning framework for continuous vision ap-
plications. In Proceedings of the 15th Annual International Conference
on Mobile Systems, Applications, and Services (MobiSys), pages 82–95.
ACM, 2017. 3
[33] Weiwen Jiang, Xinyi Zhang, Edwin H-M Sha, Lei Yang, Qingfeng
Zhuge, Yiyu Shi, and Jingtong Hu. Accuracy vs. efficiency: Achieving
both through fpga-implementation aware neural architecture search. In
Proceedings of the 56th Annual Design Automation Conference 2019,
pages 1–6, 2019. 1
[34] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas
Poczos, and Eric P Xing. Neural architecture search with bayesian
optimisation and optimal transport. In Advances in neural information
processing systems, pages 2016–2025, 2018. 3
[35] Seyoung Kim and Eric P. Xing. Tree-guided group lasso for multi-
response regression with structured sparsity, with an application to eqtl
mapping. The Annals of Applied Statistics, 6(3):1095–1117, Sep 2012.
5
[36] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank
Hutter. Fast bayesian optimization of machine learning hyperparameters
on large datasets. In Artificial Intelligence and Statistics, pages 528–536.
PMLR, 2017. 6
[37] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio For-
livesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. Deepx: A software
accelerator for low-power deep learning inference on mobile devices.
In Proceedings of the 15th International Conference on Information
Processing in Sensor Networks, page 23. IEEE Press, 2016. 3
[38] Nicholas D Lane, Petko Georgiev, and Lorena Qendro. Deepear: robust
smartphone audio sensing in unconstrained acoustic environments using
deep learning. In Proceedings of the 2015 ACM International Joint
Conference on Pervasive and Ubiquitous Computing, pages 283–294.
ACM, 2015. 3
[39] Tuanhui Li, Baoyuan Wu, Yujiu Yang, Yanbo Fan, Yong Zhang, and
Wei Liu. Compressing convolutional neural networks via factorized
convolutional filters. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 3977–3986,
2019. 2, 5, 7
[40] Long-Ji Lin. Reinforcement Learning for Robots Using Neural Net-
works. PhD thesis, USA, 1992. 6
[41] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua,
Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy.
Progressive neural architecture search. In Proceedings of the European
Conference on Computer Vision (ECCV), pages 19–34, 2018. 2
[42] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando,
and Koray Kavukcuoglu. Hierarchical representations for efficient
architecture search. arXiv preprint arXiv:1711.00436, 2017. 2
[43] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable
architecture search. arXiv preprint arXiv:1806.09055, 2018. 3
[44] Ning Liu, Xiaolong Ma, Zhiyuan Xu, Yanzhi Wang, Jian Tang, and
Jieping Ye. Autocompress: An automatic dnn structured pruning frame-
work for ultra-high compression rates. arXiv preprint arXiv:1907.03141,
2019. 2
[45] Ning Liu, Xiaolong Ma, Zhiyuan Xu, Yanzhi Wang, Jian Tang, and
Jieping Ye. Autocompress: An automatic dnn structured pruning
framework for ultra-high compression rates. In AAAI, 2020. 1
[46] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor
Darrell. Rethinking the value of network pruning. arXiv preprint
arXiv:1810.05270, 2018. 1, 2
[47] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning
method for deep neural network compression. In Proceedings of the
IEEE international conference on computer vision (ICCV), pages 5058–
5066, 2017. 2
[48] Xiaolong Ma, Fu-Ming Guo, Wei Niu, Xue Lin, Jian Tang, Kaisheng Ma,
Bin Ren, and Yanzhi Wang. Pconv: The missing but desirable sparsity
in dnn weight pruning for real-time execution on mobile devices. In
Thirty-Four AAAI Conference on Artificial Intelligence, 2020. 1, 2, 3
[49] Xiaolong Ma, Wei Niu, Tianyun Zhang, Sijia Liu, Fu-Ming Guo, Sheng
Lin, Hongjia Li, Xiang Chen, Jian Tang, Kaisheng Ma, et al. An
image enhancing pattern-based sparsity for real-time inference on mobile
devices. arXiv preprint arXiv:2001.07710, 2020. 2, 3
[50] Xiaolong Ma, Geng Yuan, Sheng Lin, Caiwen Ding, Fuxun Yu, Tao
Liu, Wujie Wen, Xiang Chen, and Yanzhi Wang. Tiny but accurate:
A pruned, quantized and optimized memristor crossbar framework for
ultra efficient dnn implementation. In ASP-DAC, 2020. 1
[51] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang,
and William J Dally. Exploring the regularity of sparse structure in
convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
2
[52] Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springen-
berg, and Frank Hutter. Towards automatically-tuned neural networks.
In Workshop on Automatic Machine Learning, pages 58–65, 2016. 3
[53] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Daniel
Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan,
Nigel Duffy, et al. Evolving deep neural networks. In Artificial
Intelligence in the Age of Neural Networks and Brain Computing, pages
293–312. Elsevier, 2019. 2
[54] Chuhan Min, Aosen Wang, Yiran Chen, Wenyao Xu, and Xin Chen.
2pfpce: Two-phase filter pruning based on conditional entropy. arXiv
preprint arXiv:1809.02220, 2018. 1
[55] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu,
Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, An-
dreas K Fidjeland, Georg Ostrovski, et al. Human-level control through
deep reinforcement learning. nature, 518(7540):529–533, 2015. 6
[56] Christopher Morris, Kristian Kersting, and Petra Mutzel. Glocalized
weisfeiler-lehman graph kernels: Global-local feature maps of graphs.
In 2017 IEEE International Conference on Data Mining (ICDM), pages
327–336. IEEE, 2017. 6
[57] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance
under reward transformations: Theory and application to reward shaping.
In ICML, volume 99, pages 278–287, 1999. 6
[58] Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin,
Yanzhi Wang, and Bin Ren. Patdnn: Achieving real-time dnn execution
on mobile devices with pattern-based weight pruning. arXiv preprint
arXiv:2001.00138, 2020. 1, 2, 3, 5, 6
[59] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean.
Efficient neural architecture search via parameter sharing. arXiv preprint
arXiv:1802.03268, 2018. 2
[60] Chao Qin, Diego Klabjan, and Daniel Russo. Improving the expected
improvement algorithm. In Advances in Neural Information Processing
Systems, pages 5381–5391, 2017. 7
[61] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Aging
evolution for image classifier architecture search. In AAAI Conference
on Artificial Intelligence, 2019. 2
[62] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regu-
larized evolution for image classifier architecture search. In Proceedings
of the aaai conference on artificial intelligence, volume 33, pages 4780–
4789, 2019. 2, 8
[63] Binxin Ru, Xingchen Wan, Xiaowen Dong, and Michael Osborne.
Neural architecture search using bayesian optimisation with weisfeiler-
lehman kernel. arXiv preprint arXiv:2006.07556, 2020. 3
[64] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov,
and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear
bottlenecks. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 4510–4520, 2018. 6, 8
[65] Zhiqiang Shen and Marios Savvides. Meal v2: Boosting vanilla resnet-
50 to 80%+ top-1 accuracy on imagenet without tricks. arXiv preprint
arXiv:2009.08453, 2020. 7
[66] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt
Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-lehman graph kernels.
Journal of Machine Learning Research, 12(77):2539–2561, 2011. 6, 7
[67] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian
optimization of machine learning algorithms. In Advances in neural
information processing systems, pages 2951–2959, 2012. 6
[68] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark San-
dler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural
architecture search for mobile. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages 2820–2828,
2019. 1, 3, 8
[69] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling
for convolutional neural networks. arXiv preprint arXiv:1905.11946,
2019. 1
[70] Christopher John Cornish Hellaby Watkins. Learning from delayed
rewards. 1989. 6
[71] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li.
Learning structured sparsity in deep neural networks. In Advances
in neural information processing systems (NeurIPS), pages 2074–2082,
2016. 1, 2, 5
[72] Colin White, RealityEngines AI, Willie Neiswanger, and Yash Savani.
Deep uncertainty estimation for model-based neural architecture search.
3
[73] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun,
Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt
Keutzer. Fbnet: Hardware-aware efficient convnet design via differen-
tiable neural architecture search. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages 10734–
10742, 2019. 1, 3
[74] Lingxi Xie and Alan Yuille. Genetic cnn. In Proceedings of the IEEE
international conference on computer vision, pages 1379–1388, 2017. 2
[75] Mengwei Xu, Mengze Zhu, Yunxin Liu, Felix Xiaozhu Lin, and Xu-
anzhe Liu. Deepcache: Principled cache for mobile deep vision. In
Proceedings of the 24th Annual International Conference on Mobile
Computing and Networking, pages 129–144. ACM, 2018. 3
[76] Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi,
Qi Tian, and Hongkai Xiong. Pc-darts: Partial channel connections
for memory-efficient differentiable architecture search. arXiv preprint
arXiv:1907.05737, 2019. 3
[77] Shuochao Yao, Shaohan Hu, Yiran Zhao, Aston Zhang, and Tarek
Abdelzaher. Deepsense: A unified deep learning framework for time-
series mobile sensing data processing. In Proceedings of the 26th
International Conference on World Wide Web, pages 351–360, 2017.
3
[78] Shan You, Tao Huang, Mingmin Yang, Fei Wang, Chen Qian, and
Changshui Zhang. Greedynas: Towards fast one-shot nas with greedy
supernet. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 1999–2008, 2020. 2
[79] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu,
Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp:
Pruning networks using neuron importance score propagation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 9194–9203, 2018. 2
[80] Ming Yuan and Yi Lin. Model selection and estimation in regression
with grouped variables. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 68(1):49–67, 2006. 2
[81] Tianyun Zhang, Shaokai Ye, Yipeng Zhang, Yanzhi Wang, and Makan
Fardad. Systematic weight pruning of dnns using alternating direction
method of multipliers. arXiv preprint arXiv:1802.05747, 2018. 2, 5, 7
[82] Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang,
and Qi Tian. Variational convolutional neural network pruning. In CVPR,
pages 2780–2789, 2019. 1
[83] Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Prac-
tical block-wise neural network architecture generation. In Proceedings
of the IEEE conference on computer vision and pattern recognition,
pages 2423–2432, 2018. 2
[84] Xiaotian Zhu, Wengang Zhou, and Houqiang Li. Improving deep neural
network sparsity through decorrelation regularization. In IJCAI, 2018.
1
[85] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo,
Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware
channel pruning for deep neural networks. In Advances in Neural
Information Processing Systems (NeurIPS), pages 875–886, 2018. 1,
5, 6
[86] Barret Zoph and Quoc V. Le. Neural architecture search with reinforce-
ment learning. In International Conference on Learning Representations
(ICLR), 2017. 1, 2
[87] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le.
Learning transferable architectures for scalable image recognition. In
Proceedings of the IEEE conference on computer vision and pattern
recognition (CVPR), pages 8697–8710, 2018. 2, 8
