VarGNet: Variable Group Convolutional Neural Network for Efficient
  Embedded Computing by Zhang, Qian et al.
Qian Zhang, Jianjun Li, Meng Yao, Liangchen Song, Helong Zhou,
Zhichao Li, Wenming Meng, Xuezhi Zhang, Guoli Wang
Horizon Robotics
Abstract
In this paper, we propose a novel network design mechanism for efficient embedded
computing. Inspired by the limited computing patterns of embedded hardware, we propose to fix the
number of channels in each group of a group convolution, instead of the existing practice of
fixing the total number of groups. The resulting network, named Variable Group
Convolutional Network (VarGNet), can be optimized more easily on the hardware side, owing
to the more unified computing schemes among its layers. Extensive experiments
on various vision tasks, including classification, detection, pixel-wise parsing and
face recognition, demonstrate the practical value of VarGNet.
1 Introduction
Empowering embedded systems to run well-known deep learning architectures, such as
convolutional neural networks (CNNs), has been a hot topic in recent years. For smart Internet of Things
applications, the challenge is that the whole system must be both energy-constrained
and small. To meet this challenge, work on improving the efficiency of the whole computing
process can be roughly divided into two directions. The first is to design lightweight networks that
have small MAdds [20, 38, 52, 30] and are thus friendly to low-power platforms. The second
is to optimize hardware-side configurations, such as FPGA-based accelerators [13, 50], or to make
the whole computing process more efficient by improving the compiler and generating smarter
instructions [2, 6, 48].
All of the works mentioned above have demonstrated great practical value in various applications.
However, the real performance may not live up to the designer's expectations, owing to the gap between
the two optimization directions. Specifically, for elaborately tuned networks with small
MAdds, the overall latency may still be high [30], while for carefully designed compilers or accelerators,
real networks may be hard to process.
In this work, we intend to close the existing gap by systematically analyzing the necessary properties of
a lightweight network that is friendly to embedded hardware and the corresponding compilers.
More precisely, since the computation patterns of a chip in an embedded system are strictly limited, we
propose that an embedded-system-friendly network should fit both the targeted computation patterns
and the ideal data layout. By fitting the ideal data layout, we can reduce the communication
cost between on-chip memory and off-chip memory, and thus fully exploit the computation throughput.
Inspired by the observation that the computation graph of a network is easier to optimize if the
computational intensity of its operations is more balanced, we propose the variable
group convolution, which is based on depthwise separable convolution [25, 8, 47]. In variable group
convolution, the number of input channels in each group is fixed and can be tuned as a hyperparameter,
which differs from conventional group convolution, where the number of groups is fixed. The benefits are
twofold: Fixing the number of channels is more suitable for optimization from the perspective of




















convolution in [20, 38], which sets the group number equal to the channel number, variable group
convolution has a larger network capacity [38], thus allowing smaller channel numbers, which
helps relieve the time-consuming off-chip communication.
Another key component of our network is to better exploit the on-chip memory, building on the inverted
residual block [38]. However, in MobileNetV2 [38], the number of channels is adjusted by pointwise
convolutions, which have a different computing pattern from the 3×3 depthwise convolution in between
and are therefore hard to optimize under limited computation patterns. We therefore propose that the
input feature with C channels is first expanded to 2C channels by a variable group convolution and then
returned to C channels by a pointwise convolution. In this manner, the computational costs of the two
types of layers are more balanced, making the block more hardware and compiler friendly. In summary,
our contributions are as follows:
• We systematically analyze how to optimize the computation of CNNs from the perspective
of both network architectures and hardware/compilers on embedded systems. We find that
there is a gap between the two optimization directions: some elaborately designed
architectures are hard to optimize because of the limited computation patterns in an embedded
system.
• Observing that a more unified computation pattern and data layout are friendlier to
an embedded system, we propose the variable group convolution and the corresponding
improved network, named Variable Group Network (VarGNet for short).
• Experiments on prevalent vision tasks, such as classification, detection, segmentation and face
recognition, and on the corresponding large-scale datasets verify the practical value of our
proposed VarGNet.
1.1 Related works
Lightweight CNNs. Designing lightweight CNNs has been a hot topic in recent years. Representative
manually designed networks include SqueezeNet [22], Xception [8], MobileNets [20, 38],
ShuffleNets [52, 30] and IGC [51, 46, 41]. In addition, neural architecture search (NAS) [53, 35, 37, 54, 28]
is a promising direction for automatically designing lightweight CNNs. The above methods are
capable of effectively speeding up the recognition process. More recently, platform-aware NAS methods
have been proposed [4, 44, 10, 40] to search for networks that are efficient on specific hardware
platforms. Our network, VarGNet, is complementary to the existing platform-aware NAS methods,
since the proposed variable group convolution is helpful for setting the search space in NAS methods.
Optimizations on CNN accelerators. To accelerate neural networks, FPGAs [13, 50, 17, 31]
and ASIC designs [7, 36, 23, 29, 19] have been widely studied. Generally speaking, Streaming
Architectures (SAs) [42, 45] and Single Computation Engines (SCEs) [15, 5, 2] are the two kinds of
FPGA-based accelerators [43]. The two differ in how they weigh customization against
generality. SA designs favor customization over generality, while SCEs emphasize the tradeoff
between flexibility and customization. In this work, we hope to propose a network that can be
optimized by existing accelerators more easily, thus improving the overall performance.
2 Designing Efficient Networks on Embedded Systems
For chips used in embedded systems, such as FPGAs or ASICs, a low unit price and a fast
time to market are critical factors in designing the whole system. These requirements result in a
relatively simple chip configuration. In other words, the computation schemes are strictly limited
compared with general-purpose processing units. However, the operators in a SOTA network are so
complex that some layers can be accelerated by hardware design while others cannot. Thus, for designing
efficient networks on embedded systems, the first intuition is that the layers in a network should
be similar to each other in some sense.
Another important intuition is based on two properties of convolutions used in CNNs. The first
property is the computation pattern. In convolution, several filters (kernels) slide over the whole
feature map, indicating that the kernels are repeatedly used while values from the feature map are only
used once. The second property is the data size of convolutional kernels and feature maps. Typically,

























































(b) Down sampling block.
Figure 1: Variable Group Network.
and 2HWC for feature maps in 2D convolutions. In light of the above two properties, an ingenious
solution is to load all the kernel data first and then perform the convolution while pushing in and
popping out feature data sequentially [48]. This practical solution is the second intuition behind the
following two guidelines for efficient network design on embedded systems:
• It will be better if the size of intermediate feature maps between blocks is smaller.
• The computational intensity of layers in a block should be balanced.
Next, we introduce the two guidelines in detail.
Small intermediate feature maps between blocks. In SOTA networks, a common practice is to
first design a normal block and a down sampling block, and then stack several blocks together to
get a deep network. Also, residual connections [18] are widely adopted in these blocks. So, in recent
compiler-side optimizations [48], the layers in a block are usually grouped and computed together. In
this manner, off-chip memory and on-chip memory communicate only when starting or finishing
the computation of a block in the network. Therefore, a smaller intermediate feature map between blocks
helps reduce the data transfer time.
Balanced computational intensity inside a block. As mentioned before, in practice, the weights of
several layers are loaded before performing convolution. If the loaded layers diverge widely
in computational intensity, extra on-chip memory is needed to store the intermediate slices
of feature maps. In MobileNetV1 [20], a depthwise conv and a pointwise conv are used. Different
from previous definitions, in our implementation the weights are already loaded, so computational
intensity is computed as MAdds divided by the size of the feature maps. Then, if the feature map is of size
28×28×256, the computational intensities of the depthwise convolution and the pointwise convolution are
9 and 256, respectively. As a result, when running the two layers, we have to either increase the on-chip
buffer to satisfy the pointwise convolution or not group the computation of the two layers together.
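The two intensity values quoted for the 28×28×256 feature map can be reproduced with a short sketch. It follows the definition used in the text (MAdds divided by the number of feature-map elements, weights already on chip) and additionally assumes a 3×3 depthwise kernel, stride 1, and equal input and output channel counts for the pointwise layer. The function names are illustrative and not taken from the authors' code.

```python
def depthwise_intensity(h, w, channels, kernel=3):
    """MAdds per feature-map element for a depthwise conv (stride 1 assumed)."""
    madds = h * w * channels * kernel * kernel
    feature_map_size = h * w * channels
    return madds / feature_map_size


def pointwise_intensity(h, w, in_channels, out_channels):
    """MAdds per output feature-map element for a 1x1 conv."""
    madds = h * w * in_channels * out_channels
    feature_map_size = h * w * out_channels
    return madds / feature_map_size


dw = depthwise_intensity(28, 28, 256)
pw = pointwise_intensity(28, 28, 256, 256)
print(dw, pw)  # 9.0 256.0
```

Under these assumptions the spatial size cancels out: the depthwise intensity depends only on the kernel area, while the pointwise intensity grows with the channel count, which is the imbalance the guideline targets.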
3 Variable Group Network
Based on the two previously mentioned guidelines, we propose a novel network in this section. To
balance the computational intensity, we set the number of channels in a group to be constant throughout
the network, resulting in a variable number of groups in each convolution layer. The motivation of fixing the channel numbers




Thus, if the size of the feature map is constant, then by fixing $G = \mathrm{Channels} / \mathrm{Groups}$, the computational intensity
inside a block is more balanced. Further, the number of channels in a group can be set to match the
configuration of the processing elements, which process a fixed number of channels
at a time.
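As a rough illustration of this point, the sketch below derives the number of groups from a fixed G and shows that the per-output intensity of the group convolution then no longer depends on the layer width. It assumes a 3×3 kernel, channel counts divisible by G, and intensity measured as MAdds per output element; the helper names are hypothetical.

```python
def num_groups(channels, channels_per_group):
    """Number of groups when each group holds a fixed number of input channels."""
    if channels % channels_per_group != 0:
        raise ValueError("channels must be divisible by channels_per_group")
    return channels // channels_per_group


def group_conv_intensity(channels_per_group, kernel=3):
    """MAdds per output element: each output reads kernel*kernel*G inputs."""
    return kernel * kernel * channels_per_group


G = 8  # hyperparameter; Tables 3 and 4 report G = 4 and G = 8
for channels in (64, 128, 256):
    # The group count varies with the width, the intensity does not.
    print(channels, num_groups(channels, G), group_conv_intensity(G))
```

With G = 1 the same formula gives 9, the depthwise case, so a larger G raises the intensity of the spatial convolution towards that of the pointwise layer.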
Compared with depthwise convolution, the variable group convolution increases the MAdds as well
as the expressiveness [38]. Thus, now we are able to reduce the channel number of intermediate





























































Figure 2: Computing scheme of a normal block in Variable Group Network. The weights of the four
convolution operations are first loaded into on-chip memory, and the features are then processed.
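Table 1: Overall architecture of Variable Group Network v1.
Columns 0.25x to 1.75x give the output channels at each model scale.
| Layer | Output Size | KSize | Stride | Repeat | 0.25x | 0.5x | 0.75x | 1x | 1.25x | 1.5x | 1.75x |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Image | 224 x 224 | | | | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Conv 1 | 112 x 112 | 3 x 3 | 2 | 1 | 8 | 16 | 24 | 32 | 40 | 48 | 56 |
| DownSample | 56 x 56 | | 2 | 3 | 16 | 32 | 48 | 64 | 80 | 96 | 112 |
| DownSample | 28 x 28 | | 2 | 1 | 32 | 64 | 96 | 128 | 160 | 192 | 224 |
| DownSample | 14 x 14 | | 2 | 1 | 64 | 128 | 192 | 256 | 320 | 384 | 448 |
| Stage Block | 14 x 14 | | 1 | 2 | 64 | 128 | 192 | 256 | 320 | 384 | 448 |
| DownSample | 7 x 7 | | 2 | 1 | 128 | 256 | 384 | 512 | 640 | 768 | 896 |
| Stage Block | 7 x 7 | | 1 | 1 | 128 | 256 | 384 | 512 | 640 | 768 | 896 |
| Conv 5 | 7 x 7 | 1 x 1 | 1 | 1 | 1024 | 1024 | 1024 | 1024 | 1280 | 1536 | 1792 |
| Global Pool | 1 x 1 | 7 x 7 | | | | | | | | | |
| FC | | | | | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |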
design novel network blocks as shown in Fig. 1. For the normal block used in the early stages of the
network, since the weights are relatively small at this point, the weights of all four layers
can be cached in on-chip memory. In the late stages, where the channel numbers
and hence the weight sizes increase, the normal block can still be optimized by
loading only a variable group conv and a pointwise conv. Similarly, the operations in the down sampling
block are also friendly to compiler-side and hardware-side optimizations. The whole computing
process for a normal block is shown in Fig. 2. Then, based on the architecture of MobileNetV1
[20], we replace its basic blocks with ours; the detailed network architecture is shown in
Tab. 1. Another architecture, based on ShuffleNet v2, is shown in Tab. 2.
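The per-scale channel widths listed in Tab. 1 appear to follow a simple rule, sketched below: each base width of the 1x model is multiplied by the model scale, while the final 1×1 convolution stays at 1024 channels for scales up to 1x and grows with the scale beyond that. This rule is inferred from the table entries and is not stated by the authors; the names `BASE_CHANNELS_V1` and `scaled_widths` are hypothetical.

```python
# 1x column of Tab. 1, from Conv 1 down to the last Stage Block.
BASE_CHANNELS_V1 = [32, 64, 128, 256, 256, 512, 512]


def scaled_widths(scale, base=BASE_CHANNELS_V1, last_conv=1024):
    """Return (per-layer widths, Conv 5 width) for a given model scale."""
    widths = [int(round(c * scale)) for c in base]
    # Conv 5 is never narrower than 1024 in the table.
    head = max(last_conv, int(round(last_conv * scale)))
    return widths, head


print(scaled_widths(0.25))  # ([8, 16, 32, 64, 64, 128, 128], 1024)
print(scaled_widths(1.5))   # ([48, 96, 192, 384, 384, 768, 768], 1536)
```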
4 Experiments
4.1 ImageNet Classification
The results of our model on ImageNet are presented in Tab. 3 and Tab. 4. Training hyperparameters
are set as follows: batch size 1024, crop ratio 0.875, learning rate 0.4, cosine learning rate schedule, weight
decay 4e-5 and 240 training epochs. We observe that VarGNet v1 performs better than MobileNet
v1, as shown in Tab. 3. From (c) in Tab. 4, we can see that when the model scale is small,
VarGNet v2 performs worse than ShuffleNet v2, owing to the fewer channels used in VarGNet
v2. When the model size is large, our network performs better.
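For readers who want to reproduce the schedule, a cosine learning rate schedule with the stated base rate of 0.4 over 240 epochs could look like the sketch below. The text does not say whether warmup is used, what the final learning rate is, or whether the rate is updated per epoch or per iteration; the sketch assumes per-epoch annealing to zero without warmup.

```python
import math


def cosine_lr(epoch, base_lr=0.4, total_epochs=240):
    """Half-cosine annealing from base_lr at epoch 0 towards 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


for epoch in (0, 60, 120, 180, 240):
    print(epoch, round(cosine_lr(epoch), 4))
```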
Table 2: Overall architecture of Variable Group Network v2. Head Block is a modified version
of Normal Block, by setting the stride to 2 and keeping the channel numbers unchanged after two
variable convolution layers.
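Columns 0.25x to 2x give the output channels at each model scale.
| Layer | Output Size | KSize | Stride | Repeat | 0.25x | 0.5x | 0.75x | 1x | 1.25x | 1.5x | 1.75x | 2x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Image | 224 x 224 | | | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Conv 1 | 112 x 112 | 3 x 3 | 2 | 1 | 8 | 16 | 24 | 32 | 40 | 48 | 56 | 64 |
| Head Block | 56 x 56 | | 2 | 1 | 8 | 16 | 24 | 32 | 40 | 48 | 56 | 64 |
| Stage 2 | 28 x 28 | | 2 | 1 | 16 | 32 | 48 | 64 | 80 | 96 | 112 | 128 |
| | 28 x 28 | | 1 | 2 | | | | | | | | |
| Stage 3 | 14 x 14 | | 2 | 1 | 32 | 64 | 96 | 128 | 160 | 192 | 224 | 256 |
| | 14 x 14 | | 1 | 6 | | | | | | | | |
| Stage 4 | 7 x 7 | | 2 | 1 | 64 | 128 | 192 | 256 | 320 | 384 | 448 | 512 |
| | 7 x 7 | | 1 | 3 | | | | | | | | |
| Conv 5 | 7 x 7 | 1 x 1 | 1 | 1 | 1024 | 1024 | 1024 | 1024 | 1280 | 1536 | 1792 | 2048 |
| Global Pool | 1 x 1 | 7 x 7 | | | | | | | | | | |
| FC | | | | | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |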
Table 3: VarGNet v1 performance on ImageNet. (G is the number of channels in a group.)
(a) G = 4
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 63.80% 1.44M 55M 128
0.5 69.71% 2.23M 157M 256
0.75 72.38% 3.43M 309M 384
1 73.64% 5.02M 509M 512
1.25 74.34% 7.42M 767M 640
1.5 74.47% 10.28M 1.05G 768
(b) G = 8
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 64.90% 1.5M 75M 128
0.5 70.40% 2.37M 198M 256
0.75 72.60% 3.66M 370M 384
1 73.90% 5.33M 590M 512
1.25 74.70% 7.8M 869M 640
1.5 75.00% 10.7M 1.17G 768
1.75 75.30% 14.1M 1.54G 1024
(c) Comparison network: MobileNet v1
Model Scale Acc(top1) Model size MAdds Max Channels
0.35 60.4% 0.7 M 72 M 358
0.6 68.6% 1.7 M 201 M 614
0.85 72.0% 3.1M 394 M 870
1.0 73.3% 4.1M 542 M 1024
1.05 73.5% 4.4 M 594 M 1075
1.3 74.7% 6.4 M 903 M 1331
1.5 75.1% 8.3 M 1.17 G 1536
Table 4: VarGNet v2 performance on ImageNet. (G is the number of channels in a group.)
(a) G = 4
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 59.39% 1.27M 35M 64
0.5 66.98% 1.72M 92M 128
0.75 70.42% 2.35M 173M 192
1 72.76% 3.19M 278M 256
1.25 74.08% 4.55M 411M 320
1.5 74.91% 6.14M 569M 384
2 75.44% 10.0M 961M 512
(b) G = 8
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 59.81% 1.35M 51M 64
0.5 67.80% 1.87M 124M 128
0.75 70.36% 2.58M 222M 192
1 73.10% 3.49M 343M 256
1.25 74.34% 4.94M 492M 320
1.5 75.04% 6.60M 666M 384
1.75 75.49% 8.50M 866M 448
2 75.71% 10.6M 1.06G 512
(c) Comparison network: ShuffleNet v2
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 (60) 63.85% 1.47M 51M 240
0.5 (108) 68.74% 2.1M 123M 432
0.75 (154) 71.65% 2.92M 223M 616
1 (196) 73.17% 3.87M 342M 784
1.25 (228) 74.15% 6.63M 494M 912
1.5 (270) 74.56% 8.06M 666M 1080
1.75 (312) 75.24% 9.68M 863M 1248
4.2 Object Detection
In Tab. 5, we present the performance of our proposed VarGNet as well as that of the comparison methods.
We evaluate the object detection performance of our proposed networks on the COCO dataset [27] and
compare them with other state-of-the-art lightweight architectures. We choose FPN-based Faster
R-CNN [26] as the framework, and all the experiments are run under the same settings, with
an input resolution of 800×1333 and 18 training epochs. In particular, we find that
ShuffleNet v2 achieves better accuracy if trained for more epochs, so a model with 27 epochs is
also trained for ShuffleNet v2. At test time, 1000 proposals per image are evaluated in the RPN stage. We
train on the train+val set excluding the 8000 minival images, and test on the minival set.
4.3 Pixel Level Parsing
4.3.1 Cityscapes
On the Cityscapes dataset [9], we design a multi-task structure (Fig. 3a) to conduct two important
pixel level parsing tasks: single image depth prediction and segmentation.
Training setup. We use the standard Adam optimizer with weight decay set to 1e-5 and batch size
set to 16. The learning rate is initialized to 1e-4 and follows a polynomial decay with a power of
0.9. The total number of training epochs is 100. For data augmentation, random horizontal flipping is used and
images are resized with a scale randomly chosen from 0.6 to 1.2. For multitask training, we have the loss
Table 5: Performance on COCO object detection with FPN based Faster R-CNN. The input image
size is 800 × 1333.
| Network | MAdds (G) | mAP |
|---|---|---|
| MobileNet v1 1.0 | 24.15 | 31.1 |
| MobileNet v2 1.0 | 18.71 | 31.0 |
| ShuffleNet v1 1.0 | 15.31 | 27.9 |
| ShuffleNet v2 1.0 | 15.55 | 27.5 |
| ShuffleNet v2 1.0 (27 epochs) | 15.55 | 28.9 |
| VarGNet v1 1.0 | 24.91 | 33.7 |
| VarGNet v2 0.5 | 14.98 | 28.6 |








(a) The multi-task network used in Cityscapes experiments.
(b) The U-Net style network used in KITTI experiments.
Figure 3: Network architectures.
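function defined as
$$L_{\mathrm{total}} = \lambda_{\mathrm{instance}} L_{\mathrm{instance}} + \lambda_{\mathrm{semantic}} L_{\mathrm{semantic}} + \lambda_{\mathrm{depth}} L_{\mathrm{depth}}.$$
When the task is panoptic segmentation, we set $\lambda_{\mathrm{instance}} = 0.2$ and $\lambda_{\mathrm{semantic}} = 1.0$. After adding the depth
task, we set $\lambda_{\mathrm{depth}} = 0.08$.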
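The weighted sum above is straightforward to express in code. The sketch below uses plain floats as stand-ins for the per-task losses, which in an actual training loop would be tensors from the respective heads; the dictionary keys and the function name are hypothetical, and only the three weights come from the text.

```python
def total_loss(losses, weights):
    """Weighted sum of per-task losses; tasks without a weight are rejected."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight for tasks: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


PANOPTIC_WEIGHTS = {"instance": 0.2, "semantic": 1.0}
PANOPTIC_DEPTH_WEIGHTS = {**PANOPTIC_WEIGHTS, "depth": 0.08}

# Placeholder loss values, purely for illustration.
example = {"instance": 1.5, "semantic": 0.8, "depth": 2.0}
print(total_loss(example, PANOPTIC_DEPTH_WEIGHTS))
```

Rejecting an unweighted task, rather than silently treating its weight as zero or one, is a design choice of this sketch so that a depth loss cannot slip into a panoptic-only configuration unnoticed.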
Results. Parameters and MAdds of the comparison methods are presented in Table 6. Results and
some visual examples on segmentation and depth prediction are shown in Table 7 and Fig. 4,
respectively. These tables support the advantage of the proposed VarGNet v1 and v2: both
are efficient and perform comparably to much larger networks.
Table 6: Details of comparison methods and ours on pixel level parsing tasks with input size 640×360.
Method Backbone MAdds(G) Params
SegNet[3] VGG16 286.0 29.5M
Enet[34] From scratch 3.8 0.4M
BiSeNet[49] Xception39 2.9 5.8M
BiSeNet[49] Res18 10.8 49.0M
MobileNet v2 - 6.82 7.64M
VarGNet v1 - 6.16 13.23M
VarGNet v2 - 2.76 7.41M










Figure 4: Visual results on Cityscapes validation set.
Table 7: Results on Cityscapes validation set.
(a) Semantic Segmentation (image size 2048×1024)
Method Backbone Mean IoU(%)
BiSeNet[49] Xception39 69.0
BiSeNet[49] Res18 74.8
MobileNet v2 - 64.8
VarGNet v1 - 76.6
VarGNet v2 - 74.2
(b) Depth
Method AbsRel SqRel RMSE RMSE Log
MobileNet v2 0.167 3.22 15.46 0.553
VarGNet v1 0.092 1.327 8.864 0.163
VarGNet v2 0.096 1.404 8.85 0.168
(c) Panoptic Segmentation (MAdds calculated with 2048×1024 input size.)
Method Backbone MAdds PQ PQ(Things) PQ(Stuff) Mean IoU(%)
PFPnet[24] resnet101 533G 58.1 52 62.5 75.1
VarGNet v1 - 104G 57.1 50 62.3 73.4
VarGNet v2 - 68G 54.5 45.1 59.8 71.4
(d) Panoptic Segmentation + Depth (MAdds calculated with 2048×1024 input size.)
Method MAdds PQ PQ(Things) PQ(Stuff) Mean IoU(%) AbsRel RMSE
VarGNet v1 109G 56 48.8 61.3 71 0.1 9.2
VarGNet v2 70G 53.9 46.2 59.5 70.5 0.116 10.06
4.4 KITTI
Training setup. For single image depth prediction and stereo tasks on the KITTI dataset [14], we
present the performance of our VarGNet-based models. A U-Net style architecture (Fig. 3b) is employed
in the experiments. All the depth models are trained on the KITTI RAW dataset. We test on 697 images
from 29 scenes split by Eigen et al. [12], and train on about 23488 images from the remaining 32
scenes. All the experimental results are evaluated with depth ranging from 0m to 80m and from 0m to
50m. The evaluation metrics are the same as in previous works. All the stereo models are trained on
the KITTI RAW dataset. We test on the test set split by Eigen et al. [12] and on the training set of KITTI15. The
evaluation metrics for stereo are EPE and D1. During training, the standard SGD optimizer is used with
momentum set to 0.9. The weight decay is set to 0.0001 for ResNet 18 and ResNet 50, and
0.00004 for the others. Training runs for 300 epochs. The initial learning rate is 0.001, and
it is multiplied by 0.1 at epochs 120, 180 and 240. We use 4 GPUs to train the models, and the batch size is
set to 24.
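The step schedule described above (initial rate 0.001, decay factor 0.1 at epochs 120, 180 and 240) can be sketched as follows. Whether epochs are counted from zero and whether the decay takes effect at the start or the end of a milestone epoch is not stated; the sketch assumes zero-based epochs with the decay applied from the milestone epoch onwards.

```python
def step_lr(epoch, base_lr=0.001, milestones=(120, 180, 240), gamma=0.1):
    """Learning rate after multiplying by gamma at each milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


for epoch in (0, 119, 120, 180, 240, 299):
    print(epoch, step_lr(epoch))
```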
Results. In Table 8 and Table 9, we show our depth results and stereo results under various
evaluation metrics. We also report our own implementations of MobileNet and ResNet for comparison.
Visual results are presented in Fig. 5 and Fig. 6.
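The depth columns in Table 8 are not defined in the text beyond "the same as previous works". The sketch below assumes the definitions commonly used with the Eigen split [12]: absolute and squared relative error, RMSE in linear and log space, and the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25, 1.25² and 1.25³. It works on flat lists of positive depths and masks ground truth outside the evaluation range; the function name and the toy values are illustrative only.

```python
import math


def depth_metrics(pred, gt, max_depth=80.0):
    """Common monocular-depth metrics over valid pixels (0 < gt <= max_depth).

    Predictions are assumed to be strictly positive so that log() is defined.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if 0.0 < g <= max_depth]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    ratios = [max(p / g, g / p) for p, g in pairs]
    # Accuracy under thresholds 1.25, 1.25^2 and 1.25^3.
    deltas = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse,
            "RMSE log": rmse_log, "delta": deltas}


# Toy values in metres, not data from the paper.
metrics = depth_metrics(pred=[10.0, 22.0, 40.0], gt=[10.0, 20.0, 50.0])
print(metrics)
```

Passing `max_depth=50.0` would correspond to the 0-50m setting reported in the second half of Table 8, under the same assumptions.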
4.5 Face Recognition
All the networks are trained on the DeepGlint MS-Celeb-1M-v1c dataset [1], cleaned from
MS-Celeb-1M [16]. There are 3,923,399 aligned face images from 86,876 identities. LFW [21], CFP-FP [39]
and AgeDB-30 [32] are used as the validation datasets. Finally, all network models are evaluated
on MegaFace Challenge 1 [33]. Table 10 lists the best face recognition accuracies on the validation
datasets, as well as the face verification true accept rates at a false accept rate of 1e-6 on the refined
version of the MegaFace dataset [11]. We use MobileNet v1 and MobileNet v2 as baseline models. To
adapt to the input image size of 112x112, the stride of the first convolutional layer is set to 1 for each
Table 8: Depth results on KITTI test set.
(a) 0-80m
Method AbsRel SqRel RMSE RMSE Log δ<1.25 δ<1.25² δ<1.25³ MAdds(G) Params
MobileNet v2 1.0 0.103 0.744 4.686 0.17 0.888 0.966 0.987 36.8 7.6 M
MobileNet v2 0.5 0.112 0.865 5.01 0.183 0.869 0.959 0.983 10.0 1.9 M
MobileNet v2 0.25 0.113 0.831 4.988 0.183 0.866 0.96 0.985 2.9 539.2 K
ResNet 18 0.109 0.767 4.76 0.178 0.869 0.961 0.986 203.4 30.6 M
ResNet 50 0.109 0.788 4.796 0.18 0.868 0.959 0.984 247.5 46.7 M
VarGNet v1 1.0 0.105 0.798 4.92 0.175 0.883 0.965 0.986 36.0 13.2 M
VarGNet v1 0.5 0.107 0.803 4.86 0.175 0.881 0.964 0.986 12.8 3.8 M
VarGNet v1 0.25 0.113 0.845 5.003 0.18 0.87 0.962 0.986 5.1 1.2 M
VarGNet v2 1.0 0.108 0.823 4.898 0.176 0.881 0.965 0.986 20.0 7.4 M
VarGNet v2 0.5 0.111 0.851 4.98 0.179 0.874 0.961 0.985 7.7 2.2 M
VarGNet v2 0.25 0.118 0.9 5.11 0.186 0.863 0.959 0.985 3.3 788.1 K
(b) 0-50m
Method AbsRel SqRel RMSE RMSE Log δ<1.25 δ<1.25² δ<1.25³ MAdds(G) Params
MobileNet v2 1.0 0.097 0.557 3.424 0.155 0.903 0.972 0.989 36.8 7.6 M
MobileNet v2 0.5 0.106 0.649 3.665 0.167 0.886 0.966 0.986 10.0 1.9 M
MobileNet v2 0.25 0.106 0.63 3.693 0.168 0.883 0.966 0.988 2.9 539.2 K
ResNet 18 0.104 0.584 3.525 0.164 0.883 0.967 0.988 203.4 30.6 M
ResNet 50 0.104 0.592 3.521 0.165 0.883 0.965 0.987 247.5 46.7 M
VarGNet v1 1.0 0.098 0.578 3.534 0.158 0.899 0.973 0.99 36.0 13.2 M
VarGNet v1 0.5 0.1 0.603 3.535 0.159 0.897 0.97 0.989 12.8 3.8 M
VarGNet v1 0.25 0.106 0.637 3.648 0.165 0.887 0.969 0.989 5.1 1.2 M
VarGNet v2 1.0 0.101 0.612 3.556 0.16 0.896 0.971 0.989 20.0 7.4 M
VarGNet v2 0.5 0.104 0.635 3.639 0.163 0.89 0.968 0.988 7.7 2.2 M
VarGNet v2 0.25 0.112 0.681 3.768 0.171 0.88 0.966 0.988 3.3 788.1 K
Table 9: Stereo results on KITTI.
(a) On KITTI RAW
Method EPE D1 MAdds(G) Params
MobileNet v2 1.0 1.424 0.0777 37.0 7.6 M
MobileNet v2 0.5 1.4904 0.0832 10.1 1.9 M
MobileNet v2 0.25 1.5897 0.0902 2.9 539.5 K
ResNet 18 1.5269 0.0886 205.4 30.6 M
ResNet 50 1.531 0.0887 249.5 46.7 M
VarGNet v1 1.0 1.3296 0.0703 36.1 13.2 M
VarGNet v1 0.5 1.4045 0.0757 12.9 3.8 M
VarGNet v1 0.25 1.5111 0.0835 5.1 1.2 M
VarGNet v2 1.0 1.3582 0.0728 20.7 7.4 M
VarGNet v2 0.5 1.44 0.079 8.0 2.2 M
VarGNet v2 0.25 1.5346 0.0862 3.4 790.2 K
(b) On KITTI 15
Method EPE D1 MAdds Params
MobileNet v2 1.0 1.7387 0.0753 37.0 7.6 M
MobileNet v2 0.5 1.6861 0.0772 10.1 1.9 M
MobileNet v2 0.25 1.6754 0.0819 2.9 539.5 K
ResNet 18 1.7318 0.0873 205.4 30.6 M
ResNet 50 1.7305 0.0868 249.5 46.7 M
VarGNet v1 1.0 1.5767 0.07 36.1 13.2 M
VarGNet v1 0.5 1.5868 0.0708 12.9 3.8 M
VarGNet v1 0.25 1.6685 0.0747 5.1 1.2 M
VarGNet v2 1.0 1.5856 0.0697 20.7 7.4 M
VarGNet v2 0.5 1.5994 0.0735 8.0 2.2 M
VarGNet v2 0.25 1.6302 0.0777 3.4 790.2 K
(a) Input Image (b) GT
(c) MobileNet v2 1.0 (d) MobileNet v2 0.5 (e) MobileNet v2 0.25
(f) ResNet 18 (g) ResNet 50
(h) VarGNet v1 1.0 (i) VarGNet v1 0.5 (j) VarGNet v1 0.25
(k) VarGNet v2 1.0 (l) VarGNet v2 0.5 (m) VarGNet v2 0.25
Figure 5: Visualization of depth results on KITTI RAW.
Table 10: Face recognition results.
Networks MAdds LFW [21] CFP-FP [39] AgeDB-30 [32] MegaFace [11]
MobileNet v1 554M 0.99617 0.89714 0.96600 0.935848
MobileNet v2 313M 0.99500 0.86386 0.95583 0.898219
VarGNet v1 603M 0.99733 0.88929 0.97583 0.961499
VarGNet v2 355M 0.99733 0.89829 0.97333 0.954261
baseline and VarGNet model. To achieve better performance, we further replace the pooling layer
with a “BN-Dropout-FC-BN” structure as in InsightFace [11], followed by the ArcFace loss [11]. The
standard SGD optimizer is used with momentum 0.9, and the batch size is set to 512 on 8 GPUs.
The learning rate starts at 0.1 and is divided by 10 at 100K, 140K and 160K iterations. We
set the weight decay to 5e-4. The embedding feature dimension is 256, with a dropout rate of 0.4.
The normalization scale is 64 and the ArcFace margin is set to 0.5. All training is based on the
InsightFace toolbox [11].
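To make the roles of the normalization scale (64) and the margin (0.5) concrete, the sketch below applies the additive angular margin of ArcFace [11] to a list of cosine similarities between a normalized embedding and the normalized class weights. It is a plain-Python illustration rather than the InsightFace implementation, and it omits the special handling that practical implementations use when the angle plus the margin exceeds π.

```python
import math


def arcface_logits(cosines, target, scale=64.0, margin=0.5):
    """Apply the additive angular margin to the target class, then rescale."""
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding errors
        if j == target:
            # cos(theta + m) penalizes the ground-truth class.
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    return logits


# Toy cosine similarities for three classes; class 0 is the ground truth.
logits = arcface_logits([0.8, 0.1, -0.3], target=0)
print(logits)
```

The resulting logits would then be passed to a standard softmax cross-entropy loss; the margin lowers the target logit, so the embedding must be closer in angle to its class centre to achieve the same loss.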
(a) Left Input Image (b) Right Input Image (c) GT
(d) MobileNet v2 1.0 (e) MobileNet v2 0.5 (f) MobileNet v2 0.25
(g) ResNet 18 (h) ResNet 50
(i) VarGNet v1 1.0 (j) VarGNet v1 0.5 (k) VarGNet v1 0.25
(l) VarGNet v2 1.0 (m) VarGNet v2 0.5 (n) VarGNet v2 0.25
Figure 6: Visualization of stereo results on KITTI15.
References
[1] http://trillionpairs.deepglint.com/overview.
[2] Mohamed S Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O’Connell, Nitika
Shanker, Joseph Chu, Ian Prins, Joshua Fender, Andrew C Ling, et al. Dla: Compiler and
fpga overlay for neural network inference acceleration. In International Conference on Field
Programmable Logic and Applications, pages 411–4117. IEEE, 2018.
[3] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 39(12):2481–2495, 2017.
[4] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target
task and hardware. In International Conference on Learning Representations (ICLR), 2019.
[5] Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello. Compil-
ing deep learning models for custom hardware accelerators. arXiv preprint arXiv:1708.00117,
2017.
[6] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q Yan, Leyuan Wang, Yuwei
Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Tvm: end-to-end optimization
stack for deep learning. arXiv preprint arXiv:1802.04799, pages 1–15, 2018.
[7] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier
Temam. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-
learning. In ACM Sigplan Notices, volume 49, pages 269–284. ACM, 2014.
[8] François Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 1251–1258, 2017.
[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo
Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic
urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 3213–3223, 2016.
[10] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat
Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, et al. Chamnet: Towards efficient network
design through platform-aware model adaptation. arXiv preprint arXiv:1812.08934, 2018.
[11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular
margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
[12] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image
using a multi-scale deep network. In Advances in neural information processing systems, pages
2366–2374, 2014.
[13] Clément Farabet, Cyril Poulet, Jefferson Y. Han, and Yann LeCun. CNP: an fpga-based
processor for convolutional networks. In International Conference on Field Programmable
Logic and Applications, pages 32–37, 2009.
[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The
kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[15] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, and Huazhong Yang.
Angel-eye: A complete design flow for mapping cnn onto customized hardware. In IEEE
Computer Society Annual Symposium on VLSI (ISVLSI), pages 24–29. IEEE, 2016.
[16] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset
and benchmark for large-scale face recognition. In European Conference on Computer Vision
(ECCV), pages 87–102. Springer, 2016.
[17] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning
with limited numerical precision. In International Conference on Machine Learning (ICML),
pages 1737–1746, 2015.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In IEEE conference on computer vision and pattern recognition (CVPR), pages
770–778, 2016.
[19] Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W
Fletcher. Ucnn: Exploiting computational reuse in deep neural networks via weight repetition. In
ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), pages 674–687.
IEEE Press, 2018.
[20] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias
Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural
networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[21] Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the
wild: A database for studying face recognition in unconstrained environments. In Workshop on
Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[22] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and
Kurt Keutzer. SqueezeNet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb
model size. arXiv:1602.07360, 2016.
[23] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder
Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance
analysis of a tensor processing unit. In ACM/IEEE Annual International Symposium on
Computer Architecture (ISCA), pages 1–12. IEEE, 2017.
[24] Alexander Kirillov, Ross B. Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid
networks. arXiv preprint arXiv:1901.02446, 2019.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pages
1097–1105, 2012.
[26] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J.
Belongie. Feature pyramid networks for object detection. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 936–944, 2017.
[27] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James
Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco:
Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[28] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search.
In International Conference on Learning Representations (ICLR), 2019.
[29] Tao Luo, Shaoli Liu, Ling Li, Yuqing Wang, Shijin Zhang, Tianshi Chen, Zhiwei Xu, Olivier
Temam, and Yunji Chen. Dadiannao: A neural network supercomputer. IEEE Transactions on
Computers, 66(1):73–88, 2017.
[30] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines
for efficient cnn architecture design. In European Conference on Computer Vision (ECCV),
pages 116–131, 2018.
[31] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. Optimizing loop operation and dataflow
in fpga acceleration of deep convolutional neural networks. In ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, pages 45–54. ACM, 2017.
[32] Stylianos Moschoglou, Athanasios Papaioannou, Christos Sagonas, Jiankang Deng, Irene
Kotsia, and Stefanos Zafeiriou. Agedb: the first manually collected, in-the-wild age database. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
pages 51–59, 2017.
[33] Aaron Nech and Ira Kemelmacher-Shlizerman. Level playing field for million scale face
recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
7044–7053, 2017.
[34] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. Enet: A deep neural
network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147,
2016.
[35] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural
architecture search via parameter sharing. In International Conference on Machine Learning
(ICML), pages 4092–4101, 2018.
[36] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu
Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-
power, highly-accurate deep neural network accelerators. In ACM/IEEE Annual International
Symposium on Computer Architecture (ISCA), pages 267–278. IEEE, 2016.
[37] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for
image classifier architecture search. CoRR, abs/1802.01548, 2018.
[38] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.
Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.
[39] Soumyadip Sengupta, Jun-Cheng Chen, Carlos Castillo, Vishal M Patel, Rama Chellappa, and
David W Jacobs. Frontal to profile face verification in the wild. In IEEE Winter Conference on
Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
[40] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie
Liu, and Diana Marculescu. Single-path nas: Designing hardware-efficient convnets in less than
4 hours. arXiv preprint arXiv:1904.02877, 2019.
[41] Ke Sun, Mingjie Li, Dong Liu, and Jingdong Wang. Igcv3: Interleaved low-rank group
convolutions for efficient deep neural networks. arXiv preprint arXiv:1806.00178, 2018.
[42] Stylianos I Venieris and Christos-Savvas Bouganis. fpgaconvnet: Automated mapping of
convolutional neural networks on fpgas. In ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pages 291–292. ACM, 2017.
[43] Stylianos I Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. Toolflows for mapping
convolutional neural networks on fpgas: A survey and future directions. ACM Computing
Surveys (CSUR), 51(3):56, 2018.
[44] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong
Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet
design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018.
[45] Qingcheng Xiao, Yun Liang, Liqiang Lu, Shengen Yan, and Yu-Wing Tai. Exploring
heterogeneous algorithms for accelerating deep convolutional neural networks on fpgas. In
ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2017.
[46] Guotian Xie, Jingdong Wang, Ting Zhang, Jianhuang Lai, Richang Hong, and Guo-Jun Qi.
Interleaved structured sparse convolutional neural networks. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 8847–8856, 2018.
[47] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual
transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1492–1500, 2017.
[48] Yu Xing, Shuang Liang, Lingzhi Sui, Xijie Jia, Jiantao Qiu, Xin Liu, Yushun Wang, Yu Wang,
and Yi Shan. Dnnvm: End-to-end compiler leveraging heterogeneous optimizations on fpga-
based cnn accelerators. arXiv preprint arXiv:1902.07463, 2019.
[49] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet:
Bilateral segmentation network for real-time semantic segmentation. In European Conference
on Computer Vision (ECCV), pages 325–341, 2018.
[50] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing
fpga-based accelerator design for deep convolutional neural networks. In ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pages 161–170, 2015.
[51] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions. In
IEEE International Conference on Computer Vision (ICCV), pages 4373–4382, 2017.
[52] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient
convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 6848–6856, 2018.
[53] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR,
abs/1611.01578, 2016.
[54] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable
architectures for scalable image recognition. CoRR, abs/1707.07012, 2017.
