A Bi-Directional Co-Design Approach to Enable Deep Learning on IoT
  Devices by Zhang, Xiaofan et al.
Xiaofan Zhang 1, Cong Hao 1, Yuhong Li 1, Yao Chen 2, Jinjun Xiong 3, Wen-mei Hwu 1, Deming Chen 1,4
Abstract
Developing deep learning models for resource-constrained Internet-of-Things (IoT) devices is challenging, as it is difficult to achieve both good quality of results (QoR), such as DNN model inference accuracy, and good quality of service (QoS), such as inference latency, throughput, and power consumption. Existing approaches typically separate the DNN model development step from its deployment on IoT devices, resulting in suboptimal solutions. In this paper, we first introduce a few interesting but counterintuitive observations about such a separate design approach, and empirically show why it may lead to suboptimal designs. Motivated by these observations, we then propose a novel and practical bi-directional co-design approach: a bottom-up DNN model design strategy together with a top-down flow for DNN accelerator design. It enables joint optimization of both DNN models and their deployment configurations on IoT devices, represented here by FPGAs. We demonstrate the effectiveness of the proposed co-design approach on a real-life object detection application using a Pynq-Z1 embedded FPGA. Our method obtains state-of-the-art results on both QoR, with high accuracy (IoU), and QoS, with high throughput (FPS) and high energy efficiency.
1. Introduction
To enable deep learning capability on IoT devices, two major components must be designed: the software, e.g., DNN models whose parameters are learned for specific applications, and the hardware, such as DNN accelerators running on GPUs, FPGAs, or ASICs. Both contribute to the overall QoR and QoS without clear distinctions, so there is an urgent need for DNN and accelerator co-design.
1.1. Drawbacks of Independent Design Approaches
Typically, DNNs and their accelerators are designed and optimized separately for IoT applications in an iterative manner. DNNs are first designed with a focus on QoR. Such DNNs can be excessively complicated for the targeted IoT devices, so they must be compressed using quantization, network pruning, or sparsification (Wang et al., 2018; Han et al., 2017) before being implemented on hardware, and then retrained to maintain inference accuracy. Since no hardware constraints are captured during DNN design, this methodology can only rely on hardware accelerators to deliver good QoS through later optimizations on the hardware side. On the other hand, DNN accelerator design usually uses a consistent overall architecture (such as the recurrent (Aydonat et al., 2017; Zeng et al., 2018; Jouppi et al., 2017) or pipelined structure (Li et al., 2016; Zhang et al., 2018)) but various scale-down factors to meet different hardware constraints. When facing strict hardware constraints, scaling down the accelerator is not always feasible, as the shrinking resources can significantly slow down DNN inference and result in poor QoS. Design opportunities must then be sought on the algorithm side, in the form of more compact DNN models.
1 Department of ECE, University of Illinois Urbana-Champaign, USA. 2 Advanced Digital Sciences Center, Singapore. 3 IBM T. J. Watson Research Center, USA. 4 Inspirit IoT, Inc., USA. Correspondence to: Xiaofan Zhang <xiaofan3@illinois.edu>.
2. Empirical Observations
2.1. Challenging HW/SW Configurations
One of the most fundamental barriers to DNN and accelerator design is the different sensitivities of DNN/accelerator configurations (e.g., DNN model size, hardware utilization features). It is hard to balance these configurations using a separated DNN/accelerator design approach, since a negligible change in the DNN model may cause huge differences in the hardware accelerator and vice versa, resulting in a difficult trade-off between QoR and QoS.
Observation 1: similar compression rate but different accuracy. When designing DNNs for IoT applications, model compression is inevitable. Although the overall QoS may be the same for DNNs with similar compression rates, compressing different DNN components may cause great differences in QoR. As shown in Fig. 1 (a), the accuracy trends vary significantly between quantizing parameters and quantizing intermediate feature maps (FMs). In this figure, the coordinates of each bubble center represent accuracy and model compression rate, while the area shows data size in megabytes (MB). We scale up the bubble size of the FMs for better visibility. By compressing the model from full precision (float32) to 8-bit and 4-bit fixed-point, ternary, and binary representations, we reduce the parameter size by 22X (237.9MB→10.8MB) and the FM size by 16X (15.7MB→0.98MB), respectively. Results show that the inference accuracy is more sensitive to the precision of the FMs (9.8% accuracy drop with 16X compression) than to that of the parameters (4.8% accuracy drop with 22X compression). Challenges also come from the difficulty of DNN training. As shown in Fig. 1 (b), the accuracy growth of the compressed model is quite unstable compared to the original full-precision model. It requires more effort to design the training process (e.g., fine-tuning the training setup or iteratively modifying the DNN compression rate) and more powerful machines (e.g., computer clusters for faster training (Li et al., 2018)).
arXiv:1905.08369v1 [cs.CV] 20 May 2019
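The compression figures quoted in Observation 1 can be checked with a few lines of arithmetic. The sketch below only reuses the sizes and accuracy drops reported above; the helper names, the "drop per 1X" normalization, and the choice of 1 MB = 10^6 bytes are illustrative assumptions, not part of the original study.

```python
# Data sizes in MB as reported in the text: float32 baseline versus the
# most aggressive quantized setting for each component.
PARAM_MB = (237.9, 10.8)
FM_MB = (15.7, 0.98)


def compression_ratio(original_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the compressed data is."""
    return original_mb / compressed_mb


def size_mb(num_values: int, bits: int) -> float:
    """Storage for num_values numbers at the given bit-width.

    ASSUMPTION: 1 MB = 10**6 bytes; the source does not state its convention.
    """
    return num_values * bits / 8 / 1e6


param_ratio = compression_ratio(*PARAM_MB)
fm_ratio = compression_ratio(*FM_MB)

# Accuracy drops reported in the text, normalized by the compression ratio.
# This normalization is only an illustration of "sensitivity".
drop_per_x_param = 4.8 / param_ratio
drop_per_x_fm = 9.8 / fm_ratio

print(f"parameters:   {param_ratio:.1f}X, {drop_per_x_param:.2f} drop per 1X")
print(f"feature maps: {fm_ratio:.1f}X, {drop_per_x_fm:.2f} drop per 1X")
```

Under this normalization the feature maps lose roughly three times as much accuracy per unit of compression as the parameters, which matches the qualitative claim that accuracy is more sensitive to FM precision.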
Observation 2: similar accuracy but different hardware resource utilization. DNN models with similar QoR may also result in greatly different QoS because of different hardware resource usage. Taking the implementation of a DNN accelerator on an FPGA as an example, a single-bit difference in data representation may have a considerable impact on hardware resource utilization. Fig. 2 (a) shows BRAM (on-chip memory in FPGAs) usage under different image resize factors with 12∼16-bit data precision. By reducing the resize factor from 1.00 to 0.78, we can maintain nearly the same DNN accuracy (<1.0% drop), and we save half of the memory once the resize factor is smaller than 0.9. Similarly, Fig. 2 (b) indicates that different quantization combinations of DNN feature maps and weights can result in greatly different DSP utilization. Taking the 6-bit feature map as an example, the DSP usage drops from 128 to 64 when the weights are changed from 15-bit to 14-bit. The reason is the limited bit-width that each DSP supports for a two-input multiplication. If the bit-width of the two inputs exceeds a certain value, more DSPs are concatenated to handle one multiplication, which can easily double the resource utilization. These observations imply that the hardware/software configuration can make it very challenging to deliver the desired QoR and QoS on IoT devices.
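The step in DSP usage described in Observation 2 can be pictured with a toy threshold model. The sketch below is not the authors' resource model: the combined-width budget of 20 bits and the count of 64 parallel multipliers are hypothetical values, chosen only so that the model reproduces the reported jump from 64 to 128 DSPs. Real operand limits depend on the specific DSP primitive and on how operands are packed.

```python
import math


def dsps_per_multiply(w_bits: int, fm_bits: int, bit_budget: int) -> int:
    """DSP slices needed for one weight x feature-map product (toy model).

    ASSUMPTION: a single DSP handles a product whose combined operand width
    is at most bit_budget; wider products need proportionally more slices.
    """
    return math.ceil((w_bits + fm_bits) / bit_budget)


def total_dsps(num_multipliers: int, w_bits: int, fm_bits: int,
               bit_budget: int = 20) -> int:
    """Total DSP slices for num_multipliers parallel multipliers."""
    return num_multipliers * dsps_per_multiply(w_bits, fm_bits, bit_budget)


# Hypothetical accelerator with 64 parallel multipliers and 6-bit FMs.
for w_bits in (13, 14, 15, 16):
    print(f"W{w_bits} x FM6 -> {total_dsps(64, w_bits, 6)} DSPs")
```

The point of the sketch is the discontinuity: resource usage is flat until the operand widths cross the budget, then doubles with a single extra bit.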
2.2. Confusing QoR Upper-Bounds for Given Tasks
When deploying DNNs on IoT devices, it is common to first find a DNN with the desired QoR upper-bound for the targeted application, and then to prune the DNN to make up for the lost QoS on hardware. This solution assumes that complicated DNNs with more parameters always deliver higher QoR than simple DNNs with fewer parameters. However, this is not always true. By examining a UAV-based object detection task (DAC, 2018), we observe an abnormal trend regarding model size and QoR upper-bound (Table 1), where DNNs with more parameters fail to deliver higher accuracy after adequate training. This implies that the current separated DNN/accelerator design may only reach suboptimal solutions, and requires more time and effort for iterative refinement before delivering perfect QoR and QoS.
Figure 1. (a) Accuracy trends of AlexNet inference on the ImageNet dataset during parameter (blue) and feature map (green) compression with retraining. Model names are denoted in p1-p2p3p4p5 format, with precision p1 for the FMs, p2 for the 1st CONV, p3 for the 2nd∼5th CONVs, p4 for the 1st∼2nd FCs, and p5 for the 3rd FC; (b) Training of ResNet-20 on the Cifar10 dataset using ADMM with full-precision (blue) and quantized (green) FMs and parameters.
Figure 2. (a) BRAM usage of accelerators with the same architecture but 12∼16-bit quantization for feature maps (FM12∼FM16) and different image resize factors. (b) DSP utilization of accelerators using different quantizations for weights (W) and feature maps (FMs), with the numbers indicating the bits allocated.
3. The Proposed Bi-Directional Co-Design
Motivated by the discussed observations, we propose a bi-directional co-design methodology with a bottom-up hardware-oriented DNN design and a top-down accelerator design considering DNN-specific characteristics. Both DNNs and accelerators are designed simultaneously to pursue the best trade-off between QoS and QoR. The overall flow of the proposed co-design is shown in Fig. 3. The inputs of this flow include the targeted QoS, QoR, and the hardware resource constraints; the outputs include the generated DNN model and its corresponding accelerator design. We break down the whole flow into three steps:
Step1: Bundle construction and QoS evaluation. We randomly select DNN components from the layer pool and construct bundles (the basic building blocks of generated DNNs) with different layer combinations. Each bundle is evaluated by analytical models to capture its hardware characteristics (e.g., latency, computation and memory demands, resource utilization), which allows QoS estimation at an early stage of DNN exploration.
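Step1 can be sketched as follows. The layer pool contents, the function names, and the use of multiply-accumulate (MAC) and weight counts as the analytical cost are assumptions made for illustration; the source does not list its pool or its exact analytical models. The per-layer formulas themselves are the standard counts for stride-1, same-padded convolutions without bias.

```python
import random
from dataclasses import dataclass

# Hypothetical layer pool; the source does not enumerate its contents.
LAYER_POOL = ["dw_conv3", "pw_conv1", "conv3", "max_pool2"]


@dataclass(frozen=True)
class FeatureMap:
    channels: int
    height: int
    width: int


def layer_cost(layer: str, fm: FeatureMap, out_channels: int):
    """Return (MACs, weight count, output feature map) for one layer."""
    c, h, w = fm.channels, fm.height, fm.width
    if layer == "dw_conv3":   # depth-wise 3x3: one 3x3 filter per channel
        return 9 * c * h * w, 9 * c, fm
    if layer == "pw_conv1":   # point-wise 1x1: mixes channels only
        return c * out_channels * h * w, c * out_channels, \
            FeatureMap(out_channels, h, w)
    if layer == "conv3":      # standard 3x3 convolution
        return 9 * c * out_channels * h * w, 9 * c * out_channels, \
            FeatureMap(out_channels, h, w)
    if layer == "max_pool2":  # 2x2 pooling: comparisons only, no MACs
        return 0, 0, FeatureMap(c, h // 2, w // 2)
    raise ValueError(f"unknown layer: {layer}")


def evaluate_bundle(bundle, fm: FeatureMap, out_channels: int):
    """Accumulate MACs and weights over the layers of one bundle."""
    macs = weights = 0
    for layer in bundle:
        layer_macs, layer_weights, fm = layer_cost(layer, fm, out_channels)
        macs += layer_macs
        weights += layer_weights
    return macs, weights, fm


def random_bundles(count: int, max_layers: int, seed: int = 0):
    """Draw `count` random layer combinations from the pool."""
    rng = random.Random(seed)
    return [tuple(rng.choice(LAYER_POOL)
                  for _ in range(rng.randint(1, max_layers)))
            for _ in range(count)]


image = FeatureMap(channels=3, height=160, width=360)
for bundle in random_bundles(count=3, max_layers=3):
    print(bundle, evaluate_bundle(bundle, image, out_channels=48)[:2])
```

A latency or resource estimate would be derived from these counts together with device parameters, which this sketch deliberately leaves out.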
Step2: QoR- and QoS-based bundle selection. To select the most promising bundles, we first evaluate the QoR potential of each bundle by replicating the bundle n times to construct a prototype DNN. All prototype DNNs are quickly trained (20 epochs) directly on the targeted dataset to obtain accuracy results. Based on the QoS estimation in Step1, we group prototype DNNs with similar QoS to the input targets and select the top-n bundle candidates of each group.
Step3: Hardware-aware DNN exploration. By stacking the selected bundles, we explore DNNs with a bottom-up approach under the given QoS and QoR constraints using stochastic coordinate descent (SCD). DNNs output by SCD are precisely evaluated regarding their QoS, and the results are fed back to SCD for the DNN model update. The generated DNNs that meet the QoS targets are output for training and fine-tuning to improve their QoR.
Figure 3. The proposed bi-directional co-design with a bottom-up DNN model exploration and a top-down accelerator design approach. For DNN exploration, we start from hardware-aware building templates (called Bundles) and grow the DNN to reach the desired QoR; for accelerator design, we follow the proposed architecture using a bundle-reused, tile-based pipeline, and optimize configurable parameters to pursue the targeted QoS.
Table 1. DNNs for single object detection on 3×160×360 input images using the different backbones listed (without fully-connected layers) but the same back-end for bounding box regression.
Backbone                          Para. Size (MB)   IoU
ResNet-18 (He et al., 2016)       85                61%
ResNet-32 (He et al., 2016)       162               26%
ResNet-50 (He et al., 2016)       179               32%
VGG-16 (Simonyan et al., 2014)    56                25%
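The exploration loop of Step3 can be pictured with a simple stochastic coordinate descent variant: at each iteration one randomly chosen coordinate of the DNN configuration is perturbed, and the move is kept only if it brings the estimated latency closer to the target. Everything concrete in the sketch below is an assumption for illustration: the three coordinates, their bounds and step sizes, and the toy latency estimator (which pretends the accelerator sustains 10^9 MACs per second) are hypothetical and do not reproduce the paper's QoS evaluation.

```python
import random

# Hypothetical search space: integer coordinates with (low, high) bounds.
BOUNDS = {"replications": (1, 8), "base_channels": (8, 64), "pool_layers": (0, 4)}
STEPS = {"replications": 1, "base_channels": 8, "pool_layers": 1}


def estimate_latency_ms(cfg: dict) -> float:
    """Toy stand-in for the QoS evaluation (NOT the paper's model)."""
    pixels = (160 * 360) / (4 ** cfg["pool_layers"])
    macs = cfg["replications"] * cfg["base_channels"] ** 2 * pixels
    return macs / 1e6  # milliseconds at an assumed 1e9 MAC/s


def scd(cfg: dict, target_ms: float, tol_ms: float,
        iters: int = 200, seed: int = 0):
    """Move one random coordinate per iteration toward the latency target."""
    rng = random.Random(seed)
    cfg = dict(cfg)
    best_gap = abs(estimate_latency_ms(cfg) - target_ms)
    for _ in range(iters):
        if best_gap <= tol_ms:
            break  # QoS target met within tolerance
        key = rng.choice(sorted(BOUNDS))
        low, high = BOUNDS[key]
        step = rng.choice((-1, 1)) * STEPS[key]
        candidate = dict(cfg)
        candidate[key] = min(high, max(low, cfg[key] + step))
        gap = abs(estimate_latency_ms(candidate) - target_ms)
        if gap < best_gap:  # keep the move only if it narrows the gap
            cfg, best_gap = candidate, gap
    return cfg, best_gap


start = {"replications": 4, "base_channels": 32, "pool_layers": 2}
found, gap = scd(start, target_ms=30.0, tol_ms=2.0)
print(found, f"gap to target: {gap:.2f} ms")
```

In the flow described above, configurations that meet the QoS target would then be handed to training and fine-tuning; the sketch stops at the QoS side and does not model QoR.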
We propose a DNN accelerator with a tile-based pipelined architecture for efficient implementation of DNN applications with a maximum resource sharing strategy. It includes a folded structure that computes DNN bundles sequentially by reusing the same hardware computing components, saving resources when targeting compact IoT devices. To ensure better QoS, it also uses an unfolded structure that computes the operations (partitioned into tiles) inside each bundle in a pipelined manner. By combining the folded and unfolded structures, the proposed architecture gains the advantages of both recurrent and pipelined structures.
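A first-order cycle count illustrates why combining the two structures helps. The sketch below uses the common pipeline estimate (fill the pipeline once, then one tile completes per bottleneck-stage interval); the function names and the example stage timings are hypothetical, and the estimate ignores memory stalls and control overhead.

```python
def pipelined_bundle_cycles(stage_cycles, num_tiles: int) -> int:
    """Cycles for one bundle whose layers form a tile-level pipeline.

    The first tile passes through every stage; each later tile then
    finishes one bottleneck-stage interval after the previous one.
    """
    return sum(stage_cycles) + (num_tiles - 1) * max(stage_cycles)


def sequential_bundle_cycles(stage_cycles, num_tiles: int) -> int:
    """Cycles for the same bundle with no overlap between stages."""
    return num_tiles * sum(stage_cycles)


def folded_network_cycles(bundles) -> int:
    """Bundles run one after another on the same hardware (folded)."""
    return sum(pipelined_bundle_cycles(stages, tiles)
               for stages, tiles in bundles)


# Hypothetical bundle: DW-Conv3, PW-Conv1, and pooling stages, in cycles per tile.
stages = [100, 200, 50]
print(pipelined_bundle_cycles(stages, num_tiles=4))   # pipelined inside the bundle
print(sequential_bundle_cycles(stages, num_tiles=4))  # no pipelining, for contrast
print(folded_network_cycles([(stages, 4), (stages, 2)]))
```

Folding keeps the hardware footprint at one bundle's worth of compute units, while pipelining inside the bundle recovers part of the throughput that purely sequential execution would lose.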
4. Results and Conclusions
We demonstrate the proposed bi-directional co-design on a real-life object detection task from the DAC’18 System Design Contest and generate three DNNs (A, B, and C in Table 2) and corresponding accelerators on a Pynq-Z1 FPGA for different QoS-QoR combinations. The proposed co-design flow first identifies that the bundle with DW-Conv3, PW-Conv1, and max-pooling layers is the most promising building template for the target hardware device and application. Based on this bundle, the co-design explores three DNN configurations with different quantization schemes to satisfy the respective QoR demands. As shown in Table 3, we deliver the best FPS (29.7) and efficiency (12.38 image/watt) using the same FPGA as the FPGA champion design. Among the three, the proposed DNN-C outperforms the FPGA winning design in all aspects, with 6.2% higher IoU, 1.45X higher FPS, and 2.4X higher efficiency. Compared to the GPU winning design, the DNN-C design delivers comparable accuracy but 3.6X higher efficiency.
Table 2. The proposed DNNs with different data precisions for weights (W) and feature maps (F). The convolutional layers include depth-wise (DW) 3×3 and point-wise (PW) 1×1 convolutions, with the output channel number shown in brackets.
A (W16, F8)          B (W16, F16)         C (W11, F8)
input (3×160×360 color image), common to A, B, and C
DW-Conv3 (3)         DW-Conv3 (3)         DW-Conv3 (3)
PW-Conv1 (48)        PW-Conv1 (48)        PW-Conv1 (48)
2×2 max-pooling      2×2 max-pooling      2×2 max-pooling
DW-Conv3 (48)        DW-Conv3 (48)        DW-Conv3 (48)
PW-Conv1 (96)        PW-Conv1 (96)        PW-Conv1 (96)
2×2 max-pooling      2×2 max-pooling      2×2 max-pooling
DW-Conv3 (96)        DW-Conv3 (96)        DW-Conv3 (96)
PW-Conv1 (192)       PW-Conv1 (192)       PW-Conv1 (192)
2×2 max-pooling      2×2 max-pooling      2×2 max-pooling
DW-Conv3 (192)       DW-Conv3 (192)       DW-Conv3 (192)
PW-Conv1 (384)       PW-Conv1 (384)       PW-Conv1 (384)
                                          DW-Conv3 (384)
                                          PW-Conv1 (512)
PW-Conv1 (10)        PW-Conv1 (10)        PW-Conv1 (10)
Back-end for bounding box regression, common to A, B, and C
Table 3. Result comparisons to the champion designs in the FPGA and GPU tracks of the DAC’18 System Design Contest (DAC, 2018).
Model                   IoU     FPS    Efficiency (image/watt)
The proposed DNN-A      59.3%   29.7   12.38
The proposed DNN-B      61.2%   22.7   9.46
The proposed DNN-C      68.6%   17.4   6.96
Modified SSD (FPGA)     62.4%   12.0   2.86
Modified Yolo (GPU)     69.8%   24.6   1.95
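The comparison factors quoted for DNN-C follow directly from the Table 3 entries. The short check below only restates the reported numbers; the dictionary layout and the `compare` helper are illustrative.

```python
# (IoU in %, FPS, efficiency in image/watt), copied from Table 3.
RESULTS = {
    "DNN-A": (59.3, 29.7, 12.38),
    "DNN-B": (61.2, 22.7, 9.46),
    "DNN-C": (68.6, 17.4, 6.96),
    "Modified SSD (FPGA)": (62.4, 12.0, 2.86),
    "Modified Yolo (GPU)": (69.8, 24.6, 1.95),
}


def compare(model: str, baseline: str):
    """Return (IoU difference in points, FPS ratio, efficiency ratio)."""
    iou, fps, eff = RESULTS[model]
    base_iou, base_fps, base_eff = RESULTS[baseline]
    return iou - base_iou, fps / base_fps, eff / base_eff


for baseline in ("Modified SSD (FPGA)", "Modified Yolo (GPU)"):
    d_iou, r_fps, r_eff = compare("DNN-C", baseline)
    print(f"DNN-C vs {baseline}: IoU {d_iou:+.1f}, "
          f"FPS {r_fps:.2f}X, efficiency {r_eff:.1f}X")
```

Against the GPU design, DNN-C trades 1.2 points of IoU and a lower frame rate for the higher efficiency, which is what "comparable accuracy" refers to.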
Acknowledgment
This work was partly supported by the IBM-Illinois Center for Cognitive Computing System Research (C3SR), a research collaboration as part of the IBM AI Horizons Network.
References
DAC System Design Contest. https://github.com/xyzxinyizhang/2018-DAC-System-Design-Contest, 2018.
Aydonat, U., O’Connell, S., Capalija, D., Ling, A. C., and Chiu, G. R. An OpenCL deep learning accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 55–64. ACM, 2017.
Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao, S., Wang, Y., et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84. ACM, 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. IEEE, 2017.
Li, H., Fan, X., Jiao, L., Cao, W., Zhou, X., and Wang, L. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–9. IEEE, 2016.
Li, Y., Yu, M., Li, S., Avestimehr, S., Kim, N. S., and Schwing, A. Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS’18), Montreal, Canada, December 2018.
Simonyan, K. et al. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Wang, J., Lou, Q., Zhang, X., Zhu, C., Lin, Y., and Chen, D. Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 163–1636. IEEE, 2018.
Zeng, H., Chen, R., Zhang, C., and Prasanna, V. A framework for generating high throughput CNN implementations on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 117–126. ACM, 2018.
Zhang, X., Wang, J., Zhu, C., Lin, Y., Xiong, J., Hwu, W.-m., and Chen, D. DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of the International Conference on Computer-Aided Design, pp. 56. ACM, 2018.
