A System-Level Solution for Low-Power Object Detection by Li, Fanrong et al.
A System-Level Solution for Low-Power Object Detection
Fanrong Li1,2∗, Zitao Mo1,2∗, Peisong Wang1, Zejian Liu1,2∗,
Jiayun Zhang1∗, Gang Li1,2, Qinghao Hu1, Xiangyu He1,2, Cong Leng1,3,
Yang Zhang1,3, Jian Cheng1,2,3,4
1Institute of Automation, Chinese Academy of Sciences
2University of Chinese Academy of Sciences, 3AiRiA
4CAS Center for Excellence in Brain Science and Intelligence Technology
{lifanrong2017, mozitao2017}@ia.ac.cn, {gang.li, jcheng}@nlpr.ia.ac.cn
Abstract
Object detection has made impressive progress in recent
years with the help of deep learning. However, state-of-
the-art algorithms are both computation and memory in-
tensive. Though many lightweight networks are developed
for a trade-off between accuracy and efficiency, it is still a
challenge to make it practical on an embedded device. In
this paper, we present a system-level solution for efficient
object detection on a heterogeneous embedded device. The
detection network is quantized to low bits and allows effi-
cient implementation with shift operators. In order to make
the most of the benefits of low-bit quantization, we design a
dedicated accelerator with programmable logic. Inside the
accelerator, a hybrid dataflow is exploited according to the
heterogeneous property of different convolutional layers.
We adopt a straightforward but resource-friendly column-
prior tiling strategy to map the computation-intensive con-
volutional layers to the accelerator that can support arbi-
trary feature size. Other operations can be performed on
the low-power CPU cores, and the entire system is executed
in a pipelined manner. As a case study, we evaluate our
object detection system on a real-world surveillance video
with input size of 512×512, and it turns out that the system
can achieve an inference speed of 18 fps at the cost of 6.9W
(with display) with an mAP of 66.4 verified on the PASCAL
VOC 2012 dataset.
1. Introduction
Since AlexNet [7] won the 2012 large-scale image
recognition contest, Deep Convolutional Neural Networks
(DCNNs) have shown increasing performance in various
computer vision tasks. The impressive performance of CNNs
is mainly due to their high complexity and capacity, in
other words, the large number of parameters and computations.
Therefore, high-performance hardware such as
GPUs (or GPU clusters) is often utilized for acceleration.
However, for embedded and mobile devices such as drones,
security cameras, and smart glasses, GPU-based solutions
are not the best choice due to limits on volume
and power consumption. In addition, modern GPUs, which are
designed for general-purpose processing, are not flexible
enough to deal with low-bit integer values of fewer than 8 bits
without effort spent on tuning the code. As a result,
FPGA-based accelerators have gained popularity in recent
years in both the industrial and academic communities.
∗These authors contributed equally.
As for memory efficiency, we find that the advantages of
the recent depthwise convolution [3, 5] are apparent. Unlike
traditional convolution, in depthwise convolution each
output feature map relies solely on a single input feature
map in the previous layer, which dramatically reduces the
amount of computation and the demand for on-chip storage.
In terms of resource and energy efficiency, recent
logarithmic computation [4, 12, 8] has shown its promise. It
quantizes the weights to powers of two so that
multiplications become bit-shift operations, which
removes the bottleneck of insufficient on-chip DSP blocks.
Considering the advantages of depthwise convolution
and logarithmic computation mentioned above, we put forward
an end-to-end hardware-software co-design for
low-power object detection on resource-constrained FPGAs. Our
proposed solution can achieve relatively high performance
under an extremely low resource budget while retaining
considerable accuracy. The contributions of this work can be
summarized as follows:
• We propose a dedicated object detection accelera-
tor for customized MobileNet-SSD [9, 5] algorithm
through software-hardware co-design. Specifically,
we quantize the activations and weights to 4-bit in-
teger and 3-bit power-of-two integer respectively, and
present a fused-layer architecture with shift-based pro-
cessing elements.
• We adopt a column-prior strategy to map the detection
network to the accelerator, which can reduce resource
consumption. Besides, a hybrid dataflow is introduced
to reuse output or weights according to the heteroge-
neous property of different layers.
• We highlight the entire pipeline of our heterogeneous
system design, including the hardware accelerator, host
processing, and thread management on the main processor,
and describe each stage in detail.
• We verify the performance of our design on the
heterogeneous Ultra96 SoC, which targets IoT applications.
Experiments show that the entire system can
reach an inference speed of 18 fps at the cost of around
6.9W.
The rest of the paper is organized as follows. Section 2
describes the quantization algorithm, with which
we quantize weights to powers of two and enable
resource-friendly shift-based multiplications. Section 3
briefly presents the overall system architecture. Section
4 introduces the architecture of the dedicated accelerator,
including Processing Elements (PEs), tiling strategy, and
dataflow. Section 5 reports the experimental results as well
as multithread management on low-power CPUs.
2. Quantization
To make the CNN model compatible with our hardware
architecture design, we introduce a three-step quantization
method, i.e., uniform activation quantization, power-of-two
weight quantization as well as scale quantization, as illus-
trated in Figure 1. It is worth noting that through the pro-
posed three-step quantization, all computing can be trans-
formed into fixed-point operations, without any floating-
point values.
Figure 1. Three-Step Quantization Pipeline: Activation Quantization,
then Weight Quantization, then Scale Quantization.
2.1. Uniform Activation Quantization
For $M$-bit activation quantization, we want to quantize
all the positive activations into the set
$A = \{0, 1, 2, \dots, 2^M - 1\}$. As with many other fixed-point
quantization methods, we also introduce a scaling factor $\alpha$
to lower the quantization error, making the quantization set

$$A = \{0, 1, 2, \dots, 2^M - 1\} \cdot \alpha$$

To turn all activations into fixed-point numbers, we
quantize each floating-point activation to the nearest point in
the set $A$. The $2^M - 1$ quantization thresholds can be set to
the midpoints of two successive quantized values:

$$t_i = \frac{(i-1)\alpha + i\alpha}{2} = \left(i - \frac{1}{2}\right)\alpha, \quad \text{for } i = 1, \dots, 2^M - 1 \tag{1}$$
Thus the quantization function $Q_A$ can be formulated as

$$Q_A(x) = \begin{cases} (2^M - 1)\alpha, & x > t_{2^M-1}, \\ i\alpha, & x \in (t_i, t_{i+1}], \\ 0, & x \le t_1 \end{cases} \tag{2}$$
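As a minimal sketch of Eq. 2, the function below maps one activation to its quantized value by counting how many thresholds $t_i = (i - 1/2)\alpha$ lie strictly below it. The function name `quantize_activation` and the scalar, pure-Python formulation are assumptions made for illustration; they are not taken from the implementation described in this paper.

```python
def quantize_activation(x: float, alpha: float, m_bits: int) -> float:
    """Quantize x to the nearest point of {0, 1, ..., 2^M - 1} * alpha (Eq. 2)."""
    max_code = (1 << m_bits) - 1  # largest integer code, 2^M - 1
    code = 0
    for i in range(1, max_code + 1):
        # Threshold t_i = (i - 1/2) * alpha; x in (t_i, t_{i+1}] maps to code i.
        if x > (i - 0.5) * alpha:
            code = i
    return code * alpha
```

Values at or below $t_1$ (including negative inputs) map to 0, and values above the last threshold saturate at $(2^M - 1)\alpha$.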
2.2. Power-of-Two Weight Quantization
For weights, we utilize power-of-two quantization. In this
way, the floating-point multiplications within the convolution
can be transformed into shift operations, which can
dramatically lower the complexity of the CNN and the hardware
design. The 4-D weight tensor consists of $n$ kernels of size
$w \times h \times c$, which are quantized using different scaling
factors. More specifically, the 4-D tensor
$W \in \mathbb{R}^{w \times h \times c \times n}$ is reshaped into a matrix
$W \in \mathbb{R}^{(w \cdot h \cdot c) \times n}$, where each column
$w_i \in \mathbb{R}^{w \cdot h \cdot c}$ corresponds to a 3-D kernel. To lower
the quantization error, a floating-point scaling factor $\beta_i$ is
introduced for each kernel $w_i$, i.e., for $N$-bit quantization,
the problem is to select weight values from the set

$$B_i = \{0, \pm 2^0, \pm 2^1, \dots, \pm 2^{2^{N-1}-2}\} \cdot \beta_i$$

Here we also use nearest quantization, and the $2^N - 2$
quantization thresholds can again be set to the midpoints
of two successive quantized values, as in the activation
quantization.
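The set $B_i$ can be made concrete with a short sketch. For $N = 3$ it contains the seven integer levels $\{0, \pm 1, \pm 2, \pm 4\}$ scaled by $\beta_i$. The helper names `pow2_levels` and `quantize_weight` are hypothetical, and choosing the level with the smallest absolute error is used here as a stand-in for the midpoint thresholds; the two differ only in how exact ties are resolved.

```python
from typing import List, Tuple


def pow2_levels(n_bits: int) -> List[int]:
    """Integer levels {0, +-2^0, ..., +-2^(2^(N-1)-2)} for N-bit weights."""
    max_exp = (1 << (n_bits - 1)) - 2
    positive = [1 << e for e in range(max_exp + 1)]
    return sorted([-p for p in positive] + [0] + positive)


def quantize_weight(w: float, beta: float, n_bits: int) -> Tuple[int, float]:
    """Return (integer level, dequantized value) of the point of B_i nearest to w."""
    level = min(pow2_levels(n_bits), key=lambda q: abs(w - q * beta))
    return level, level * beta
```

For example, with $\beta_i = 0.5$ and $N = 3$, a weight of $-0.7$ is mapped to the level $-1$, i.e., the value $-0.5$.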
2.3. Scale Quantization
By activation and weight quantization, the convolution
can be performed with only fixed-point operations. How-
ever, the whole network still requires floating-point opera-
tions due to the introduced scaling factors, bias term of con-
volution, as well as some other layers like Batch Normal-
ization. To further eliminate the above mentioned floating-
point operations, we introduce the scale quantization, which
consists of two parts:
Scale merge: For the $l$-th layer, the input activation $X$
can be represented by $X = \alpha \hat{X}$, where $\hat{X}$ is the
fixed-point version of $X$ and $\alpha$ is the scaling factor. Similarly,
$w = \beta \hat{w}$, where $\hat{w}$ is one of the fixed-point kernels.
Table 1. Quantization results on ImageNet classification (top-1
accuracy). #Act., #Wei., and #Sca. represent the number of bits
for activations, weights, and scaling factors, respectively.
Model      #Act.  #Wei.  #Sca.  Accuracy
MobileNet  Full   Full   Full   70.1
MobileNet  8      3      8      68.3
MobileNet  4      3      8      68.1
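For simplicity, we drop the kernel index. Taking the
Batch Normalization term into account, the convolutional layer can
be represented by the following equation:

$$\begin{aligned} Y = \alpha' \hat{Y} &= Q_A(\mathrm{BN}(\beta \hat{w} \otimes \alpha \hat{X})) \\ &= Q_A(\gamma \alpha \beta \, \hat{w} \otimes \hat{X} + b) \\ &= Q_A(a \, \hat{w} \otimes \hat{X} + b) \end{aligned} \tag{3}$$

where $Y$ is the output activation, $\hat{Y}$ is the fixed-point
version of the output activation, $\alpha'$ is the scaling factor for
the outputs, and $a = \gamma \alpha \beta$. $\mathrm{BN}(x) = \gamma x + b$ is the
batch normalization layer and $\otimes$ is the convolution.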
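To further merge out the output scaling factor, we can
divide both sides of Eq. 3 by $\alpha'$, resulting in the following
equation:

$$\begin{aligned} \hat{Y} &= \hat{Q}_A\left(\frac{a}{\alpha'} \, \hat{w} \otimes \hat{X} + \frac{b}{\alpha'}\right) \\ &= \hat{Q}_A(a' \, \hat{w} \otimes \hat{X} + b') \end{aligned} \tag{4}$$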
Note that the activation quantization function needs to
be changed accordingly. By defining $\hat{t}_i = t_i / \alpha'$, the new
quantization function becomes:

$$\hat{Q}_A(x) = \mathrm{round}(\mathrm{clip}(x, 0, 2^M - 1)), \tag{5}$$

where $\mathrm{round}(x)$ is the rounding operation, and $\mathrm{clip}(x, u, v)$
clips $x$ to the range between $u$ and $v$.
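Eqs. 4 and 5 together describe how one integer accumulator value becomes an $M$-bit output code. The sketch below illustrates this for a single output. The name `requantize` is hypothetical, `acc` stands for one element of $\hat{w} \otimes \hat{X}$, and rounding halves upward via `floor(x + 0.5)` is an assumption, since the text does not specify tie handling. Here $a'$ and $b'$ are still floating-point, as they are before the scale quantization step.

```python
import math


def requantize(acc: int, a_prime: float, b_prime: float, m_bits: int) -> int:
    """Apply Eq. 4 and Eq. 5 to one accumulator value: round(clip(a' * acc + b'))."""
    x = a_prime * acc + b_prime
    clipped = min(max(x, 0.0), float((1 << m_bits) - 1))
    return math.floor(clipped + 0.5)  # round half up (assumed tie rule)
```

With $M = 4$, any pre-activation above 15 saturates at 15 and any negative value becomes 0, which matches the clipping behavior of $Q_A$ in Eq. 2 after division by $\alpha'$.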
Scale quantization: In Eq. 4, only $a'$ and $b'$ are
floating-point values. Note that Eq. 4 corresponds to only one 3-D
kernel; for the whole convolutional layer, there are $n$ pairs of $a'$
and $b'$, denoted by the vectors $\mathbf{a}'$ and $\mathbf{b}'$. In the scale
quantization, we need to quantize these values into fixed-point numbers.
During the scale quantization, no further scaling factors can
be introduced. However, directly quantizing $\mathbf{a}'$ and $\mathbf{b}'$
would introduce a large quantization error. Instead, we search for
the binary point position, so that the values are quantized into
the following set:

$$C = \{0, \pm 1, \pm 2, \dots, \pm (2^{K-1} - 1)\} \cdot 2^d,$$

where $d$ represents the binary point position. More specifically,
as $d$ ranges from 0 to -15, we pick the $d$
that minimizes the quantization error for $\mathbf{a}'$ and $\mathbf{b}'$.
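The search over the binary point position can be sketched as an exhaustive loop over $d \in \{0, -1, \dots, -15\}$. The function name `quantize_scales`, the squared-error criterion, and the use of one shared $d$ per vector are assumptions for illustration; the text only states that the best $d$ minimizes the quantization error. Python's built-in `round` rounds exact halves to the nearest even integer, which is also an implementation detail of this sketch rather than of the paper.

```python
from typing import List, Sequence, Tuple


def quantize_scales(values: Sequence[float], k_bits: int) -> Tuple[int, List[int]]:
    """Search d in {0, -1, ..., -15} and return (best d, K-bit integer codes)."""
    q_max = (1 << (k_bits - 1)) - 1  # codes lie in [-(2^(K-1) - 1), 2^(K-1) - 1]
    best_err, best_d, best_codes = None, 0, []
    for d in range(0, -16, -1):
        step = 2.0 ** d
        codes = [max(-q_max, min(q_max, round(v / step))) for v in values]
        err = sum((v - c * step) ** 2 for v, c in zip(values, codes))
        if best_err is None or err < best_err:
            best_err, best_d, best_codes = err, d, codes
    return best_d, best_codes
```

Each value is then represented as `code * 2**d`, so applying it in hardware needs only an integer multiplication followed by a shift.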
Figure 2. Architecture of the entire system. [Block diagram: an APU with
four ARM Cortex-A53 cores, a GIC, and 2GB LPDDR4 memory, connected through
AXI_LITE, AXI_MM, and IRQ interfaces to the accelerator, which contains the
Co-Processor, Controller, Buffer, DMA, and the processing elements (PE_33,
PE_11, PE_DW, PE_HEAD) and runs on a 215MHz clock.]
2.4. Optimization
The optimization problem can be solved efficiently using
Lloyd's algorithm. Take the activation quantization
problem of Section 2.1 as an example. During the assignment
step, all activation data points are quantized to the nearest
fixed-point values in the set $A$ according to the
quantization function $Q_A(x)$. In the update step, the new scaling
factor is obtained by solving a one-dimensional
optimization problem:

$$\alpha^* = \arg\min_{\alpha} \sum_{x} (x - Q_A(x))^2 \tag{6}$$

By iterating these two steps, we can find the optimal scaling
factors as well as the quantized values.
After the activation quantization and weight quantiza-
tion, we need to fine-tune the whole network to restore ac-
curacy.
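The alternating procedure of Section 2.4 can be sketched for the activation scaling factor as follows. With the assignments fixed, each $Q_A(x)$ equals $c_x \alpha$ for an integer code $c_x$, so Eq. 6 becomes a least-squares problem whose minimizer is $\alpha = \sum_x x c_x / \sum_x c_x^2$. This closed form is our reading of the "one-dimensional optimization problem" and is an assumption, as are the name `fit_alpha`, the initial value `alpha0`, and the fixed iteration cap.

```python
import math
from typing import Sequence


def fit_alpha(xs: Sequence[float], m_bits: int, alpha0: float, iters: int = 20) -> float:
    """Alternate nearest-level assignment and a least-squares update of alpha."""
    max_code = (1 << m_bits) - 1
    alpha = alpha0
    for _ in range(iters):
        # Assignment step: x in (t_i, t_{i+1}] gets code i, clamped to [0, 2^M - 1].
        codes = [min(max_code, max(0, math.ceil(x / alpha - 0.5))) for x in xs]
        denom = sum(c * c for c in codes)
        if denom == 0:
            break  # every activation quantized to zero; alpha cannot be updated
        # Update step: minimize sum (x - c * alpha)^2 over alpha with codes fixed.
        new_alpha = sum(x * c for x, c in zip(xs, codes)) / denom
        if new_alpha == alpha:
            break  # converged
        alpha = new_alpha
    return alpha
```

As with Lloyd's algorithm in general, the result depends on the initial value `alpha0`, so in practice several starting points may need to be tried.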
2.5. Performance
The experiments are conducted on the ImageNet
classification benchmark, and the results are shown in Table 1.
With 4-bit activations, 3-bit weights, and 8-bit scaling factors,
the three-step quantization approach reaches 68.1% top-1 accuracy,
a drop of only 2.0 points compared with the floating-point
counterpart (70.1%).
3. System Architecture
Our detection network is designed to run on the Ultra96
development board, a heterogeneous embedded system
containing both programmable logic and low-power
CPU cores. A 2GB LPDDR4 memory is shared by the Programmable
Logic (PL) and the Processing System (PS). Since convolutional
layers dominate the inference time, we implement
a dedicated CNN accelerator in the Programmable
Logic.
The entire system includes the following functional layers:
• Data forward layer: decodes video streams.
• Encode layer: organizes data into the specific pattern required
by the FPGA accelerator.
• FPGA layer: performs all convolutions on the dedicated accelerator.
• Decode layer: reorganizes the features extracted by the accelerator
into the storage pattern used by the CPU.
• Mbox-conf-reshape layer: reshapes bounding boxes.
• Mbox-conf-softmax layer: the softmax layer of the detection.
• Mbox-conf-flatten layer: reshapes data.
• Detection and visualize layer: generates detection results and
displays them on the screen.
All layers except the FPGA layer are executed on the CPU. All
operations before the FPGA layer are referred to as
pre-processing, while those after the FPGA layer are referred to as
post-processing.
At the very beginning, images together with the weights
and instructions of a specific CNN are stored in DDR. The
CPU initiates a calculation request and transfers instructions
to the accelerator through AXI. The accelerator receives the
instructions and completes all convolution computation. Note
that the accelerator has its own instruction set and can
complete the calculations independently unless interrupted
by exceptions. Results of the FPGA layer are sent back
to the CPU for post-processing. Multithreading is exploited
to make full use of the four low-power ARM cores.
The entire system works in a pipelined manner, and the
system architecture is shown in Figure 2.
4. Dedicated Accelerator
In this section, we first describe the overall architecture
of our accelerator, which exploits multiple PEs for high
computing parallelism. Then the design of PE is introduced.
After that, the column-prior tiling strategy is presented to
support the arbitrary size of input feature maps under lim-
ited resources. Finally, a hybrid dataflow is proposed for
more efficiency.
4.1. Overall Architecture
Figure 3 shows the overall architecture of our accelerator
with different types of PEs inside. The Co-Processor
module controls the entire computation flow. It parses
instructions to generate control information for the Memory
Controller and the different kinds of PEs. The Memory Controller
calculates the addresses of activations and weights, so that
all kinds of data can be sent to the proper
destinations. Prefetching is enabled since we implement
a 4KB instruction cache inside the Co-Processor. Note
that some cache features are unavailable in this design because
they are unnecessary for a specific accelerator without
branch and jump instructions. Controllers for the different
types of PEs generate control signals according to the control
information received from the Co-Processor. The IARAM and
OARAM are used to store the intermediate feature maps
during computation, where the IARAM is implemented with
three banks, providing sufficient bandwidth to complete the
3 × 3 convolution more efficiently. The IARAMs and
OARAM can be logically swapped between the computation
of two adjacent layers. We implement two-level weight
caches (Weight Buffer and WRAM) with on-chip registers
and BRAMs, which can provide sufficient bandwidth for
computing.

Figure 3. Architecture of the dedicated CNN accelerator, showing only one
PE of each type. [Block diagram: inside the Programmable Logic, the
Co-Processor (ICache 4KB), Memory Controller, Weight Buffer, WRAM (96KB),
IARAM_0, IARAM_1, and IARAM_2 (96KB each), OARAM (256KB), Inter RAM (16KB),
and a Controller driving PE_33, PE_11, PE_DW, and PE_HEAD; a DMA connects
the Programmable Logic to the Processing System.]
4.2. Processing Elements
The heterogeneous nature of 1×1 convolution and depthwise
convolution can make sharing processing elements between them
costly, so reusing PEs does not necessarily bring benefits
and runs contrary to our original intention of designing a
dedicated low-power accelerator. Therefore, PEs are specialized
for different kinds of convolutional layers, i.e., 3×3
convolution (PE_33), 1×1 convolution (PE_11), and depthwise
convolution (PE_DW), in order to reduce
the control complexity and improve hardware efficiency.
PE_HEAD is additionally needed to efficiently compute the
location offsets in the detection algorithm. Each type of PE
is mainly composed of multipliers and reduction trees, as
well as modules that can selectively execute the ReLU and
Batch Normalization functions. Each PE processes
only one kernel at a time.
Different from some previous work using a line buffer,
we implement the 3×3 convolution in PE_33 more efficiently,
as shown in Figure 4. The input image is divided into
three parts according to row number and stored in three
IARAMs. During the computation, inputs in three consecutive
rows can be fetched from different IARAMs simultaneously.
Compared to the line buffer implementation, this reduces
data-preparation time and register consumption. Besides,
for the 3×3 convolution with stride=2, each IARAM can
provide higher bandwidth to support strided ("jump") connections
to the registers, as shown in Figure 4(c). Therefore, only the
necessary calculations are performed, which can achieve a
4× speedup over the original convolution based on a classic
line buffer.

Figure 4. The implementation of 3×3 convolution with different
strategies: (a) Line buffer convolution; (b) Our implementation of
3×3 convolution in PE_33 with stride=1; (c) Our implementation of 3×3
convolution in PE_33 with stride=2.
Because a depthwise convolutional layer has fewer data
dependencies, it can be fused with its adjacent layers in a
pipelined manner to speed up computation. With this insight,
we introduce two types of cascaded PEs into the
architecture of our accelerator, which can be summarized as
follows.
• PE_33, PE_DW. The results of the 3×3 convolution can
be sent to PE_DW directly. Different from PE_33,
PE_DW processes data with a line buffer to accommodate
the continuous inflow of data. This works
in conjunction with our column-prior tiling strategy
to reduce the consumption of registers, which we
present in Section 4.3.
• PE_11, PE_DW. Similarly, 1×1 convolution and
depthwise convolution can also be processed in a fused
manner. During computation, input activations are
fetched from one of the three input buffers, and the results
of the 1×1 convolution are sent to PE_DW immediately
and processed on the fly. The final results are written
back to the corresponding output buffer.
Figure 5. LUT consumption of different implementations of
MACs. MAC-Shift (activation 4b/weight 3b) is our implementation
of multipliers using shift operations, while MAC-Base (4b/4b)
is direct multiplication and MAC-IP (4b/4b) is multiplication
using the Xilinx IP. Reduction trees are also included in all three cases.
Note that if we use multipliers, we have to use 4b/4b inputs in
order to represent numbers from -4 to 4. [Bar chart: y-axis LUT (0 to
4000); x-axis number of multipliers (16, 32, 64, 128); series MAC-Shift,
MAC-Base, MAC-IP. Extracted bar values for 128/64/32/16 multipliers:
3,537/1,746/867/428; 3,409/1,682/835/412; 2,146/1,043/516/253.]

Table 2. Notation for tiling strategy and dataflow.
Variable  Description
$W_T$     width (columns) of a tile of feature maps
$H_T$     height (rows) of a tile of feature maps
$K_T$     parallelism on the output channel dimension
$C_T$     parallelism on the input channel dimension
$N_k$     number of tiles along the filter dimension
$N_c$     number of tiles along the channel dimension

As mentioned in Section 2, the activations and weights of the
network are quantized to low bits. Specifically, the weights
are quantized to powers of two, which enables us to replace
multipliers with shift operators. Compared with normal
multiplication, this reduces resource and power consumption.
We conduct an experiment to verify the benefits of these
shift-based multipliers, which shows that shift-based
multiplication can reduce LUT usage by approximately
40%, as shown in Figure 5.
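The replacement of multipliers by shift operators can be sketched in software as follows. The weight encoding used here, a sign in $\{-1, 0, +1\}$ (with 0 meaning a zero weight) plus a shift amount, is a hypothetical encoding chosen for readability; the actual 3-bit hardware encoding is not specified in the text. The names `shift_mul` and `shift_mac` are likewise illustrative.

```python
from typing import Sequence, Tuple


def shift_mul(act: int, w_sign: int, w_exp: int) -> int:
    """Multiply an integer activation by the weight w_sign * 2**w_exp using a shift."""
    if w_sign == 0:  # zero weight contributes nothing
        return 0
    product = act << w_exp  # multiplication by 2**w_exp
    return product if w_sign > 0 else -product


def shift_mac(acts: Sequence[int], weights: Sequence[Tuple[int, int]]) -> int:
    """Accumulate shift-based products, mimicking a PE's reduction tree."""
    return sum(shift_mul(a, s, e) for a, (s, e) in zip(acts, weights))
```

For 3-bit weights the shift amount is at most 2, so each product of a 4-bit activation fits in a few extra bits before entering the reduction tree.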
4.3. Column-Prior Tiling Strategy
Under the limited on-chip resources, tiling is necessary
to map convolutional layers to the accelerator. We adopt a
column-prior tiling strategy, as shown in Figure 6, which
can reduce both latency and register consumption. We take
a feature map with size 256×256 as an example, which is
expected to be divided into two parts to fit into the limited
on-chip buffers. As for the row-prior manner, a tile with
size 128×256 is generated after 1×1 convolution and can
be sent to PE DW immediately for the processing of depth-
wise convolution. In this situation, at least 2×256+3 = 515
registers are required for applying line buffer convolution.
However, if the feature maps are divided into the size of
256×128 in a column-prior manner, only 2×128+3 = 259
registers are needed. Thus register consumption can be ap-
proximately halved. Similarly, invalid cycles caused by fill-
ing registers are also reduced, which will also be beneficial
to latency and efficiency.
Figure 6. A particular case of the column-prior tiling strategy applied
to 1×1 and depthwise convolution (stride=1). The input feature
maps are divided into three tiles and transferred from DDR to
on-chip BRAMs sequentially. As shown in the middle of the figure,
extra feature columns from adjacent tiles are necessary. [Diagram: tiles of
widths $T_{c1}$, $T_{c2}$, $T_{c3}$ are extended to $T_{c1}+1$, $T_{c2}+2$,
$T_{c3}+1$ columns (Conv 1x1 stage) and yield output tiles of widths
$T_{c1}$, $T_{c2}$, $T_{c3}$ after the Conv 3x3 DW stage.]

Algorithm 1: Output stationary dataflow for 1×1 convolution.
for h = 0 : H_T − 1 do
  for w = 0 : W_T − 1 do
    for nk = 0 : N_k − 1 do
      for nc = 0 : N_c − 1 do
        Parallel for k = nk·K_T : (nk+1)·K_T − 1 do
          Parallel for c = nc·C_T : (nc+1)·C_T − 1 do
            P = I[c][w][h] ∗ W[k][c];
            O[k][w][h] = O[k][w][h] + P;
      // Partial sums stay stationary in the PEs until a valid output is obtained;
      // Send O[nk·K_T : (nk+1)·K_T − 1][w][h] to the output buffer;

Algorithm 2: Weight stationary dataflow for 1×1 convolution.
for nk = 0 : N_k − 1 do
  for nc = 0 : N_c − 1 do
    // Fetch weights from the weight buffer and keep them stationary in the PEs;
    for h = 0 : H_T − 1 do
      for w = 0 : W_T − 1 do
        Parallel for k = nk·K_T : (nk+1)·K_T − 1 do
          Parallel for c = nc·C_T : (nc+1)·C_T − 1 do
            P = I[c][w][h] ∗ W[k][c];
            O[k][w][h] = O[k][w][h] + P;
    // Keep partial sums in the Inter buffer;
  // Send O[nk·K_T : (nk+1)·K_T − 1][:][:] to the output buffer;

Since the feature maps are divided into several tiles by
column index, overlap between adjacent tiles is introduced.
Suppose that we want to obtain output tiles with
five valid columns after 1×1 and depthwise convolution
(stride=1); then the input tiles should contain seven valid
values in each row. During the processing, a column of input
features from the last tile is needed.
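The arithmetic behind the column-prior strategy can be checked with two small helpers. The first reproduces the register counts quoted above for a 3×3 line buffer (two full tile rows plus three values), and the second gives the number of input columns a tile needs for a given number of valid output columns. The function names and the generalization to an arbitrary kernel size and stride are assumptions; the text only states the 3×3, stride-1 numbers.

```python
def line_buffer_registers(tile_width: int, kernel: int = 3) -> int:
    """Registers for a line buffer: (kernel - 1) full tile rows plus `kernel` values."""
    return (kernel - 1) * tile_width + kernel


def input_columns(valid_out_cols: int, kernel: int = 3, stride: int = 1) -> int:
    """Input columns a tile must hold to produce `valid_out_cols` output columns."""
    return (valid_out_cols - 1) * stride + kernel


row_prior = line_buffer_registers(256)     # 256-wide tile: 2 * 256 + 3 = 515
column_prior = line_buffer_registers(128)  # 128-wide tile: 2 * 128 + 3 = 259
halo_example = input_columns(5)            # five valid output columns need seven inputs
```

The two extra columns in `halo_example` are the overlap with neighboring tiles that the column-prior strategy has to load in addition to the tile itself.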
4.4. Hybrid Dataflow
Although the column-prior tiling strategy is utilized for the
efficiency of the accelerator, the on-chip buffer requirement
and memory accesses depend heavily on the dataflow
of the computation [1, 2]. Output stationary and
weight stationary are the most commonly used dataflows
in previous designs. Algorithms 1 and 2 illustrate the two
dataflows, respectively, with the parameters listed in
Table 2.
• Output stationary dataflow. Input activations and
weights are fed into the PE array continuously, and
the partial sums are held in the PEs until the final results
are available. These final results are either passed to
PE_DW for the following computation or stored in the
IARAMs/OARAM. Since each output is completed only
after all weights in a filter have been used, higher
bandwidth is required for weight transmission. In
addition, because a weight buffer is implemented,
there are more opportunities for weights to be reused.
• Weight stationary dataflow. Each PE holds part of the
weights for reuse until the computation with the
input activations in the corresponding channels is finished.
The partial sums generated in each PE are stored in the
Inter RAM. Only when the kernel group is completed can
the final results be sent to the IARAMs/OARAM. In this
way, weights can be reused as much as possible, but
the accelerator requires additional storage, i.e., the Inter
RAM.
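The two loop orders of Algorithms 1 and 2 can be compared with a small software model. The sketch below runs a 1×1 convolution in both orders and counts how often a $K_T \times C_T$ weight tile is brought into the PEs. This counter is an illustrative proxy introduced here, not a metric reported in the paper, and the function names are hypothetical. Indexing follows the algorithms: `I[c][w][h]`, `W[k][c]`, `O[k][w][h]`.

```python
def conv1x1_output_stationary(I, W, KT, CT):
    """Algorithm 1 order: finish each output pixel while weights stream through."""
    C, WT, HT, K = len(I), len(I[0]), len(I[0][0]), len(W)
    O = [[[0] * HT for _ in range(WT)] for _ in range(K)]
    weight_tile_loads = 0
    for h in range(HT):
        for w in range(WT):
            for k0 in range(0, K, KT):
                for c0 in range(0, C, CT):
                    weight_tile_loads += 1  # weight tile enters the PEs again
                    for k in range(k0, min(k0 + KT, K)):      # parallel in hardware
                        for c in range(c0, min(c0 + CT, C)):  # parallel in hardware
                            O[k][w][h] += I[c][w][h] * W[k][c]
    return O, weight_tile_loads


def conv1x1_weight_stationary(I, W, KT, CT):
    """Algorithm 2 order: hold one weight tile and sweep the whole feature-map tile."""
    C, WT, HT, K = len(I), len(I[0]), len(I[0][0]), len(W)
    O = [[[0] * HT for _ in range(WT)] for _ in range(K)]  # plays the Inter RAM role
    weight_tile_loads = 0
    for k0 in range(0, K, KT):
        for c0 in range(0, C, CT):
            weight_tile_loads += 1  # weight tile loaded once, then kept stationary
            for h in range(HT):
                for w in range(WT):
                    for k in range(k0, min(k0 + KT, K)):
                        for c in range(c0, min(c0 + CT, C)):
                            O[k][w][h] += I[c][w][h] * W[k][c]
    return O, weight_tile_loads
```

Both orders produce identical outputs; they differ in that the output stationary order touches weight tiles $H_T \cdot W_T \cdot N_k \cdot N_c$ times while the weight stationary order touches them $N_k \cdot N_c$ times, at the cost of keeping partial sums for the whole tile.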
Although our accelerator is specialized for a compact
detection network, different convolutional layers (1×1
convolution and depthwise convolution) still present
heterogeneous properties (e.g., width, height, and channel size).
The feature maps near the input are relatively
large. Thus these layers require more on-chip buffer
to store the activations, while the weights require less
storage. In this case, there are more opportunities for weights
to be reused, which makes the output stationary
dataflow more suitable. However, in the deeper layers, weights
become much more memory intensive, because the output stationary
dataflow needs to fetch all the weights of a kernel into the PE
to calculate each output. If the weight buffer cannot
accommodate those weights, they have to be fetched
multiple times during the processing, leading to higher
energy consumption. In other words, a larger weight
buffer would be needed to reuse the weights.
Therefore, we adopt a hybrid dataflow that strikes a
balance between weight reuse and weight buffer
requirements to get the best performance and energy efficiency
on the resource-limited computing platform. In most of the early
layers, we adopt the output stationary dataflow. Thus, all the
weights of a kernel group can be reused in the weight buffer,
and they are fetched from WRAM only once during the
processing of a layer. As the network goes deeper, the
weight stationary dataflow is adopted instead, so that the
weight buffer requirement is significantly
reduced with only a small Inter RAM overhead.
With the help of the Co-Processor, our accelerator is flexible
enough to support these two types of dataflow according to
the size of the kernels.
Table 3. Configurations of each type of PE and the overall
resource utilization on the Ultra96 development board.
Parameters  $K_T$  $C_T$  Precision (A/W/O)  Operations
PE_33       8      3      8/3/4 bits         Conv 3×3
PE_11       16     32     4/3/4 bits         Conv 1×1
PE_DW       16     16     4/3/4 bits         Conv 3×3 DW
PE_HEAD     2      32     4/8/16 bits        Conv 1×1

Resource  Available  Used    Utilization
LUT       70560      50485   71.55%
FF        141120     74174   52.56%
BRAM      216        178.50  82.86%
DSP       360        83      23.06%
5. Experiments
We implement our solution on the Ultra96 development
board with a Xilinx Zynq UltraScale+ MPSoC. The accelerator
runs at a frequency of 215 MHz with clock gating for each
type of PE. Power is measured with a power
monitor. We measured a power of approximately 6.9W on
the Ultra96 when processing the detection task with an
image size of 512 × 512. The configurations of each type of
PE and the overall resource utilization are shown in Table 3,
in which we also list the supported precision of activations
(A), weights (W), and outputs (O), respectively. It shows
that less than 25% of the total on-chip DSPs are used on the
FPGA, since most of the multiplications are implemented
as shift operations using LUTs. Most of the registers are
used for the weight buffer, while BRAMs are mainly used for
the data buffers and the WRAM. With the limited programmable
resources on the Ultra96 board, the whole system reaches an
inference speed of 18 fps. Results are reported while the
system is detecting objects from a video. Table 4 shows the
specification of the entire system.
Although the FPGA undertakes most of the computation in the
detection algorithm, we find that pre-processing and
post-processing on the CPUs still account for most of the inference
time, as shown in Figure 7 (a). In order to overcome the
bottleneck of CPU execution, we adopt pipelined task
management with multithreading. In this way, the total
latency is reduced, and the FPGA layers dominate the
inference time, as shown in Figure 7 (b).
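The pipelined task management can be sketched with one thread per stage and bounded FIFO queues between stages, so that pre-processing of the next frame overlaps with the accelerator and post-processing of earlier frames. This is a generic software pipeline written with Python's standard `threading` and `queue` modules. The function name `run_pipeline`, the queue depth, and the stage functions passed in are hypothetical stand-ins, not the thread assignment used in our system.

```python
import queue
import threading


def run_pipeline(frames, stages, depth=2):
    """Run each stage function in its own thread; frames flow through FIFO queues."""
    stop = object()  # sentinel that marks the end of the stream
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]

    def worker(fn, q_in, q_out):
        while True:
            item = q_in.get()
            if item is stop:
                q_out.put(stop)  # propagate shutdown to the next stage
                return
            q_out.put(fn(item))

    def feed():
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(stop)

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=worker, args=(fn, queues[i], queues[i + 1]))
                for i, fn in enumerate(stages)]
    for t in threads:
        t.start()

    results = []
    while True:  # the caller's thread drains the last queue
        item = queues[-1].get()
        if item is stop:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Because each stage has a single worker and the queues are FIFO, frames leave the pipeline in the order they entered. A slow stage, such as the softmax layer, could be given several workers at the cost of reordering logic, which this sketch omits.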
Thread assignments are determined empirically. Figure
7(c) presents the detailed time breakdown of each layer.
The latency can vary greatly depending on the input image,
because the number of objects within an image varies
significantly and thus influences the computational complexity
of the post-processing phase. Therefore, the time breakdown
in Figure 7 is obtained by averaging over a batch of images.
As shown in the figure, the softmax layer is the most
time-consuming among all the layers, while the data forward
layer and the visualization layer account for 34% of the latency.
Note that in a real-world application such as ADAS,
the detection results are used as part of the control system,
in which visualization may not be necessary. In this situation,
the latency of the CPUs can be further reduced, pushing
the system frame rate towards the maximum.

Table 4. System specification.
Device                  Ultra96 development board
Network                 customized MobileNet-SSD
Quantization            activation 4b / weight 3b
Power                   6.9 W
Frame rate              18 fps
Accelerator frame rate  27 fps
mAP on VOC 2012         66.4
Figure 8 shows a demo of our proposed object detection
system. As we can see, the measured power is around
6.9W, with slight fluctuations as the detected image
changes. Most of the targets (e.g.,
pedestrians and cars) are correctly detected, and the frame rate
of the FPGA layers is around 25-30 fps.
Table 5. Comparison with other accelerators.
                          VGG ACC [10]  Low-Bit [6]   Synetgy [11]  Ours
Precision (A/W)           16/16 bits    2/1 bits      4/1 bits      4/3 bits
Platform                  Zynq XC7Z045  Zynq XC7Z020  Zynq ZU3EG    Zynq ZU3EG
Frequency (MHz)           150           200           250           215
Network                   VGG-16        DoReFa        ShuffleNetV2  MobileNet
Classification Top-1 Acc  64.64%        46.10%        68.47%        68.1%
Performance (GOPs)        136.97        410.22        47.09∼418     202.76
As shown in Table 5, we also compare our accelerator
against previous works. Since the previous works are
mainly designed for image classification, we also evaluate
the performance of our customized MobileNet on the ImageNet
classification task. Compared with VGG ACC, which is
implemented with 16-bit integers, our design achieves
better performance and accuracy even on a smaller FPGA.
Low-Bit is implemented with lower bit widths, which leads to
severe accuracy degradation. Synetgy uses shift operations
to replace the spatial convolutions. It achieves high
accuracy with lower bit widths, i.e., 4-bit activations and 1-bit
weights, but its performance ranges from 47.09 to 418 GOPs. Our
accelerator achieves more stable
performance with comparable accuracy.
Figure 7. (a) Time breakdown before pipeline. (b) Time breakdown after
pipeline. (c) Detailed time breakdown of each layer before
pipeline. Figures (a) and (b) demonstrate that with the heterogeneous
pipeline, the overall latency is reduced, so the proportion of the FPGA
layers becomes larger and dominates the latency. Figure (c) shows that
the data forward layer and the mbox-conf-softmax layer are the most
time-consuming layers and require more threads to process. [Pie charts:
panels (a) and (b) split the time between fpga and others, with shares of
73% and 27% in (a) and 35% and 65% in (b); panel (c) covers data forward,
encode, fpga, decode, mbox_conf_reshape, mbox_conf_softmax,
mbox_conf_flatten, detection, and visualization.]

Figure 8. A demo of our proposed object detection system. [Photo showing
the Ultra 96 Board and the Power Monitor.]

6. Conclusion
In this paper, we present a system-level solution for
object detection on a heterogeneous embedded system. We
quantize the compact detection network to low bits, which
allows us to replace multiplications with efficient shift
operations. A dedicated CNN accelerator is implemented to
carry out the convolution computation. In order to support
arbitrary sizes of input feature maps under limited resources,
we adopt a column-prior tiling strategy to map the
convolutional layers to the accelerator. Compared to the row-prior
tiling strategy, it reduces both register consumption and
latency. According to the heterogeneous properties of
different layers, we provide a hybrid dataflow, with which we
can flexibly reuse the partial sums or filter weights.
Multithreading is also exploited to accelerate the pre-processing and
post-processing. We believe that such an efficient and
low-energy system can play a role in IoT applications.
Acknowledgment
This work was supported by the Strategic Priority Research
Program of Chinese Academy of Sciences (Grant
No. XDB32050200) and National Natural Science
Foundation of China (Grant Nos. 61972396, 61906193).
References
[1] Y.-H. Chen, J. S. Emer, and V. Sze. Eyeriss: A spatial architecture
for energy-efficient dataflow for convolutional neural
networks. In ISCA, 2016.
[2] Y.-H. Chen, T.-J. Yang, J. S. Emer, and V. Sze. Eyeriss v2:
A flexible accelerator for emerging deep neural networks on
mobile devices. IEEE Journal on Emerging and Selected
Topics in Circuits and Systems, 9:292–308, 2018.
[3] F. Chollet. Xception: Deep learning with depthwise separable
convolutions. In CVPR, 2016.
[4] D. Gudovskiy and L. Rigazio. ShiftCNN: Generalized
low-precision architecture for inference of convolutional neural
networks. arXiv preprint arXiv:1706.02393, 2017.
[5] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient
convolutional neural networks for mobile vision applications.
arXiv preprint arXiv:1704.04861, 2017.
[6] L. Jiao, C. Luo, W. Cao, X. Zhou, and L. Wang. Accelerating
low bit-width convolutional neural networks with embedded
FPGA. In FPL, 2017.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet
classification with deep convolutional neural networks. In
NIPS, 2012.
[8] E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S.
Wong. LogNet: Energy-efficient neural networks using
logarithmic computation. In ICASSP, 2017.
[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed,
C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector.
In ECCV, 2016.
[10] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu,
T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang. Going
deeper with embedded FPGA platform for convolutional
neural network. In FPGA, 2016.
[11] Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella,
M. Blott, L. Lavagno, K. Vissers, J. Wawrzynek,
and K. Keutzer. Synetgy: Algorithm-hardware co-design for
ConvNet accelerators on embedded FPGAs. In FPGA, 2019.
[12] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental
network quantization: Towards lossless CNNs with
low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
