Object detection has made impressive progress in recent years with the help of deep learning. However, state-of-the-art algorithms are both computation- and memory-intensive. Though many lightweight networks have been developed to trade off accuracy against efficiency, it is still a challenge to make them practical on an embedded device. In this paper, we present a system-level solution for efficient object detection on a heterogeneous embedded device. The detection network is quantized to low bits, which allows an efficient implementation with shift operators. To make the most of the benefits of low-bit quantization, we design a dedicated accelerator with programmable logic. Inside the accelerator, a hybrid dataflow is exploited according to the heterogeneous properties of different convolutional layers. We adopt a straightforward but resource-friendly column-prior tiling strategy to map the computation-intensive convolutional layers to the accelerator, which can support arbitrary feature sizes. Other operations are performed on the low-power CPU cores, and the entire system is executed in a pipelined manner. As a case study, we evaluate our object detection system on a real-world surveillance video with an input size of 512×512. The system achieves an inference speed of 18 fps at a cost of 6.9W (with display), with an mAP of 66.4 verified on the PASCAL VOC 2012 dataset.
Introduction
Since AlexNet [7] won the 2012 large-scale image recognition contest, Deep Convolutional Neural Networks (DCNNs) have shown increasing performance in various computer vision tasks. The impressive performance of CNNs is mainly due to their high complexity and capacity, in other words, the great number of parameters and computations. Therefore, high-performance hardware such as GPUs (or GPU clusters) is often utilized for acceleration. However, for embedded and mobile devices such as drones, security cameras, and smart glasses, GPU-based solutions are not the best choice due to limitations on volume and power consumption. In addition, modern GPUs, which are designed for general-purpose processing, are not flexible enough to handle low-bit integer values of fewer than 8 bits without effort spent on tuning the code. As a result, FPGA-based accelerators have been gaining popularity in recent years in both the industrial and academic communities.
* These authors contributed equally.
As for memory efficiency, we find that the advantages of the recent depthwise convolution [3, 5] are apparent. Unlike traditional convolution, in depthwise convolution, each output feature map relies solely on a single input feature map in the previous layer, which dramatically reduces the amount of computations and the demand of on-chip storage. In terms of resource and energy efficiency, recent logarithmic computation [4, 12, 8] has shown its promise. It quantizes the weight as power-of-two in order to efficiently translate multiplication into bit shift operation, which can get rid of the limitation of insufficient on-chip DSP blocks.
Considering the advantages of depthwise convolution and logarithmic computation mentioned above, we put forward an end-to-end hardware-software co-design for low-power object detection on resource-constrained FPGAs. Our proposed solution can achieve relatively high performance under an extremely low resource budget while retaining considerable accuracy. The contributions of this work can be summarized as follows:
• We propose a dedicated object detection accelerator for a customized MobileNet-SSD [9, 5] algorithm through software-hardware co-design. Specifically, we quantize the activations and weights to 4-bit integers and 3-bit power-of-two integers respectively, and present a fused-layer architecture with shift-based processing elements.
• We adopt a column-prior strategy to map the detection network to the accelerator, which can reduce resource consumption. Besides, a hybrid dataflow is introduced to reuse output or weights according to the heterogeneous property of different layers.
• We highlight the entire pipeline of our heterogeneous system design, including the hardware accelerator, host processing, and thread management of the main processor, and describe each stage in detail.
• We verify the performance of our design on the heterogeneous Ultra96 SoC, a device that targets IoT applications. Experiments show that the entire system can reach an inference speed of 18 fps at a cost of around 6.9W.
The rest of the paper is organized as follows. Section 2 describes the quantization algorithm, with which we quantize weights to powers of two and enable resource-friendly shift-based multiplications. Section 3 briefly presents the overall system architecture. Section 4 introduces the architecture of the dedicated accelerator, including Processing Elements (PEs), tiling strategy, and dataflow. Section 5 reports the experimental results as well as multi-thread management on low-power CPUs.
Quantization
To make the CNN model compatible with our hardware architecture design, we introduce a three-step quantization method, i.e., uniform activation quantization, power-of-two weight quantization, and scale quantization, as illustrated in Figure 1. It is worth noting that through the proposed three-step quantization, all computation can be transformed into fixed-point operations, without any floating-point values.
Uniform Activation Quantization
For M-bit activation quantization, we want to quantize all the positive activations into the set $A = \{0, 1, 2, \cdots, 2^M - 1\}$. As with many other fixed-point quantization methods, we also introduce a scaling factor α to lower the quantization error, turning the quantization set into
$$\alpha A = \{0, \alpha, 2\alpha, \cdots, (2^M - 1)\alpha\}$$
To turn all activations into fixed-point numbers, we quantize each floating-point activation to the nearest point in the scaled set. The $2^M - 1$ quantization thresholds can be set to the midpoints of two successive quantized values:
$$t_i = \frac{(2i - 1)\alpha}{2}, \quad i = 1, 2, \cdots, 2^M - 1$$
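Thus the quantization function $Q_a$ can be formulated as
$$Q_a(x) = \begin{cases} 0, & x \le t_1 \\ i\alpha, & t_i < x \le t_{i+1},\ i = 1, \cdots, 2^M - 2 \\ (2^M - 1)\alpha, & x > t_{2^M - 1} \end{cases}$$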
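As a minimal sketch of the nearest-level rule described above, the following Python function maps one activation to an integer code and its dequantized value. The function name and the default `m_bits=4` are illustrative assumptions, and ties that fall exactly on a threshold are resolved upward here, which the text does not specify.

```python
def quantize_activation(x, alpha, m_bits=4):
    """Quantize x to the nearest level in alpha * {0, 1, ..., 2^M - 1}.

    Returns (code, value): the integer code and the dequantized value.
    """
    max_code = (1 << m_bits) - 1
    # Thresholds lie midway between successive levels, so nearest-level
    # quantization equals rounding x / alpha, then clipping to the code range.
    code = int(x / alpha + 0.5) if x > 0 else 0
    code = min(code, max_code)
    return code, code * alpha


print(quantize_activation(1.3, 0.5))    # level 3, i.e. 1.5
print(quantize_activation(100.0, 0.5))  # saturates at the top code 15
```

Only the integer code needs to be stored and passed to the next layer; the scaling factor is handled separately.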
Power-of-Two Weight Quantization
For weights, we utilize power-of-two quantization. In this way, the floating-point multiplications within the convolution can be transformed into shift operations, which can dramatically lower the complexity of the CNN and the hardware design. The 4-D weight tensor consists of n kernels of size w × h × c, which are quantized using different scaling factors. More specifically, the 4-D tensor $W \in \mathbb{R}^{w \times h \times c \times n}$ is reshaped into a matrix $W \in \mathbb{R}^{(w \cdot h \cdot c) \times n}$, where each column $w_i \in \mathbb{R}^{w \cdot h \cdot c}$ corresponds to a 3-D kernel. To lower the quantization error, a floating-point scaling factor $\beta_i$ is introduced for each kernel $w_i$, i.e., for N-bit quantization, the problem is to select weight values from the set
$$\beta_i \times \{0, \pm 2^0, \pm 2^1, \cdots, \pm 2^{2^{N-1} - 2}\}$$
Here we also use nearest quantization, and the $2^N - 2$ quantization thresholds can again be determined by the midpoints of two successive quantized values, as in the activation quantization.
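The following Python sketch illustrates nearest power-of-two quantization for a single weight. It assumes the level set contains zero and signed powers of two up to $2^{2^{N-1}-2}$, so that N = 3 gives {0, ±1, ±2, ±4}, which matches the range of -4 to 4 mentioned later for the processing elements. The function names are hypothetical.

```python
def pow2_levels(n_bits=3):
    """Assumed level set: 0 and signed powers of two up to 2^(2^(N-1) - 2)."""
    max_exp = 2 ** (n_bits - 1) - 2
    positive = [2 ** e for e in range(max_exp + 1)]
    return sorted([-p for p in positive] + [0] + positive)


def quantize_weight(w, beta, n_bits=3):
    """Return the level nearest to w / beta (the fixed-point weight)."""
    levels = pow2_levels(n_bits)
    return min(levels, key=lambda q: abs(w / beta - q))


print(pow2_levels(3))              # seven levels, hence six thresholds
print(quantize_weight(0.35, 0.1))  # 3.5 is nearest to the level 4
```

With seven levels there are six thresholds, which agrees with the $2^N - 2$ thresholds stated above for N = 3.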
Scale Quantization
With activation and weight quantization, the convolution can be performed with only fixed-point operations. However, the whole network still requires floating-point operations due to the introduced scaling factors, the bias term of the convolution, and some other layers such as Batch Normalization. To further eliminate the above-mentioned floating-point operations, we introduce the scale quantization, which consists of two parts:
Scale merge: For the l-th layer, the input activation X can be represented by $X = \alpha\hat{X}$, where $\hat{X}$ is the fixed-point version of X and α is the scaling factor. Similarly, $w = \beta\hat{w}$, where $\hat{w}$ is one of the fixed-point kernels. For simplicity, we discard the kernel index. Considering the Batch Normalization term, the convolutional layer can be represented by the following equation:
$$Y = \alpha'\hat{Y} = BN(X \otimes w) = \gamma\alpha\beta(\hat{X} \otimes \hat{w}) + b \qquad (3)$$
where Y is the output activation, $\hat{Y}$ is the fixed-point version of the output activations, and $\alpha'$ is the scaling factor for the outputs. $BN(x) = \gamma x + b$ is the batch normalization layer and ⊗ is the convolution.
To further merge out the output scaling factor, we can divide both sides of Eq. 3 by $\alpha'$, resulting in the following equation:
$$\hat{Y} = a'(\hat{X} \otimes \hat{w}) + b', \quad a' = \frac{\gamma\alpha\beta}{\alpha'}, \quad b' = \frac{b}{\alpha'} \qquad (4)$$
Note that the activation quantization function needs to be changed accordingly. By defining $\hat{t}_i = t_i / \alpha'$, the new quantization function becomes:
$$\hat{Q}_a(x) = \mathrm{clip}(\mathrm{round}(x), 0, 2^M - 1)$$
where round(x) is the rounding operation, and clip(x, u, v) clips x within u and v.
Scale quantization: In Eq. 4, only $a'$ and $b'$ are floating-point values. Note that Eq. 4 only corresponds to one 3-D kernel; for the whole convolutional layer, there are n pairs of $a'$ and $b'$, denoted by $\mathbf{a}'$ and $\mathbf{b}'$. In the scale quantization, we need to quantize these values into fixed-point numbers.
During the scale quantization, no scaling factors can be incorporated. However, directly quantizing $\mathbf{a}'$ and $\mathbf{b}'$ would introduce a large quantization error. Here we search for the binary point position d, so that the values are quantized to fixed-point numbers with a step of $2^d$. More specifically, as d goes from 0 to -15, we find the best d that minimizes the quantization error for $\mathbf{a}'$ and $\mathbf{b}'$.
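The binary point search can be sketched in Python as below. This is an illustration under stated assumptions rather than the exact procedure: the signed integer width `bits=16` and the squared-error criterion are assumptions, and `quantize_scale` is a hypothetical name. Only the search range of d from 0 to -15 comes from the text.

```python
def quantize_scale(values, bits=16, d_range=range(0, -16, -1)):
    """Search the binary point d that minimizes the squared quantization error.

    Each value v is stored as an integer code c with v ~= c * 2^d.
    `bits` is the assumed signed width of the stored integer.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    best = None
    for d in d_range:
        step = 2.0 ** d
        # Round to the nearest multiple of the step, then clip to the integer range.
        codes = [max(lo, min(hi, round(v / step))) for v in values]
        err = sum((v - c * step) ** 2 for v, c in zip(values, codes))
        if best is None or err < best[0]:
            best = (err, d, codes)
    return best[1], best[2]


d, codes = quantize_scale([0.5, 0.25])
print(d, codes)  # first d in the search order that represents both values exactly
```

A smaller d gives finer resolution but a narrower representable range, which is why the clipping step matters for large scale values.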
Optimization
The optimization problem can be solved efficiently using Lloyd's algorithm. Taking the activation quantization problem of Section 2.1 as an example, during the assignment step, all activation data points are quantized to the nearest fixed-point values in the set A according to the quantization function $Q_a(x)$. In the update step, the new scaling factor can be obtained by solving a one-dimensional optimization problem:
$$\alpha^* = \arg\min_{\alpha} \|X - \alpha\hat{X}\|_2^2$$
where $\hat{X}$ denotes the fixed-point values assigned in the previous step.
By iterative quantization, we could find the optimal scaling factors as well as the quantized values. After the activation quantization and weight quantization, we need to fine-tune the whole network to restore accuracy.
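The alternation between assignment and update can be sketched as follows. This is a minimal illustration assuming a squared-error objective, for which the update step has the closed form $\alpha = \langle x, c \rangle / \langle c, c \rangle$ over the assigned codes c. The initial guess and the fixed iteration count are assumptions, and the input is assumed to contain at least one positive activation.

```python
def lloyd_activation_scale(xs, m_bits=4, iters=20):
    """Alternate nearest-level assignment and a least-squares update of alpha."""
    max_code = (1 << m_bits) - 1
    alpha = max(xs) / max_code  # simple initial guess (an assumption)
    codes = [0] * len(xs)
    for _ in range(iters):
        # Assignment step: nearest level in alpha * {0, ..., 2^M - 1}.
        codes = [min(max_code, max(0, int(x / alpha + 0.5))) for x in xs]
        denom = sum(c * c for c in codes)
        if denom == 0:
            break
        # Update step: closed-form minimizer of sum (x - alpha * c)^2.
        alpha = sum(x * c for x, c in zip(xs, codes)) / denom
    return alpha, codes


alpha, codes = lloyd_activation_scale([0.9, 2.1, 2.9], m_bits=2)
print(alpha, codes)
```

Each of the two steps cannot increase the squared error, so the iteration settles on a local optimum.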
Performance
The experiments are conducted on the ImageNet classification benchmark, and the results are shown in Table 1. The results illustrate that the three-step quantization approach incurs only a minimal accuracy drop compared with its floating-point counterpart.
System Architecture
Our detection network targets the Ultra96 development board, which is a heterogeneous embedded system containing both programmable logic and low-power CPU cores. A 2GB DDR4 memory is shared by the Programmable Logic (PL) and the Processing System (PS). Since convolutional layers dominate the inference time, we implement a dedicated CNN accelerator with the Programmable Logic.
The entire system includes the following functional layers:
• Data forward layer: decodes video streams.
• Encode layer: organizes data into the specific pattern required by the FPGA accelerator.
• FPGA layer: performs all convolutions on the dedicated accelerator.
• Decode layer: reorganizes the features extracted by the accelerator into the storage pattern used by the CPU.
• Mbox-conf-reshape layer: reshapes bounding boxes.
• Mbox-conf-softmax layer: softmax layers of the detection.
• Mbox-conf-flatten layer: reshapes data.
• Detection and visualization layer: generates the detection results and displays them on the screen.
All layers except the FPGA layer are executed on the CPU. All operations before the FPGA layer are referred to as pre-processing, while those after the FPGA layer are referred to as post-processing.
At the very beginning, images together with the weights and instructions of a specific CNN are stored in DDR. The CPU initiates a calculation request and transfers instructions to the accelerator through AXI. The accelerator receives the instructions and completes all convolution computation. Note that the accelerator has its own instruction set, and it can complete the calculations independently unless interrupted by exceptions. Results of the FPGA layer are sent back to the CPU for post-processing. Multi-threading is exploited to make the most of the 4 low-power ARM cores. The entire system works in a pipelined manner, and the system architecture is shown in Figure 2.
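The pipelined execution can be illustrated with a small Python sketch in which pre-processing, the accelerator call, and post-processing run in separate threads connected by bounded queues. This is only a structural illustration: `run_pipeline` and the three stage callables are hypothetical, and the `accelerate` callable merely stands in for a request to the FPGA layer.

```python
import queue
import threading


def run_pipeline(frames, preprocess, accelerate, postprocess):
    """Run three stages concurrently; frame order is preserved."""
    stop = object()  # sentinel that marks the end of the stream
    q_pre, q_acc, q_post = (queue.Queue(maxsize=2) for _ in range(3))
    results = []

    def worker(fn, src, emit):
        while True:
            item = src.get()
            if item is stop:
                emit(stop)  # propagate shutdown to the next stage
                return
            emit(fn(item))

    def collect(item):
        if item is not stop:
            results.append(item)

    threads = [
        threading.Thread(target=worker, args=(preprocess, q_pre, q_acc.put)),
        threading.Thread(target=worker, args=(accelerate, q_acc, q_post.put)),
        threading.Thread(target=worker, args=(postprocess, q_post, collect)),
    ]
    for t in threads:
        t.start()
    for frame in frames:
        q_pre.put(frame)
    q_pre.put(stop)
    for t in threads:
        t.join()
    return results


print(run_pipeline(range(5), lambda x: x + 1, lambda x: x * 2, lambda x: x - 1))
```

While one frame occupies the accelerator stage, the next frame can already be pre-processed and the previous one post-processed, so the slowest stage bounds the frame rate.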
Dedicated Accelerator
In this section, we first describe the overall architecture of our accelerator, which exploits multiple PEs for high computing parallelism. Then the design of the PEs is introduced. After that, the column-prior tiling strategy is presented to support arbitrary sizes of input feature maps under limited resources. Finally, a hybrid dataflow is proposed for higher efficiency.
Overall Architecture
Figure 3 shows the overall architecture of our accelerator with different types of PEs inside. The Co-Processor module controls the entire computation flow. It parses instructions to generate control information for the Memory Controller and the different kinds of PEs. The addresses of activations and weights are calculated by the Memory Controller, with which all kinds of data can be sent to the proper destinations. Prefetching is enabled since we implement a 4KB instruction cache inside the Co-Processor. Note that some cache features are unavailable in this design because they are unnecessary for a specific accelerator without branch and jump instructions. Controllers for the different types of PEs generate control signals according to the control information received from the Co-Processor. IARAM and OARAM are used to store the intermediate feature maps during computation. IARAM is implemented with three banks, providing sufficient bandwidth to complete the 3 × 3 convolution more efficiently. The IARAMs and OARAM can be logically swapped between the computation of two adjacent layers. We implement two-level weight caches (Weight buffer and WRAM) with on-chip registers and BRAMs, which can provide sufficient bandwidth for computing.
Processing Elements
The heterogeneous nature of 1×1 convolution and depthwise convolution can make sharing processing elements between them costly, so reusing PEs does not necessarily bring benefits and runs contrary to our original intention of designing a dedicated low-power accelerator. Therefore, PEs are specialized for different kinds of convolutional layers, i.e., 3×3 convolution (PE 33), 1×1 convolution (PE 11), and depthwise convolution (PE DW), in order to reduce control complexity and improve hardware efficiency. To efficiently compute the location offsets in the detection algorithm, a dedicated PE HEAD is necessary. Each type of PE is mainly composed of multipliers and reduction trees, as well as modules that can selectively execute the ReLU and Batch Normalization functions. Each PE processes only one kernel at a time.
Unlike some previous works that use line buffers, we implement 3×3 convolution in PE 33 more efficiently, as shown in Figure 4. The input image is divided into three parts according to row number and stored in three IARAMs. During the computation, inputs in three consecutive rows can be fetched from different IARAMs simultaneously. Compared to a line buffer implementation, this reduces data-preparation time and register consumption. Besides, for the 3×3 convolution with stride=2, each IARAM can provide higher bandwidth to support jump connection for the registers, as shown in Figure 4(b). Therefore, only the necessary calculations are performed, which can achieve a 4× speedup over the original convolution based on a classic line buffer.
Because the depthwise convolutional layer has fewer data dependencies, it can be fused with its adjacent layers in a pipelined manner to speed up computation. With this insight, we introduce two types of cascaded PEs to the architecture of our accelerator, which can be summarized as follows.
• PE 33, PE DW. The results of 3×3 convolution can be sent to PE DW directly. Different from PE 33, PE DW processes data with a line buffer to accommodate the continuous inflow of data. This works in conjunction with our column-prior tiling strategy to reduce the consumption of registers, which we will present in Section 4.3.
• PE 11, PE DW. Similarly, 1×1 convolution and depthwise convolution can also be processed in a fused manner. During computation, input activations are fetched from one of three input buffers, and the results of 1×1 convolution are sent to PE DW immediately and processed on the fly. The final results are written back to the corresponding output buffer.
As mentioned in Section 2, activations and weights of the network are quantized to low bits. Specifically, the weights are quantized to powers of two, which enables us to replace multipliers with shift operators. Compared with normal multiplication, this reduces resource and power consumption. We conduct an experiment to verify the benefits of these shift-based multipliers. Note that if we use multipliers, we have to use 4b/4b inputs in order to represent numbers from -4 to 4.
Table 2. Notation for tiling strategy and dataflow (columns: Variables, Descriptions).
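To illustrate why power-of-two weights remove the need for multipliers, the following Python sketch computes the product of a 4-bit activation and a weight in {0, ±1, ±2, ±4} with a shift and an optional negation. The (sign, exponent, is_zero) encoding is a hypothetical choice made for this example; the actual hardware encoding is not described in the text.

```python
def encode_weight(w):
    """Encode w in {0, ±1, ±2, ±4} as (sign, exponent, is_zero). Hypothetical encoding."""
    if w == 0:
        return 0, 0, True
    return int(w < 0), abs(w).bit_length() - 1, False


def shift_multiply(activation, sign, exponent, is_zero=False):
    """Multiply an unsigned activation by sign * 2^exponent using a left shift."""
    if is_zero:
        return 0
    product = activation << exponent
    return -product if sign else product


# The shift-based product matches ordinary multiplication for every input pair.
for a in range(16):
    for w in (-4, -2, -1, 0, 1, 2, 4):
        assert shift_multiply(a, *encode_weight(w)) == a * w
```

In hardware, the shift amount only selects wiring, so such a unit can be built from LUTs rather than DSP blocks.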
Column-Prior Tiling Strategy
Under the limited on-chip resources, tiling is necessary to map convolutional layers to the accelerator. We adopt a column-prior tiling strategy, as shown in Figure 6 , which can reduce both latency and register consumption. We take a feature map with size 256×256 as an example, which is expected to be divided into two parts to fit into the limited on-chip buffers. As for the row-prior manner, a tile with size 128×256 is generated after 1×1 convolution and can be sent to PE DW immediately for the processing of depthwise convolution. In this situation, at least 2×256+3 = 515 registers are required for applying line buffer convolution. However, if the feature maps are divided into the size of 256×128 in a column-prior manner, only 2×128+3 = 259 registers are needed. Thus register consumption can be approximately halved. Similarly, invalid cycles caused by filling registers are also reduced, which will also be beneficial to latency and efficiency.
Since the feature maps are divided into several tiles by column index, overlap between adjacent tiles is introduced. Suppose that we can obtain output tiles with five valid columns after 1×1 and depthwise convolution (stride=1); the input tiles should then contain seven valid values in each row. During the processing, a column of input features from the previous tile is needed.
Figure 6. A particular case of the column-prior tiling strategy applied to 1×1 and depthwise convolution (stride=1). The input feature maps are divided into three tiles and transferred from DDR to on-chip BRAMs sequentially. As shown in the middle of the figure, extra feature columns from adjacent tiles are necessary.
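The register counts and the tile overlap discussed above can be reproduced with a short Python sketch. The helper names are hypothetical, and the tiling helper assumes that the feature-map width divides evenly into tiles and that zero padding supplies the missing columns at the borders.

```python
def line_buffer_registers(row_length, kernel=3):
    """Registers for a line-buffer convolution: (k - 1) full rows plus k values."""
    return (kernel - 1) * row_length + kernel


def column_tiles(width, num_tiles, kernel=3):
    """Split columns [0, width) into tiles.

    Returns (in_start, in_end, out_start, out_end) per tile; input ranges are
    widened by the halo (kernel // 2) and clamped at the feature-map borders.
    """
    halo = kernel // 2
    base = width // num_tiles  # assumes width is divisible by num_tiles
    tiles = []
    for t in range(num_tiles):
        out_start, out_end = t * base, (t + 1) * base
        tiles.append((max(0, out_start - halo), min(width, out_end + halo), out_start, out_end))
    return tiles


print(line_buffer_registers(256))  # row-prior tile with 256-wide rows
print(line_buffer_registers(128))  # column-prior tile with 128-wide rows
print(column_tiles(15, 3))         # interior tile reads 7 columns for 5 outputs
```

The interior tile in the last call reads columns 4 to 10 to produce output columns 5 to 9, which is the five-output, seven-input case described above.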
Hybrid Dataflow
Although the column-prior tiling strategy is utilized for the efficiency of the accelerator, the on-chip buffer requirement and memory accesses depend heavily on the dataflow of computations [1, 2]. Output stationary and weight stationary are the most commonly used dataflows in previous designs. Algorithms 1 and 2 illustrate the two dataflows, respectively, where the parameters are shown in Table 2.
• Output stationary dataflow. Input activations and weights are fed into the PE array continuously, and the partial sums are held in PEs until the final results are available. These final results are either passed to PE DW for the following computation or stored in the IARAMs/OARAM. Since each output is completed after weights in a filter have been calculated, higher bandwidth is required for weight transmission. In addition, because of the implementation of weight buffer, there are more opportunities for weights to be reused.
• Weight stationary dataflow. Each PE holds part of the weights for reuse until it finishes the computation with the input activations in the corresponding channels. The partial sums generated in each PE are stored in the Inter RAM. Only when the kernel group is completed can the final results be sent to IARAMs/OARAM. In this way, weights can be reused as much as possible, but the accelerator requires additional storage, i.e., the Inter RAM.
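The difference between the two dataflows comes down to loop order, which the following Python sketch shows for a 1×1 convolution. It is a simplified software illustration and not a reproduction of Algorithms 1 and 2: the function names, the list-based layout (`x[c][p]` for channel c and pixel p, `w[k][c]` for kernel k), and the absence of tiling are all assumptions.

```python
def conv1x1_output_stationary(x, w):
    """Each output accumulates locally and is written once when complete."""
    out = [[0] * len(x[0]) for _ in w]
    for k in range(len(w)):
        for p in range(len(x[0])):
            acc = 0  # partial sum stays in the PE
            for c in range(len(x)):
                acc += w[k][c] * x[c][p]  # a new weight is needed every step
            out[k][p] = acc
    return out


def conv1x1_weight_stationary(x, w):
    """Each weight is held while all pixels stream past; partial sums are stored."""
    inter = [[0] * len(x[0]) for _ in w]  # plays the role of the Inter RAM
    for k in range(len(w)):
        for c in range(len(x)):
            weight = w[k][c]  # fetched once, reused for every pixel
            for p in range(len(x[0])):
                inter[k][p] += weight * x[c][p]
    return inter


x = [[1, 2], [3, 4]]
w = [[1, 0], [2, -1], [4, 4]]
assert conv1x1_output_stationary(x, w) == conv1x1_weight_stationary(x, w)
```

Both orders produce identical results; they differ in which operand is re-fetched in the innermost loop and in whether intermediate sums need extra storage.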
Although our accelerator is specialized for a compact detection network, different convolutional layers (1×1 convolution and depthwise convolution) still exhibit heterogeneous properties (e.g., width, height, and channel size). The dimensions of feature maps near the input are relatively large. Thus these layers require more on-chip buffer to store the activations, while weights require less storage. In this case, there are more opportunities for weights to be reused, which is more suitable for the output stationary dataflow. However, in the deeper layers, weights become much more memory-intensive, because the output stationary dataflow needs to fetch all the weights of a kernel to the PE to calculate each output. If the weight buffer cannot accommodate those weights, they must be fetched multiple times during the processing, leading to more energy consumption. In other words, we need a larger weight buffer to reuse weights.
Therefore, we consider a hybrid dataflow that strikes a balance between weight reuse and weight buffer requirements to get the best performance and energy efficiency on the resource-limited computing platform. In most of the early layers, we adopt the output stationary dataflow. Thus, all the weights of a kernel group can be reused in the weight buffer, and they are fetched from WRAM only once during the processing of a layer. The situation changes as the network goes deeper, where the weight stationary dataflow is adopted, so that the weight buffer requirement can be significantly reduced with only a small Inter RAM overhead.
With the help of the Co-Processor, our accelerator is flexible enough to support these two types of dataflow according to the size of the kernels.
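One way to picture the per-layer choice is a simple rule that checks whether the weights of a kernel group fit in the weight buffer. The Python sketch below is a hypothetical heuristic consistent with the reasoning above, not the selection logic of the actual Co-Processor; the function name, the `kernels_per_group` parameter, and the buffer capacities in the example are assumptions.

```python
def choose_dataflow(kernel_w, kernel_h, channels, weight_bits, buffer_bits, kernels_per_group=1):
    """Pick output stationary when a kernel group's weights fit the weight buffer."""
    group_bits = kernel_w * kernel_h * channels * weight_bits * kernels_per_group
    if group_bits <= buffer_bits:
        return "output_stationary"  # weights stay in the buffer and are reused
    return "weight_stationary"      # avoid re-fetching; pay with Inter RAM


# Early layer: few channels, so the 3-bit weights of a kernel are small.
print(choose_dataflow(1, 1, 32, 3, buffer_bits=4096))
# Deep layer: many channels, so one kernel exceeds the assumed buffer.
print(choose_dataflow(1, 1, 1024, 3, buffer_bits=2048))
```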
Experiments
We implement our solution on the Ultra96 development board with a Xilinx Zynq UltraScale+ MPSoC. The accelerator runs at a frequency of 215 MHz with clock gating applied to each type of PE. Power is measured with a power monitor. We measured a power of approximately 6.9W on the Ultra96 when processing the detection task with an image size of 512 × 512. The configurations of each type of PE and the overall resource utilization are shown in Table 3, in which we also list the supported precision of activations (A), weights (W), and outputs (O) respectively. It shows that less than 25% of the total on-chip DSPs are used, since most of the multiplications are implemented as shift operations using LUTs. Most of the registers are used as the weight buffer, while BRAMs are mainly used for the data buffers and the WRAM. With the limited programmable resources on the Ultra96 board, the whole system reaches an inference speed of 18 fps. Results are reported while the system is detecting objects from a video. Table 4 shows the specification of the entire system.
Although the FPGA undertakes most of the computation in the detection algorithm, we find that pre-processing and post-processing on the CPUs still account for most of the inference time, as shown in Figure 7(a). To overcome the bottleneck of CPU execution, we adopt pipelined task management with multi-threading. In this way, the total latency is reduced, and the FPGA layers dominate the inference time, as shown in Figure 7(b).
Thread assignments are conducted empirically. Figure 7(c) presents the detailed time breakdown of each layer. The latency can vary greatly depending on the input image, because the number of objects within an image varies significantly and thus influences the computational complexity of the post-processing phase. Therefore, the time breakdown in Figure 7 is obtained by averaging over a batch of images. As shown in the figure, the softmax layer is the most time-consuming among all the layers, while the data forward layer and the visualization layer account for 34% of the latency. Note that in a real-world application such as ADAS, the detection results are used as part of the control system, in which visualization may not be necessary. In this situation, the latency of the CPUs can be further reduced, pushing the system frame rate towards the maximum.
Figure 8 shows a demo of our proposed object detection system. As we can see, the measured power is around 6.9W, and there are slight fluctuations as the detected image changes. Most of the targets (e.g., pedestrians and cars) are correctly detected, and the frame rate of the FPGA layers is around 25-30 fps.
As shown in Table 5, we also compare our accelerator against previous works. Since the previous works are mainly designed for image classification, we also evaluate the performance of our customized MobileNet on the ImageNet classification task. Compared with VGG ACC, which is implemented with 16-bit integers, our design can achieve better performance and accuracy even on a smaller FPGA. Low-Bit is implemented with lower bits, which leads to severe accuracy degradation. Synetgy uses shift operations to replace the spatial convolutions. It can achieve high accuracy with lower bits, i.e., 4-bit activations and 1-bit weights. However, our accelerator can achieve more stable performance with comparable accuracy.
Conclusion
In this paper, we present a system-level solution for object detection on a heterogeneous embedded system. We quantize the compact detection network to low bits, which allows us to replace multiplications with efficient shift operations. A dedicated CNN accelerator is implemented to carry out the convolution computation. To support arbitrary sizes of input feature maps under limited resources, we adopt a column-prior tiling strategy to map the convolutional layers to the accelerator. Compared to a row-prior tiling strategy, it can reduce both register consumption and latency. According to the heterogeneous properties of different layers, we provide a hybrid dataflow, with which we can flexibly reuse the partial sums or filter weights. Multi-threading is also exploited to accelerate the pre-processing and post-processing. We believe that such an efficient and low-energy system can play a role in IoT applications.
