A Scalable Pipelined Dataflow Accelerator for Object Region Proposals on
  FPGA Platform by Fu, Wenzhi et al.
arXiv:1810.12137v1 [cs.DC] 26 Oct 2018
A Scalable Pipelined Dataflow Accelerator for Object Region
Proposals on FPGA Platform∗
Wenzhi Fu 1,3, Jianlei Yang 1,3, Pengcheng Dai 2,3, Yiran Chen 4 and Weisheng Zhao 2,3
1 School of Computer Science and Engineering, Beihang University, Beijing, 100191, China.
2 School of Electronic and Information Engineering, Beihang University, Beijing, 100191, China.
3 Fert Beijing Research Institute, BDBC, Beihang University, Beijing, 100191, China.
4 Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA.
{jianlei, weisheng.zhao}@buaa.edu.cn
ABSTRACT
Region proposal is critical for object detection, but it is usually
a bottleneck for computation efficiency on traditional control-flow
architectures. We observe that region proposal tasks are well suited
to pipelined parallelism through dataflow-driven acceleration. In this
paper, a scalable pipelined dataflow accelerator is proposed for
efficient region proposals on FPGA platforms. The accelerator processes
image data in a streaming manner through three sequential stages:
resizing, kernel computing and sorting. First, a Ping-Pong cache
strategy with rotation loading is adopted in the resizing module to
guarantee a continuous output stream. Then, a multi-pipeline
architecture with tiered memory is used in the kernel computing module
to complete the main computation tasks. Finally, a bubble-pushing heap
sort method is used in the sorting module to find the top-k largest
candidates efficiently. Our design is implemented with high-level
synthesis on FPGA platforms, and experimental results on the VOC2007
dataset show that it achieves about 3.67X speedup over a traditional
desktop CPU platform and >250X energy efficiency improvement over an
embedded ARM platform.
Keywords
Scalable pipeline, Dataflow accelerator, Region proposal, FPGA
platform, Streaming processing
1. INTRODUCTION
In recent years, deep neural networks have achieved great success in
single-object image classification [1]. However, they cannot be applied
directly to multi-object detection, which is much more practical in
real-world applications [2, 3]. Object detection first determines the
regions of interest that potentially contain objects (so-called region
proposal) and then performs image classification on these proposed
regions. Efficient real-time object detection is a prerequisite for
many kinds of unmanned systems, for which the energy efficiency of
intensive computation is critical. Many research works have focused on
hardware acceleration for image classification [4], but none has been
designed for region proposals, which pose a bottleneck for efficient
object detection. Recently, several computationally intensive
approaches have shown better performance [5]; however, it is not
practical to run them on embedded platforms with limited resources and
restricted power consumption. On the other hand, the binarized normed
gradients (BING) method has been proposed as an effective approach for
region proposal generation [6, 7, 8, 9], and it achieves a
state-of-the-art detection rate. However, even though several
techniques already exist to optimize the implementation of the BING
algorithm on CPUs [6], the control-flow processing style of the Von
Neumann architecture still makes it difficult for BING to run
efficiently, especially on embedded platforms.
∗This work was supported in part by the National Natural
Science Foundation of China (61602022, 61501013, 61571023,
61521091 and 1157040329), National Key Technology Program
of China (No. 2017ZX01032101) and the 111 Talent
Program B16001.
In this work, a scalable dataflow accelerator is proposed for the BING
algorithm, which has been reformulated in a dataflow-driven manner.
With the memory hierarchy, the efficiency of streaming processing is
greatly improved, which overcomes the shortcomings of traditional
platforms. Furthermore, the accelerator scales efficiently to larger
parallelism without synchronization problems.
The main contributions of this work are as follows:
(1) The BING algorithm is reformulated in a dataflow-driven manner to
obtain significant benefits from dataflow processing.
(2) A dataflow architecture is proposed for the BING algorithm, which
generates a continuous input stream and then processes it in a
streaming manner to fully exploit data locality and minimize the
amount of intermediate data.
2. REGION PROPOSAL WITH BING
The aim of region proposal is to find the regions of interest that
potentially contain objects with as few candidate windows as possible,
which is important in multi-object detection tasks. Previous works
[6, 7] have shown that a simple 8 × 8 feature obtained by computing
the normed gradients can serve as an efficient region proposal
feature. After binarizing the normed gradients feature, BING achieves
efficient objectness estimation. The original image is first resized
to different sizes with different preset resizing ratios, so that
region proposal candidates of different resolutions in the original
image can be represented uniformly by 8 × 8 windows in the resized
images. Then, SVM (Support Vector Machine) stage I is applied to
evaluate the confidence of each region proposal.
Figure 1: Demonstration of our proposed accelerator. (a) Accelerator framework. (b) Pipelined kernel computing driven by resized images.
Following that, the NMS (non-maximum suppression) stage is used to
reduce the overlap among the proposals. After that, the top-n largest
candidates are selected from the region proposals of each resized
image, and SVM stage II is performed to evaluate the confidence across
all of the resized images. Finally, the top-k largest candidates are
obtained as the final proposals by a sorting stage. Since the BING
algorithm has few branch or jump operations, it can be abstracted as a
data-driven algorithm, which is ideal for dataflow-driven acceleration.
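The relation between an 8 × 8 window in a resized image and a candidate region in the original image can be illustrated with a small sketch. It assumes a plain linear scaling between the two images; the function name `window_to_original` and the example sizes are hypothetical and are not taken from the paper.

```python
def window_to_original(x, y, resized_size, original_size, win=8):
    """Map a win x win window whose top-left corner is (x, y) in a resized
    image back to a box (x0, y0, x1, y1) in original-image coordinates.

    Assumption: the resize is a plain linear scaling in each direction.
    """
    rw, rh = resized_size      # width, height of the resized image
    ow, oh = original_size     # width, height of the original image
    sx = ow / rw               # horizontal scale factor
    sy = oh / rh               # vertical scale factor
    return (x * sx, y * sy, (x + win) * sx, (y + win) * sy)


# A 64 x 128 image resized to 16 x 16: every 8 x 8 window in the resized
# image stands for a 32 x 64 region of the original image.
box = window_to_original(2, 1, resized_size=(16, 16), original_size=(64, 128))
print(box)
```

Each preset resizing ratio therefore fixes one candidate size and aspect ratio in the original image, which is why a single 8 × 8 window shape is sufficient after resizing.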
3. PROPOSED DATAFLOW ACCELERATOR
3.1 Accelerator Framework
Corresponding to the computation process of the BING algorithm, the
framework of the proposed dataflow accelerator shown in Fig. 1(a) is
divided into a resizing module, a kernel computing module and a
sorting module. Each module has the same function as the corresponding
computation in the algorithm but completes it in a different manner.
As shown in Fig. 1(a), the resizing module first resizes the original
image with preset ratios and then partitions the resized image into a
series of batches, each of which consists of four vertically adjacent
pixels. The kernel computing module works at batch granularity and is
divided into three stages: CalcGrad, SVM-I and NMS. It processes the
continuous batch stream with a parallel pipeline architecture and
exports a stream of candidates. The sorting module is designed to
obtain the top-k or top-n candidates with a bubble-pushing heap sort
strategy that satisfies the throughput requirements. Finally, the
region proposals are obtained after post-processing. Without loss of
generality, the kernel computing module is demonstrated with four
pipelines in this paper, and it could be extended to more pipelines.
The bubble-pushing heap sort model is very similar to [10], so its
details are omitted here.
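The role of the sorting module, selecting the top-k candidates from a stream without storing and sorting the whole stream, can be sketched in software with a bounded min-heap. This is only an illustration of streaming top-k selection using Python's standard `heapq` module; it is not the bubble-pushing hardware design of the paper or of [10], and the function name `top_k_stream` is hypothetical.

```python
import heapq


def top_k_stream(candidates, k):
    """Return the k highest-scoring (score, window) pairs from a stream.

    A min-heap of size k keeps the weakest retained candidate at the root,
    so each incoming candidate needs at most one O(log k) heap update.
    """
    heap = []
    for score, window in candidates:
        if len(heap) < k:
            heapq.heappush(heap, (score, window))
        elif score > heap[0][0]:
            # Replace the weakest retained candidate with the new one.
            heapq.heapreplace(heap, (score, window))
    return sorted(heap, reverse=True)


stream = [(0.2, (0, 0)), (0.9, (1, 0)), (0.5, (2, 0)), (0.7, (3, 0))]
print(top_k_stream(stream, 2))
```

Because only k entries are ever stored, memory use is independent of the stream length, which is the property a streaming sorting stage needs.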
3.2 Resizing Module
A naive resizing approach is carried out in [11], but it cannot
satisfy our requirements for streaming processing. The kernel
computing module requires a continuous batch stream from the resizing
module to keep the multiple pipelines fully loaded. In this work, the
original image is partitioned uniformly into four blocks as shown in
Fig. 2. Only one port of the configured BRAMs is assigned to each
block, so two dual-port or four single-port BRAMs are required for
processing each image.
Figure 2: Illustration of resizing module.
Figure 3: Continuous batch streaming with Ping-Pong cache.
The pixels from the four blocks are fetched in parallel by four
workers. Since the pixels fetched from different blocks are
discontinuous, a Ping-Pong cache, which consists of two cache lanes,
is adopted for buffering. The cache is also partitioned into four
parts, which correspond to the four BRAM ports. The fetching procedure
loads the pixels of different blocks in a rotating order and feeds
them to different parts of the cache. As shown in Fig. 3, two groups
of workers on the two cache lanes alternately provide a continuous
batch stream with the Ping-Pong cache strategy.
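The alternation between the two cache lanes can be sketched as a simple double-buffering loop. This is a sequential software model under simplifying assumptions: loading and exporting overlap in time in hardware but are written one after the other here, the rotation order of the four workers is not modelled, and the name `ping_pong_stream` is hypothetical.

```python
def ping_pong_stream(groups):
    """Yield batches in order while alternating between two cache lanes.

    `groups` is an iterable of lists of batches; each list stands for what
    the workers load into one cache lane during one period.
    """
    lanes = [None, None]   # the two cache lanes
    load = 0               # index of the lane currently being filled
    for group in groups:
        lanes[load] = list(group)        # workers fill the "load" lane
        drain = 1 - load
        if lanes[drain] is not None:     # the other lane is exported meanwhile
            yield from lanes[drain]
        load = drain                     # swap the roles of the two lanes
    last = 1 - load                      # flush the lane that was filled last
    if lanes[last] is not None:
        yield from lanes[last]


periods = [["b0", "b1"], ["b2", "b3"], ["b4", "b5"]]
print(list(ping_pong_stream(periods)))
```

After the first period, which only fills a lane, every period exports one full lane of batches, so the downstream pipelines see an uninterrupted stream.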
3.3 Kernel Computing Module
The kernel computing module includes the CalcGrad, SVM-I and NMS
stages. First, we define a distance in RGB color space between pixels
$P_a$ and $P_b$ as
\[ D(P_a, P_b) = \max_{q \in \{R,G,B\}} \left| P_a(q) - P_b(q) \right| \]
The gradients in the vertical and horizontal directions are defined as
$I_x(i,j)$ and $I_y(i,j)$, respectively, where $i$ and $j$ represent
the pixel location. They are obtained by calculating the distance
between neighboring pixels
\[ I_x(i,j) = D\left( P_{(i-1,j)}, P_{(i+1,j)} \right) \]
\[ I_y(i,j) = D\left( P_{(i,j-1)}, P_{(i,j+1)} \right) \]
and the normed gradient of each pixel $G(i,j)$ is calculated by
\[ G(i,j) = \min \left\{ I_x(i,j) + I_y(i,j),\ 255 \right\} \]
The normed gradients of all pixels are arranged as a two-dimensional
array.
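The CalcGrad definitions above can be written down directly as a reference computation. The sketch below is a plain, non-streaming version intended only to make the formulas concrete; clamping neighbor indices at the image border is an assumption, since the text does not state how borders are handled, and the function names are hypothetical.

```python
def rgb_distance(pa, pb):
    """D(Pa, Pb): largest absolute channel difference between two RGB pixels."""
    return max(abs(a - b) for a, b in zip(pa, pb))


def normed_gradient(img):
    """Compute G(i, j) = min(Ix + Iy, 255) for every pixel.

    `img` is a list of rows, each row a list of (R, G, B) tuples.
    Assumption: out-of-range neighbors are clamped to the nearest border pixel.
    """
    h, w = len(img), len(img[0])

    def px(i, j):
        return img[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    grad = []
    for i in range(h):
        row = []
        for j in range(w):
            ix = rgb_distance(px(i - 1, j), px(i + 1, j))   # Ix(i, j)
            iy = rgb_distance(px(i, j - 1), px(i, j + 1))   # Iy(i, j)
            row.append(min(ix + iy, 255))                   # saturate at 255
        grad.append(row)
    return grad


img = [
    [(0, 0, 0), (0, 0, 0), (0, 0, 0)],
    [(0, 0, 0), (10, 20, 30), (0, 0, 0)],
    [(200, 0, 0), (0, 255, 0), (0, 0, 0)],
]
g = normed_gradient(img)
print(g)
```

The saturation at 255 keeps every gradient within 8 bits, which fits the fixed bit-width storage used on the FPGA.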
In the SVM-I stage, the normed gradients $G_{8\times 8}$ of every
8×8 window are formed from the gradients $G_{1\times 8}$ of each row
and reshaped into a 64-dimensional feature in a row-wise manner. Then
the trained SVM weights $W_{\mathrm{SVM}}$ are applied to perform the
classification and determine the evaluation score of each window
\[ s = G_{8\times 8} \cdot W_{\mathrm{SVM}} \]
All the scores $s$ compose a two-dimensional array $S$. During the NMS
stage, the maximum score $\max_{5\times 5}$ of each 5×5 block of $S$
is determined by first finding the maximum score $\max_{1\times 5}$ of
each row and then taking the maximum of these row maxima. For each
5×5 block, only the window corresponding to $\max_{5\times 5}$ is
selected for the output window sequence.
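The SVM-I scoring and the row-then-block maximum of the NMS stage can be sketched as follows. This is a non-streaming reference under stated assumptions: the 5×5 blocks are treated as non-overlapping tiles (the text does not say whether blocks overlap), the weights are passed as an 8×8 array rather than a flattened 64-vector, and the function names are hypothetical.

```python
def svm1_scores(grad, weights):
    """Score every 8x8 window of `grad` as the dot product with `weights`.

    The sum is accumulated row by row, mirroring how G_{8x8} is built
    from eight G_{1x8} strips.
    """
    h, w = len(grad), len(grad[0])
    scores = []
    for i in range(h - 7):
        row = []
        for j in range(w - 7):
            s = 0
            for r in range(8):
                s += sum(grad[i + r][j + c] * weights[r][c] for c in range(8))
            row.append(s)
        scores.append(row)
    return scores


def nms_blocks(scores, size=5):
    """Return (max score, row, col) for each size x size tile of `scores`.

    Assumption: tiles do not overlap. Each tile's maximum is found by taking
    the maximum of each row strip first, then the maximum over the strips.
    """
    h, w = len(scores), len(scores[0])
    selected = []
    for bi in range(0, h, size):
        for bj in range(0, w, size):
            best = None
            for i in range(bi, min(bi + size, h)):
                strip = scores[i][bj:bj + size]
                row_max = max(strip)                      # max_{1x5}
                if best is None or row_max > best[0]:     # running max_{5x5}
                    best = (row_max, i, bj + strip.index(row_max))
            selected.append(best)
    return selected


grad = [[1] * 9 for _ in range(8)]
grad[0][8] = 5
ones = [[1] * 8 for _ in range(8)]
print(svm1_scores(grad, ones))

s_map = [[0] * 10 for _ in range(5)]
s_map[2][3] = 7
s_map[4][9] = 9
print(nms_blocks(s_map))
```

Splitting the 5×5 maximum into row maxima followed by a maximum over rows is what allows the intermediate $\max_{1\times 5}$ values to be cached and reused as the data streams through.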
A straightforward approach to the CalcGrad, SVM-I and NMS tasks
usually requires a temporary two-dimensional array for buffering
intermediate data, which wastes resources and computation cycles. In
our design, these stages are reformulated and mapped onto multiple
pipelines for streaming processing as shown in Fig. 4. Each of the
three stages has its own workspace with its own pipelines, which
operate in a streaming manner and can be connected serially to process
the batch stream continuously. Meanwhile, the data locality in each
workspace is exploited by the tiered cache, which is built from a
memory window and a line buffer [12] and caches all the data required
for batch pass synchronization.
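The combination of a line buffer and a memory window can be illustrated with a software model that produces every k×k window from a raster-order stream while storing only k-1 previous rows. This is a simplified sketch: it consumes one value per step rather than a batch of four vertical pixels, it does not reproduce the tiered cache of the paper, and the name `stream_windows` is hypothetical.

```python
from collections import deque


def stream_windows(stream, width, k):
    """Yield (row, col, window) for every k x k window of a raster stream.

    The line buffer holds the previous k-1 rows; the memory window holds
    the most recent k columns of the current k-row band.
    """
    line_buffer = [[0] * width for _ in range(k - 1)]
    window = deque(maxlen=k)
    for n, value in enumerate(stream):
        i, j = divmod(n, width)
        if j == 0:
            window.clear()               # a new image row starts
        # One column: k-1 values from the line buffer plus the new value.
        column = [line_buffer[r][j] for r in range(k - 1)] + [value]
        window.append(column)
        # Shift the line buffer up by one row at column j.
        for r in range(k - 2):
            line_buffer[r][j] = line_buffer[r + 1][j]
        if k > 1:
            line_buffer[k - 2][j] = value
        if i >= k - 1 and j >= k - 1:
            # Transpose the stored columns into window rows.
            yield i - k + 1, j - k + 1, [[col[r] for col in window] for r in range(k)]


windows = list(stream_windows(range(1, 10), width=3, k=2))
print(windows)
```

No full two-dimensional intermediate array is ever allocated: storage is k-1 rows plus one k×k window, which is the saving the streaming workspaces rely on.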
Since the pipelines behave similarly, the SVM-I stage is taken as an
example to illustrate the pipelines without loss of generality. As
shown in Fig. 4, the input batch stream is processed continuously by
the pipelines, which consequently generate a streaming output for the
following stage. While each workspace is processing, data are
continuously loaded from the tiered cache for the calculation of
$G_{8\times 8}$, and the calculation results are simultaneously stored
back to the tiered cache for the next operation.
At the end of the kernel computing module, the NMS operation usually
produces a non-continuous output stream. In this work, a FIFO
structure is adopted as a streaming buffer to keep the pipelines
running smoothly, which improves the overall efficiency of the
proposed accelerator.
4. EXPERIMENTAL RESULTS
4.1 Experiment Setup
The proposed accelerator is implemented in C and synthesized with
Xilinx Vivado HLS 2017 [13] for two target chips: Artix-7 (low voltage
version) @ 3.3MHz for always-on, low-power applications, and Kintex
UltraScale+ @ 100MHz for real-time, high-performance applications. A
careful quantization strategy is adopted to specify different
bit-widths for different data storage purposes. The resource
utilization after synthesis is listed in Table 1. The power
consumption and system latency are obtained by C-RTL co-simulation in
Vivado. The VOC2007 dataset [14] is used to evaluate the quality of
the proposed windows with two metrics: detection rate (DR vs. #WIN)
and mean average best overlap (MABO vs. #WIN), where #WIN is the
number of given proposals. DR vs. #WIN is the detection rate for the
given #WIN proposals [6], and MABO vs. #WIN is the mean average best
overlap for the given #WIN proposals [7]. Hence, a larger DR or MABO
value means better proposal quality.
4.2 Performance Evaluation
Figure 4: Demonstration of kernel computing module with the diagram of SVM-I pipeline.
Figure 5: Quality evaluation of windows proposals by comparing BING [6] and our proposed FPGA accelerator. (Plot "Proposals Quality Comparison": DR / MABO versus number of object proposal windows (#WIN); curves BING-DR, BING-MABO, FPGA-DR and FPGA-MABO.)
The IoU (intersection-over-union) parameter defined in previous work
[7] is also adopted in our evaluation. It is a scoring function that
measures the affinity of two bounding boxes, defined as the
intersection area of the two bounding boxes divided by the area of
their union. The DR and MABO metrics depend on the IoU overlap
threshold required for a correct detection, which is set to 0.4 by
default.
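The IoU score and the threshold test can be made concrete with a short sketch. The box format (x0, y0, x1, y1) and the helper name `is_detected` are assumptions made for illustration; only the IoU definition and the 0.4 default threshold come from the text.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_detected(ground_truth, proposals, threshold=0.4):
    """True if any proposal overlaps the ground-truth box by at least `threshold`."""
    return any(iou(ground_truth, p) >= threshold for p in proposals)


gt = (0, 0, 10, 10)
print(iou(gt, (5, 0, 15, 10)))                       # half-shifted box
print(is_detected(gt, [(5, 0, 15, 10), (1, 1, 10, 10)]))
```

The detection rate for a given #WIN is then the fraction of ground-truth boxes for which `is_detected` holds when only the first #WIN proposals are considered.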
Table 1: FPGA resource utilization on the two target devices: Artix-7 and Kintex UltraScale+.

Resources   Artix-7 Low Volt.† @ 3.3MHz    Kintex UltraScale+∗ @ 100MHz
            Available     Utilized         Available     Utilized
LUT         63400         54453            162720        56504
LUT-RAM     19000         4166             99840         3157
FF          126800        48611            325440        50079
BRAM        135           135              360           146
DSP         240           25               1368          25
BUF-G       -             -                256           8

† Targeted on Artix-7 (low voltage) xc7a100tlftg256-2L.
∗ Targeted on Kintex UltraScale+ xcku3p-ffva676-3-e.
The detection accuracy of our proposed accelerator is compared with
BING [6] as shown in Fig. 5. The BING algorithm generates 5000 object
windows to evaluate the accuracy. However, we observe that this yields
less than 3% accuracy improvement over generating only 1000 object
windows. Hence, only 1000 object windows are proposed in our design,
considering the significantly increased hardware resource requirements
of a larger window count. When evaluating 1000 proposals, the
detection rate of our proposed approach (denoted as FPGA-DR) is about
94.72%, while that of BING is about 97.63%.
Table 2: Speedup and power efficiency compared with Intel i7 and ARM platforms.

            Kintex UltraScale+              Artix-7 Low Volt.
            Speedup    Power efficiency     Speedup    Power efficiency
Intel i7    3.67X      >220X                0.12X      66X
ARM A53     68X        >250X                2.2X       >60X
Table 3: Performance evaluation of the Artix-7 and Kintex UltraScale+
targets, where Ptot is the total power consumption, which includes
static and dynamic power consumption, and Pdyn is the dynamic power
consumption.

Artix-7 Low Volt. @ 3.3MHz        Kintex UltraScale+ @ 100MHz
Ptot     Pdyn     Speed           Ptot      Pdyn      Speed
97mW     15mW     35fps           821mW     350mW     1100fps
The implemented FPGA accelerators are compared with two conventional
CPU platforms. The well-optimized BING algorithm achieves a proposal
speed of 300fps on an Intel i7-3940XM CPU (TDP: 55W) platform with
multi-threaded programming and subword parallelism techniques [6]. We
also evaluate BING on a Raspberry-Pi 3B platform with a 64-bit ARM A53
processor, which achieves a speed of 16fps at a power consumption of
3W∼4W [15]. In contrast, the proposed accelerator on the Kintex
UltraScale+ target achieves a proposal speed of 1100fps with a power
consumption of only 821mW when running at 100MHz, which is applicable
to real-time processing in multi-camera sensor fusion applications.
Hence, it achieves about 3.67X speedup and >220X energy efficiency
improvement compared with BING on the Intel i7 CPU. Furthermore, the
proposed accelerator targeted on the low voltage Artix-7 achieves
35fps with an extremely low power consumption of 97mW when running at
3.3MHz. This is attractive for ultra-low power applications with an
always-on working mode, and its speed is also sufficient for most
applications. The detailed comparison of our proposed FPGA
accelerators against the two CPU platforms is shown in Table 2.
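The ratios in Table 2 can be checked against the speed and power figures quoted above. The sketch below measures energy efficiency as frames per second per watt; taking the i7 TDP of 55W and the 3W lower end of the Raspberry Pi range as the actual power draw is an assumption made here for the check, and the names `PLATFORMS` and `compare` are hypothetical.

```python
def throughput_per_watt(fps, watts):
    """Frames per second delivered per watt of power."""
    return fps / watts


# (fps, watts) as quoted in the text. Assumption: the i7 draws its 55W TDP
# and the Raspberry Pi draws the 3W lower end of its 3W to 4W range.
PLATFORMS = {
    "kintex": (1100, 0.821),
    "artix": (35, 0.097),
    "i7": (300, 55.0),
    "arm": (16, 3.0),
}


def compare(fpga, cpu):
    """Return (speedup, energy-efficiency ratio) of `fpga` over `cpu`."""
    f_fps, f_w = PLATFORMS[fpga]
    c_fps, c_w = PLATFORMS[cpu]
    speedup = f_fps / c_fps
    efficiency = throughput_per_watt(f_fps, f_w) / throughput_per_watt(c_fps, c_w)
    return speedup, efficiency


for fpga in ("kintex", "artix"):
    for cpu in ("i7", "arm"):
        s, e = compare(fpga, cpu)
        print(f"{fpga} vs {cpu}: speedup {s:.2f}X, efficiency {e:.0f}X")
```

Under these assumptions the recomputed ratios are consistent with the entries of Table 2; using the 4W upper end for the Raspberry Pi would only increase the efficiency ratios against the ARM platform.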
5. CONCLUSIONS
Efficient region proposal generation is a critical task for object
detection. A scalable dataflow accelerator is proposed in this paper
for efficient region proposals on FPGA platforms. The proposal
procedures of the BING algorithm are reformulated in a dataflow-driven
manner and implemented on a multi-pipeline architecture. All of the
modules in the accelerator are organized with streaming processing
mechanisms so that the streaming data keep the pipelines fully loaded.
Furthermore, a tiered cache system is used to improve the bandwidth of
data synchronization between different processing stages. Evaluations
on the VOC2007 dataset show that our proposed accelerator achieves
about 3.67X speedup and >220X energy efficiency improvement over a
desktop CPU, with a small degradation in detection rate (about 94.72%
versus 97.63% for BING at 1000 proposals).
6. REFERENCES
[1] Alex Krizhevsky et al. Imagenet classification with deep
convolutional neural networks. In Advances in NIPS, pages
1097–1105, 2012.
[2] Ziming Zhang et al. Group membership prediction. In
Proceedings of the IEEE ICCV, pages 3916–3924, 2015.
[3] Gregory Castañón et al. Efficient activity retrieval through
semantic graph queries. In Proceedings of the 23rd ACM
Multimedia, pages 391–400, 2015.
[4] Joel Emer et al. Tutorial on hardware architectures for deep
neural networks. http://eyeriss.mit.edu/tutorial.html, 2016.
[5] Wei Liu et al. Ssd: Single shot multibox detector. In
Proceedings of ECCV, pages 21–37, 2016.
[6] Ming-Ming Cheng et al. BING: Binarized normed gradients
for objectness estimation at 300fps. In Proceedings of the
IEEE CVPR, pages 3286–3293, 2014.
[7] Ziming Zhang et al. Sequential optimization for efficient
high-quality object proposal generation. IEEE TPAMI, 2017.
[8] Yunchao Wei et al. Hcp: A flexible cnn framework for
multi-label image classification. IEEE TPAMI,
38(9):1901–1907, 2016.
[9] Shengxin Zha et al. Exploiting image-trained cnn
architectures for unconstrained video classification. BMVC,
2015.
[10] Wojciech M. Zabołotny. Dual port memory based heapsort
implementation for FPGA. Proceedings of SPIE, 2011.
[11] Nitish Kumar Srivastava et al. Accelerating Face Detection
on Programmable SoC Using C-Based Synthesis. In
Proceedings of ACM/SIGDA FPGA, pages 195–200, 2017.
[12] Fernando Martinez Vallina. Implementing memory structures
for video processing in the Vivado HLS tool. XAPP793 (v1.0),
September 20, 2012.
[13] Xilinx Inc. Vivado HLx 2017.
https://www.xilinx.com/products/design-tools/vivado.html.
[14] Mark Everingham et al. The PASCAL Visual Object Classes
Challenge 2007 (VOC2007) Results.
http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html.
[15] Raspberry Pi Power Consumption Benchmarks.
https://www.pidramble.com/wiki/benchmarks/power-consumption.
