RoadNet-RT: High Throughput CNN Architecture and SoC Design for
  Real-Time Road Segmentation by Bai, Lin et al.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
RoadNet-RT: High Throughput CNN Architecture
and SoC Design for Real-Time Road Segmentation
Lin Bai, Student Member, IEEE, Yecheng Lyu, Student Member, IEEE, and Xinming Huang, Senior Member, IEEE
Abstract—In recent years, convolutional neural networks have
gained popularity in many engineering applications, especially
computer vision. To achieve better performance, more complex
structures and advanced operations are often incorporated into
the networks, which results in very long inference time. For
time-critical tasks such as autonomous driving and virtual
reality, real-time processing is fundamental. To reach real-time
processing speed, a light-weight, high-throughput CNN
architecture named RoadNet-RT is proposed for road segmentation
in this paper. It achieves a 90.33% MaxF score on the test set of
the KITTI road segmentation task and 8 ms per frame when running
on a GTX 1080 GPU. Compared with the state-of-the-art network,
RoadNet-RT reduces the inference time by a factor of 20 at the
cost of only 6.2% accuracy loss. For hardware design
optimization, several techniques such as depthwise separable
convolution and non-uniform kernel size convolution are
customized to further reduce the processing time. The proposed
CNN architecture has been successfully implemented on a ZCU102
MPSoC FPGA platform that achieves a computation capability of
83.05 GOPS. The system throughput reaches 327.9 frames per second
with image size 1216×176.
Index Terms—road segmentation, real-time, FPGA, neural
network.
I. INTRODUCTION
NOWADAYS autonomous vehicles have become one of the most
promising technologies. Especially after the rapid development
of Convolutional Neural Networks (CNNs), the perception
capabilities of autonomous vehicles have reached very high
accuracy in tasks such as vehicle or pedestrian detection [1][2],
depth completion [3], road segmentation [4][5] and object
tracking [6]. However, most high-accuracy networks are very deep
and have a great number of redundant parameters. Even on
state-of-the-art GPUs, very few of them can run in real time.
This prevents their application to time-critical tasks like
autonomous driving. Therefore, a fast, light-weight CNN with
reasonable accuracy is valuable for such time-critical
applications.
The road segmentation task, one of the fundamental tasks for
autonomous driving, tells the vehicle where it can possibly
drive. In terms of accuracy, this task has been well solved by
many researchers. As a time-critical task, however, only 3 of the
existing methods are able to run in real time (as illustrated in
Fig. 1, where the red line indicates the real-time border), and
none of them has a throughput higher than 40 fps. In particular,
as a fundamental module prior to planning and control, road
segmentation is expected to process one image in less than 30 ms
to guarantee the real-time response of autonomous driving
systems. Thus, segmenting the drivable region in an extremely
short time while maintaining acceptable accuracy is urgently
needed to bridge the gap between academic research and industrial
practice.
L. Bai, Y. Lyu and X. Huang are with the Department of Electrical
and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA,
01609 USA e-mail: {lbai2,ylyu,xhuang}@wpi.edu.
Manuscript received
Fig. 1: Processing speed vs. accuracy on the KITTI road
segmentation test dataset. The red star indicates our method, and
colored dots represent other methods. All of these solutions
are tested on the GPU/CPU listed in the KITTI leaderboard
of the road segmentation task. The red line is the border of real time.
In this paper, we propose RoadNet-RT, a real-time road
segmentation network, which is able to run in real time
on GPUs. Besides, we summarize some optimization
techniques for converting ordinary CNN structures into
hardware-friendly ones. As an example, RoadNet-RT has
been implemented using these techniques and achieves
real-time processing as well. The contributions of this paper are
summarized as follows:
• A light-weight, high-throughput CNN named RoadNet-RT
is proposed, whose segmentation accuracy is 90.33%
on the KITTI road segmentation leaderboard. By extracting
features through two branches, one shallow branch
for spatial information and one deep branch for context
information, it reaches an inference time of 8 ms on an
NVIDIA GTX 1080. Compared with the state-of-the-art
RBANet [4], this network needs 1/20 of the inference time,
with only 6.2% loss in accuracy.
• To convert an ordinary segmentation CNN into a
hardware-friendly one (computation and bandwidth
efficient), we conduct several experiments and summarize
some guidelines quantitatively. As examples, we analyze
how to employ depthwise separable convolution, how
to deal with convolutions of different kernel sizes and
dilated convolution, and whether to use batch
normalization.
• A corresponding hardware accelerator has been
implemented on the Xilinx ZCU102 MPSoC platform. By
balancing bandwidth and computation capability, this
accelerator achieves 83.05 Giga Operations Per Second
(GOPS), corresponding to 327.9 frames per second (fps).
arXiv:2006.07644v1 [eess.IV] 13 Jun 2020
The rest of the paper is organized as follows: Sec. II
summarizes the existing research on road segmentation,
real-time segmentation CNNs and existing FPGA implementations
of segmentation CNNs. In Sec. III, the structure of the
proposed segmentation network is described together with
its training details. The guidelines for FPGA-CNN
co-optimization are given in Sec. IV. In the following two
sections, the detailed hardware architecture and its performance
are discussed. Sec. VII concludes the paper.
II. RELATED WORK
A. Road segmentation
Considerable research effort has been devoted to the road
segmentation task on KITTI. The RBANet proposed in [4] adopted
the classical encoder-decoder structure. Instead of the direct skip
connection used in U-Net [7] and SegNet [8], a residual refinement
module consisting of reversed attention and boundary attention
mechanisms bridged the encoder and decoder, so that
high-resolution spatial details were preserved for decoding.
An Atrous Spatial Pyramid Pooling (ASPP) module was also
utilized in RBANet. For images of size 360 × 720 running on a
GTX Titan XP, the processing time was 0.16 second per frame.
In [9], SSLGAN trained on unlabeled data and enhanced
road feature representations using a discriminator from a GAN.
Labeled data contain many redundant areas, so training on both
labeled and unlabeled data prevents overfitting
and accelerates convergence. Its processing speed
was 0.7 s per frame on a TITAN X. A road and road boundary
detection network (RBNet) was proposed in [5]. Based on a
Bayesian network, RBNet could simultaneously estimate the
probabilities of a pixel belonging to the road and to the
road boundary, so that road and road boundary detection
were combined into a single process. It was able to process
each frame in 0.18 s on a Tesla K20c (5 GB). StixelNet [10]
represented generic static obstacles as stixels and learned them
directly using a CNN. StixelNet II [10] was a unified network
with real-time detection capability for both categorized and
uncategorized objects. This network performed well on
column-based obstacle detection and road segmentation but was not
sensitive to the distinction of road boundaries. MultiNet [11]
used a shared VGG16-based encoder to
supply features to different decoders for classification,
segmentation and detection tasks. In the segmentation decoder, the
low-resolution segmentation feature map was convolved and
then upsampled using transposed convolution. It was claimed
that MultiNet could perform inference at 23 fps. The structure
of Up-Conv-Poly [12] was very similar to U-Net. It achieved
a MaxF score of 93.83%. For images of size 500 × 500, this
network could process each frame within 83 ms on a TITAN X
GPU.
Other CNN based road segmentation algorithms such as
DEEP-DIG [13] and MAP [14] generated a precise drivable
region but required heavy computational power.
In our previous work RoadNetV3 [15], we introduced Long
Short-Term Memory (LSTM) to help find the contour of
the road. It extracted features via an FCN-like encoder,
followed by several convolutional LSTM layers that predicted
the contours of the drivable region. It achieved 93.08% accuracy
but required 300 ms per frame.
B. Real-time segmentation
In recent years, some researchers have shifted their focus
to real-time segmentation tasks. Their solutions generally
fall into two groups (Fig. 2): encoder-decoder
networks and multi-branch networks.
(a) (b)
Fig. 2: The mainstream structures for real-time semantic seg-
mentation. (a) illustrates the u-shape encoder-decoder structure
and (b) demonstrates the multi-branch structure
FPENet [16] adopted the encoder-decoder structure. By
using a feature pyramid encoding block to encode multi-scale
contextual features with depthwise dilated convolutions
in all stages and a mutual embedding upsample module as the
decoder, FPENet efficiently aggregated high-level semantic
features and low-level spatial details. By introducing an
efficient spatial pyramid (ESP), ESPNet [17] brought great
improvement in both speed and performance. Its improved
version, ESPNet-V2 [18], further enlarged the receptive field
and reduced the computation and the number of parameters. In [19],
DABNet balanced efficiency and accuracy by stacking
light-weight blocks with different dilation rates. DFANet [20]
aggregated multi-scale features from different layers to gain
higher accuracy in spatial details. The light-weight backbone
of DFANet guaranteed its real-time processing speed.
ContextNet [21] was the first to propose the multi-branch
structure. A deep but low-resolution network extracted the
context information, and a shallow but high-resolution network
focused on detailed spatial information. BiSeNet [22] inherited
the solution of ContextNet and improved the feature fusion by
introducing an attention refinement module and a feature fusion
module. By adding a global pooling layer and a residual layer,
BiSeNet outperformed ContextNet. In ICNet [23], the authors
borrowed the image pyramid idea from PSPNet [24]. One more
branch was added to acquire more spatial details. Together with
label-guided training for each branch, ICNet had better accuracy
than BiSeNet but longer processing time. BiSeNet-V2 [25]
improved the first version by replacing the feature fusion module
with an aggregation module and using a Seg Head to guide the loss
of each feature extractor layer. Other networks such as LBN-AA
[26] and CANet [27] also used a similar structure.
Solutions other than the two mentioned above also show
good results. FarSee-Net [28] applied Cascaded Factorized
Atrous Spatial Pyramid Pooling (CF-ASPP) at the end of the
feature extraction layers to guarantee that enough spatial
information was captured. Moreover, to reduce the number
of operations, sub-pixel convolution was deployed, so that
FarSee-Net accepted a low-resolution input and generated a
high-resolution output.
C. FPGA implementation of segmentation
To accelerate inference, a great amount of effort has
focused on FPGA implementation of segmentation neural
networks. The key to a hardware accelerator for CNNs is
the trade-off between bandwidth and computation capability.
U-Net [7] and FCN [29] were both implemented in [30].
By utilizing a convolution plus border removing method, this
accelerator performed transposed convolution efficiently. Its
performance was 107 GOPS and it supported up to 17 fps for
512×512 images. A straightforward fully convolutional neural
network for segmentation was proposed and implemented
on FPGA [31][32]. Without changing the channel depth from
layer to layer and without the skip connections used in U-Net [7],
this accelerator reached 79.4 fps for input
size 64×180×14. Liu merged the convolution and transposed
convolution into one vector multiplication unit and fused all
intermediate feature maps in on-chip memory [33]. The
FPGA implementation reached 1578 GOPS, corresponding to 57
fps for 256×256×3 images. Another hardware architecture
combining the convolution and transposed convolution
operations was proposed in [34]. Its computation capability was
151.5 GOPS and 94.3 GOPS for convolution and transposed
convolution respectively. Besides, a 3D segmentation CNN
accelerator was implemented in [35].
III. PROPOSED NETWORK
The proposed road segmentation network is inspired by
ContextNet [21], BiSeNet [22] and ICNet [23]. It consists of
two branches for context information and spatial information
extraction respectively, as shown in Fig. 3.
Fig. 3: Real-time road segmentation network structure
The context path is a deep network aiming to learn the
context information. It consists of an input convolutional
layer and two residual modules from ResNet18 [36]. The
extracted features are then fed into an ASPP layer, which
concatenates features from different receptive fields
(dilation rates are 2, 4, 8 and 16). Finally, the attention
refinement module (ARM) from [22] is introduced to refine the
context information. In the ARM (Fig. 4a), a global average pooling
layer together with a 1×1 convolutional layer extracts a context
feature, whose result then refines the context features.
Since the context path does not have to focus on spatial
details, the input image is downscaled by a factor of 0.5 before
being fed into the context path to further reduce the number of
operations.
The spatial path, which focuses on the spatial details of the
input images, contains only three convolution layers. To enhance
its capability of capturing details, no image resizing is applied
here. The context and spatial branches are fused in a residual
refinement manner by the Feature Fusion Module (FFM) [22]
(Fig. 4b). The residual is the product of the input feature map
and its global attention path, which includes a global average
pooling layer, a 1×1 convolutional layer and activation layers
(ReLU and Sigmoid). At the end of the network, to produce an
output of the same size as the input, the output of the FFM is
upsampled 8 times by bilinear resizing.
The number of channels is chosen to be a factor of 64. This is
based on the degree of parallelism the hardware accelerator
can support, in order to maximize its efficiency.
A. Training Details
This road segmentation network is implemented using Keras
and trained from scratch on a single GeForce GTX 1080 GPU.
All the convolutional layers are initialized using the Xavier
uniform initializer [37]. During training, the batch size is set to
32. The Adam optimizer is used with a learning rate of 1e-3. When
training reaches a plateau, a reduction rate of 0.8 is applied to
the learning rate. A hybrid loss function combining Dice loss and
Focal loss is used to balance the positive and negative samples.
(a) (b)
Fig. 4: (a) structure of FFM, (b) structure of ARM [22]
Data augmentation for training includes random horizontal
flip, additive Gaussian noise, random brightness and contrast,
random blurring, etc.
B. Dataset and Evaluation
The dataset for training and evaluation is the KITTI road
segmentation dataset, which contains 289 training images and
290 testing images. The training image size ranges from
370×1224 to 375×1242. The evaluation is done by an online
evaluation server provided by KITTI. The evaluation (Tab. I)
is divided into Urban Unmarked (UU), Urban Marked (UM)
and Urban Multiple Marked lanes (UMM). URBAN ROAD
is the comprehensive evaluation of the above three.
Regarding speed, when running on a GeForce GTX 1080
GPU, this network can process each image (1216×176)
within 8 ms. Four sample predictions are shown in
front view and bird's-eye view in Fig. 5 and Fig. 6 respectively,
where the green area represents the overlap between prediction
and ground truth, the red area is road in the ground truth that is
not correctly predicted by our network, and the blue area is
non-road that is recognized as road by our network.
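As a consistency check on Tab. I, the F-measure can be recomputed from the reported precision (PRE) and recall (REC). The sketch below assumes that MaxF is the F1 score (harmonic mean of PRE and REC) at the best-scoring threshold, as in the KITTI road benchmark; the function names are illustrative and not part of any evaluation toolkit.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def false_negative_rate(recall: float) -> float:
    """FNR is the complement of recall."""
    return 1.0 - recall


# URBAN ROAD row of Tab. I: PRE = 89.55 %, REC = 91.13 %
urban_f = f_measure(0.8955, 0.9113)
urban_fnr = false_negative_rate(0.9113)
print(f"F-measure = {urban_f:.4f}, FNR = {urban_fnr:.4f}")
```

With these inputs the recomputed F-measure and FNR agree with the MaxF and FNR columns of the URBAN ROAD row to the reported precision.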
IV. NETWORK OPTIMIZATION FOR HARDWARE
In this section, we summarize some guidelines for optimizing
specific CNNs toward FPGA accelerator implementation, so
that the on-chip resource efficiency and computation efficiency
of the FPGA design are maximized. Different from conventional
optimization techniques, the goal of this step is to balance the
number of operations, the number of weights and the computation
patterns, while keeping the accuracy within a reasonable
range.
A. Depthwise Separable Convolution
Depthwise separable convolution was initially introduced in
[38] and has been widely adopted in a great number
of light-weight neural networks such as Xception [39] and the
MobileNet series [40][41]. The main idea of depthwise
separable convolution is to decompose a standard convolution into a
3×3 depthwise convolution and a 1×1 pointwise convolution
to achieve fewer weights and consequently fewer
operations. Fig. 7 illustrates how the depthwise separable
convolution works, where $D_K$ is the size of the convolution
kernel, M is the depth of the input feature maps and N is the
number of convolution kernels (also the channel number of
the output feature maps).
During depthwise convolution, a single filter is applied
to each input channel. The pointwise convolution then
applies a 1×1 convolution to combine the outputs of the
depthwise convolution. The numbers of weights required by
standard convolution and depthwise separable convolution are
calculated in (1) and (2) respectively.
$$D_K \cdot D_K \cdot M \cdot N \tag{1}$$
$$D_K \cdot D_K \cdot M + M \cdot N \tag{2}$$
Therefore, when replacing standard convolution with
depthwise separable convolution, the reduction ratio of weights is
$$\frac{D_K \cdot D_K \cdot M + M \cdot N}{D_K \cdot D_K \cdot M \cdot N} = \frac{1}{N} + \frac{1}{D_K^2} \tag{3}$$
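As a quick numerical check of (1) to (3), the weight counts can be evaluated for a concrete layer. The sketch below uses exact fractions; the layer shape (3×3 kernel, M = 32, N = 64) is chosen only for illustration and the function names are hypothetical.

```python
from fractions import Fraction


def standard_conv_weights(dk: int, m: int, n: int) -> int:
    """Number of weights of a standard convolution, Eq. (1)."""
    return dk * dk * m * n


def separable_conv_weights(dk: int, m: int, n: int) -> int:
    """Number of weights of a depthwise separable convolution, Eq. (2)."""
    return dk * dk * m + m * n


def reduction_ratio(dk: int, m: int, n: int) -> Fraction:
    """Ratio of separable to standard weights, left-hand side of Eq. (3)."""
    return Fraction(separable_conv_weights(dk, m, n),
                    standard_conv_weights(dk, m, n))


# Illustrative layer: 3x3 kernel, 32 input channels, 64 output channels
ratio = reduction_ratio(3, 32, 64)
print(ratio, float(ratio))
```

For any positive layer shape the ratio equals 1/N + 1/D_K^2 exactly, so for 3×3 kernels the saving approaches a factor of 9 as N grows.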
Fig. 5: Road segmentation results in camera view
Fig. 6: Road segmentation results in Bird-Eye View
TABLE I: Performance evaluation from KITTI online test server
Benchmark MaxF AP PRE REC FPR FNR
UM ROAD 88.16 % 90.24 % 87.67 % 88.65 % 5.68 % 11.35 %
UMM ROAD 93.04 % 94.20 % 92.16 % 93.94 % 8.78 % 6.06 %
UU ROAD 88.21 % 89.61 % 87.73 % 88.70 % 4.04 % 11.30 %
URBAN ROAD 90.33 % 91.63 % 89.55 % 91.13 % 5.86 % 8.87 %
(a) Standard convolution
(b) depthwise convolution
(c) pointwise convolution
Fig. 7: The comparison between standard convolution in (a)
and depthwise separable convolution with depthwise part in
(b) and pointwise part in (c)
Besides reducing the number of parameters and operations,
from the hardware implementation point of view,
depthwise separable convolution does not need as large an
accumulator as standard convolution. In standard
convolution, every element of the output feature map is the sum
of $D_K \cdot D_K \cdot M$ elements, while in depthwise separable
convolution it is the sum of $D_K \cdot D_K$ and $M$ elements for the
depthwise convolution and pointwise convolution respectively.
A narrower accumulator shortens the critical path and
consequently increases the achievable clock frequency of the FPGA.
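The accumulator argument can be made concrete with the usual worst-case bound: summing n products of two b-bit operands needs at most 2b + ceil(log2 n) bits. The sketch below applies this bound; the 8-bit operands, 3×3 kernel and M = 64 are illustrative assumptions, and the actual accumulator widths of the accelerator are not stated in the paper.

```python
def accumulator_bits(operand_bits: int, num_terms: int) -> int:
    """Worst-case width needed to sum `num_terms` products without overflow."""
    product_bits = 2 * operand_bits
    # ceil(log2(num_terms)) computed with integers only
    growth_bits = (num_terms - 1).bit_length()
    return product_bits + growth_bits


DK, M, BITS = 3, 64, 8  # illustrative values, not taken from the paper

standard = accumulator_bits(BITS, DK * DK * M)   # one sum over DK*DK*M terms
depthwise = accumulator_bits(BITS, DK * DK)      # sum over the kernel only
pointwise = accumulator_bits(BITS, M)            # sum over the channels only
print(standard, depthwise, pointwise)
```

Under these assumptions the standard convolution needs a 26-bit accumulator, while the depthwise and pointwise stages need 20 and 22 bits respectively.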
Applying this to the RoadNet-RT proposed in this paper, the
IOU drops from 89.93% to 87.63% while the total number of
parameters is reduced from 756,032 to 133,870, as listed in
Tab. II. Although the accuracy loss is 2.3%, the number of
parameters is reduced by a factor of about 5.6.
TABLE II: Comparison of RoadNet-RT with and without
depthwise separable convolution
Convolution type IOU¹ Parameters
Standard 89.93% 756,032
Depthwise separable 87.63% 133,870
¹ Since the KITTI online test server limits submissions to 3 per
month, 20% of the training set has been split off as a validation set to
evaluate the methods we propose. Here we choose IOU as the main metric
to estimate the performance of different methods. IOU is one of the most
important and most widely used metrics for segmentation performance
evaluation.
B. Large kernel size convolution
The most commonly used kernel size for convolution is
3×3. However, in order to obtain a large receptive field,
especially in the first layer, a large kernel size is usually
desired (7×7 in ResNet [36], for instance).
Algorithm 1 Cascaded loop of standard convolution
for no in Nof do                          ▷ output channel, loop-4
    for (y, x) in (Noy, Nox) do           ▷ feature map, loop-3
        for ni in Nif do                  ▷ input channel, loop-2
            for (ky, kx) in (K, K) do     ▷ kernel, loop-1
                Fout[no, y, x] += Fin[ni, y-ky, x-kx] * K[no, ni, ky, kx]
        Fout[no, y, x] += bias[no]
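For readers who want to trace the loop nest, Alg. 1 can be written as plain Python. This is a minimal sketch, not the accelerator's implementation: it uses nested lists, keeps the output the same size as the input, and treats out-of-range input positions as zero, which is an assumption because Alg. 1 does not specify the boundary handling.

```python
def conv2d_loops(fin, kernel, bias):
    """Direct convolution following the four nested loops of Alg. 1.

    fin:    [Nif][H][W] input feature maps
    kernel: [Nof][Nif][K][K] weights
    bias:   [Nof] one bias per output channel
    """
    nof, nif = len(kernel), len(fin)
    k = len(kernel[0][0])
    h, w = len(fin[0]), len(fin[0][0])
    fout = [[[0.0] * w for _ in range(h)] for _ in range(nof)]
    for no in range(nof):                      # loop-4: output channels
        for y in range(h):                     # loop-3: feature-map rows
            for x in range(w):                 # loop-3: feature-map columns
                acc = 0.0
                for ni in range(nif):          # loop-2: input channels
                    for ky in range(k):        # loop-1: kernel rows
                        for kx in range(k):    # loop-1: kernel columns
                            iy, ix = y - ky, x - kx
                            # assumed zero padding outside the feature map
                            if 0 <= iy < h and 0 <= ix < w:
                                acc += fin[ni][iy][ix] * kernel[no][ni][ky][kx]
                fout[no][y][x] = acc + bias[no]
    return fout
```

Unrolling loop-1 in hardware corresponds to evaluating the two innermost Python loops in parallel, and unrolling loop-2 corresponds to processing several input channels at once.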
However, dealing with filters of different kernel sizes affects
either the parallelism of processing or the efficiency of buffer
usage. From the matrix multiplication point of view (Alg. 1),
by keeping loop-1 as a sequential loop, the hardware accelerator
can handle filters of different sizes without consuming extra
multipliers, but the penalty is the loss of parallelism in loop-1.
Moreover, filters of different sizes require different amounts of
on-chip memory. Consider a feature map of size W ·H ·C: to buffer
it for a K ·K filter, a memory size of (W+K−1)·(H+K−1)·C is
needed. The feature map buffer for a 7×7 filter therefore needs
4·(W+H+8)·C more elements than that for a 3×3 filter, an
overhead of 4·(W+H+8)/(W ·H) relative to the unpadded feature map.
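The buffer requirement (W+K−1)·(H+K−1)·C can be evaluated directly. Expanding it for K = 7 and K = 3 gives a difference of 4·(W+H+8)·C elements, which the sketch below checks numerically. The feature-map shape is borrowed from the 22×152×32 buffers of Sec. V-D purely as an example; the function names are hypothetical.

```python
def buffer_elements(w: int, h: int, c: int, k: int) -> int:
    """Elements needed to buffer a W x H x C feature map for a K x K filter."""
    return (w + k - 1) * (h + k - 1) * c


def extra_elements_7x7_vs_3x3(w: int, h: int, c: int) -> int:
    """Additional buffer elements a 7x7 filter needs compared with 3x3."""
    return buffer_elements(w, h, c, 7) - buffer_elements(w, h, c, 3)


W, H, C = 152, 22, 32  # example shape only
extra = extra_elements_7x7_vs_3x3(W, H, C)
relative = extra / (W * H * C)  # overhead relative to the unpadded map
print(extra, round(relative, 4))
```

For this shape the 7×7 buffer needs 23,296 more elements, roughly 22% of the unpadded feature map, which illustrates why small feature maps suffer most from large kernels.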
To obtain the same receptive field as a 7×7 filter, three cascaded
convolutional layers with kernel size 3 × 3 can replace one
convolutional layer with kernel size 7 × 7. In that case, no
extra resources are needed, including both multipliers and memory.
Besides, the number of operations decreases. As illustrated
in Fig. 8, for an input feature map of size W · H · Ci and an output
feature map of size W · H · Co, a 7×7 filter costs
W ·H·7·7·Ci·Co = 49·W ·H·Ci·Co operations in total, whereas
three 3× 3 convolutional layers cost 3·(W ·H·3·3·Ci·Co) =
27·W ·H · Ci · Co.
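Both claims in this paragraph, equal receptive field and fewer operations, can be checked with a few lines. The sketch assumes stride-1 layers and, as in the count above, that each of the three 3×3 layers maps Ci to Co channels; the helper names and the example shape are illustrative.

```python
def conv_ops(w: int, h: int, k: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulate count of one K x K convolution on a W x H map."""
    return w * h * k * k * c_in * c_out


def stacked_receptive_field(k: int, layers: int) -> int:
    """Receptive field of `layers` cascaded stride-1 K x K convolutions."""
    return 1 + layers * (k - 1)


W, H, CI, CO = 152, 22, 32, 64  # example shape only

ops_7x7 = conv_ops(W, H, 7, CI, CO)
ops_3x3_stack = 3 * conv_ops(W, H, 3, CI, CO)
print(stacked_receptive_field(3, 3), ops_3x3_stack / ops_7x7)
```

Three 3×3 layers reach a 7×7 receptive field while needing 27/49, about 55%, of the operations of the single 7×7 layer under this simplified count.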
Fig. 8: Strategy for large convolutional layer replacement
The performance comparison between these two options
is shown in Tab. III. When the first convolutional layer (7×7)
is replaced with three 3×3 convolutional
layers, the accuracy loss in IOU is 0.19%. Since there is
only one 7×7 convolutional layer, the savings in operations and
parameters are negligible.
TABLE III: Comparison between 7×7 convolution and its
replacement (Ci is the input feature map channel number and
Co is the output feature map channel number; they equal 32
and 64 respectively in this experiment)
Method IOU parameter
1 conv 7× 7 89.93% 7·7·Ci ·Co
3 conv 3× 3 89.74% 3·3·Ci ·Co + 3·3·Co ·Co + 3·3·Co ·Co
In segmentation networks, dilated convolution [42]
is the most widely used method to enlarge the receptive
field without introducing more weights. Unfortunately, during
convolution with a dilated kernel (3×3 with a dilation rate of 3,
for instance), the region required from the feature map is still 7×7,
so the dilemma described above remains. The only
difference is that if three 3×3 convolutional layers are used instead
of one dilated 3×3 convolutional layer with a dilation rate of
3, two times more weights and two times more operations are
unavoidable. However, since dilated convolutional layers
usually do not dominate the network, this penalty is still affordable.
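The 7×7 region quoted for a dilated 3×3 kernel follows from the standard effective-size relation d·(K−1)+1. The sketch below evaluates it for the dilation rate discussed here and for the ASPP rates listed in Sec. III, and compares the weight counts under the assumption of equal input and output channel numbers (64 is an arbitrary example).

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Side length of the input region touched by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def conv_weights(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights of one K x K convolution (dilation adds no weights)."""
    return kernel_size * kernel_size * c_in * c_out


# 3x3 kernel with dilation rate 3 covers a 7x7 region
region = effective_kernel_size(3, 3)

# Dilation rates of the ASPP layer in Sec. III: 2, 4, 8 and 16
aspp_regions = [effective_kernel_size(3, d) for d in (2, 4, 8, 16)]

# Three plain 3x3 layers vs. one dilated 3x3 layer, equal channels assumed
c = 64
stack_weights = 3 * conv_weights(3, c, c)
dilated_weights = conv_weights(3, c, c)
print(region, aspp_regions, stack_weights - dilated_weights)
```

The stack carries three times the weights of the dilated layer, i.e. two times more, matching the penalty stated above.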
TABLE IV: Performance comparison between dilated convolution
(3× 3 with dilation rate 3) and its replacement
Method IOU parameter
1 conv 3× 3, dilation rate 3 89.93% 3·3·Ci ·Co
3 conv 3× 3 89.78% 3·3·Ci ·Co + 3·3·Co ·Co + 3·3·Co ·Co
C. Consideration of channel depth
In our hardware implementation, considering the resources
available on the ZCU102 board, loop-2 in Alg. 1 has been
unrolled so that 32 feature maps are processed in parallel. To
maximize the computation efficiency of the accelerator, the
input feature map depth of every layer should preferably be an
integer multiple of 32.
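The effect of channel alignment on efficiency can be estimated with a simple utilization model. The sketch assumes that a layer with C input channels needs ceil(C/32) passes through the 32-wide unrolled loop-2 and that unused lanes in the last pass are idle; this model is an illustration and not a measurement from the accelerator.

```python
def pe_utilization(channels: int, parallelism: int = 32) -> float:
    """Fraction of parallel lanes doing useful work for a given channel depth."""
    passes = -(-channels // parallelism)  # ceiling division
    return channels / (passes * parallelism)


for depth in (3, 32, 48, 64):
    print(depth, pe_utilization(depth))
```

Under this model, depths of 32 and 64 keep every lane busy, whereas a depth of 48 wastes a quarter of the lanes.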
D. Batch Normalization
During inference, Batch Normalization (BN) reduces to a
1×1 convolution and can be further merged into the convolutional
layer preceding it. The merged weights and bias follow (4) and
(5), where W and b represent weights and bias respectively.
$$W_{merge} = W_{BN} \cdot W_{conv} \tag{4}$$
$$b_{merge} = W_{BN} \cdot b_{conv} + b_{BN} \tag{5}$$
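The merge in (4) and (5) can be verified numerically for a single output channel. The sketch assumes the usual inference-time form of BN, y = gamma·(x − mean)/sqrt(var + eps) + beta, so that W_BN = gamma/sqrt(var + eps) and b_BN = beta − W_BN·mean; these symbols and all function names are introduced here for illustration only.

```python
import math


def bn_as_affine(gamma, beta, mean, var, eps=1e-3):
    """Rewrite inference-time BN as y = w_bn * x + b_bn for one channel."""
    w_bn = gamma / math.sqrt(var + eps)
    b_bn = beta - w_bn * mean
    return w_bn, b_bn


def merge_conv_bn(w_conv, b_conv, w_bn, b_bn):
    """Fold BN into the preceding convolution for one output channel."""
    w_merge = [w_bn * w for w in w_conv]   # Eq. (4)
    b_merge = w_bn * b_conv + b_bn         # Eq. (5)
    return w_merge, b_merge


def conv_out(x, w, b):
    """Dot product plus bias: one output pixel of one output channel."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b
```

Applying `conv_out` with the merged parameters gives the same value as the convolution followed by BN, up to floating-point rounding, which is why the BN layer costs nothing extra at inference when it directly follows a convolution.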
The batch normalization layer is helpful for fast convergence
but is not always necessary with respect to accuracy
(PointNet [43], for instance). The contribution of the BN layer is
evaluated in Tab. V, from which we find that in our segmentation
network BN increases the accuracy by 1.05%
without much difference in convergence. Therefore, BN
layers are kept in RoadNet-RT.
TABLE V: Performance comparison with and without the BN
layer; both are trained using the same batch size and
the same GPU
Method with BN without BN
IOU 89.93% 88.88%
converge@epoch 350 344
duration/epoch 9s 5s
Some experiments reported that placing BN after ReLU usually
gives better results [44], but this may vary from one network
to another.
E. Quantization
To maximize the computation capability of the FPGA,
fixed-point operations are preferred. Quantization-aware training
has been performed for 8-bit and 16-bit precision respectively with
the help of the model optimization library QKeras [45].
Brute-force quantization may lead to unacceptable precision loss,
whereas quantization-aware training restricts the bit-width during
training. This not only compensates for the precision loss but
also introduces more non-linearity.
The performance after quantization is shown in Tab. VI. The
accuracy (IOU) of 8-bit quantization is 86.85%, while that of
16-bit quantization is 87.08%. Although the accuracy of 16-bit
quantization is 0.23% higher than that of 8-bit quantization, two
considerations matter when choosing which one to implement in
hardware: 1) from the storage perspective, the memory space for
8-bit weights is only half of that for 16-bit weights; 2) from
the hardware resource perspective, each DSP48E2 slice can
perform two 8-bit multiplications simultaneously but only one
16-bit multiplication [46].
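To make the storage argument concrete, the sketch below shows a plain signed fixed-point quantizer and the resulting parameter storage. It is not QKeras code and does not reproduce its quantizers; the split between integer and fractional bits is a free assumption, and the parameter count of 133,870 is taken from Tab. II.

```python
import math


def quantize_fixed_point(x: float, total_bits: int, frac_bits: int) -> int:
    """Round to the nearest fixed-point code and saturate to the signed range."""
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = math.floor(x * scale + 0.5)          # round half up
    return max(q_min, min(q_max, q))         # saturate


def dequantize(q: int, frac_bits: int) -> float:
    """Map a fixed-point code back to a real value."""
    return q / (1 << frac_bits)


def parameter_storage_bytes(num_params: int, total_bits: int) -> int:
    """Bytes needed to store all weights at the given bit-width."""
    return num_params * total_bits // 8


NUM_PARAMS = 133_870  # depthwise separable RoadNet-RT, Tab. II
size_8bit = parameter_storage_bytes(NUM_PARAMS, 8)
size_16bit = parameter_storage_bytes(NUM_PARAMS, 16)
print(size_8bit / 2**20, size_16bit / 2**20)
```

The computed sizes, roughly 0.128 MiB and 0.255 MiB, are close to the parameter sizes reported in Tab. VI.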
TABLE VI: Performance of 8-bit and 16-bit quantized net-
works
Bit Width IOU size of parameters
8-bit 86.85% 0.127MB
16-bit 87.08% 0.255MB
V. SYSTEM-ON-CHIP IMPLEMENTATION
To fully utilize the computation resources, the whole system
is partitioned into a software part (running on the ARM processor)
and a hardware part (running on the FPGA). The software part
resizes the images at both the input and the output of the neural
network (Fig. 3). With the help of the OpenCV library [47], image
resizing can be easily done on the PYNQ platform.
Fig. 9: Strategy for large convolutional layer replacement
An overview of the hardware architecture is shown in
Fig. 9. It consists of a depthwise convolution module, a
pointwise convolution module, feature map buffers and weight
buffers. A finite state machine controls the execution order
of the CNN operations. All the modules mentioned above are
configurable based on the on-chip resources available on the
target FPGA platform.
A. Depthwise convolution module
The depthwise convolution module contains line buffers,
processing engines (PEs) and adder trees. As described in the
previous section, to unroll the kernel loop (loop-1 in Alg. 1), a
line buffer is needed to generate the sliding patch. Since the
kernel size of all the convolutional layers in this segmentation
network is 3 × 3, a multiplier array of length 9 follows the line
buffer. Correspondingly, an adder tree at the end sums up the
products. To balance computation efficiency and on-chip
resources, the batch size of the depthwise convolution module
is set to 32.
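The data flow of this module (line buffer, nine multipliers, adder tree) can be mimicked in software for one channel. The sketch below is a behavioral model under stated assumptions: rows are streamed one at a time, only the two previous rows are retained, no padding is applied, and the kernel is applied without flipping; none of these details are specified in the text.

```python
from collections import deque


def adder_tree(values):
    """Sum values pairwise, level by level, like a hardware adder tree."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through to next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def depthwise_conv3x3(channel, kernel):
    """3x3 depthwise convolution of one channel using two line buffers.

    channel: H x W list of rows, streamed row by row
    kernel:  3 x 3 list of weights
    Returns the (H-2) x (W-2) output without padding.
    """
    width = len(channel[0])
    line_buffer = deque(maxlen=2)   # holds the two previous rows
    out = []
    for row in channel:
        if len(line_buffer) == 2:
            rows = [line_buffer[0], line_buffer[1], row]
            out_row = []
            for x in range(width - 2):
                # the 9 multipliers work on one sliding 3x3 patch
                products = [rows[ky][x + kx] * kernel[ky][kx]
                            for ky in range(3) for kx in range(3)]
                out_row.append(adder_tree(products))
            out.append(out_row)
        line_buffer.append(row)
    return out
```

In the accelerator, 32 such channel pipelines would run side by side, which is one way to read the batch size of 32 mentioned above.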
Fig. 10: Block diagram of depthwise convolution module
B. Pointwise convolution module
To align with the depthwise convolution module and fit the same
size of feature buffers, the pointwise convolution module is
designed to handle the multiplication of a 32×1 vector by a
32×32 matrix. Three components, a multiplier array, an adder tree
and a ReLU module, form the pointwise convolution module. If the
batch normalization layer is placed before the ReLU layer, it can
be merged into the convolution and completed by the multiplier
array and adder tree. Otherwise, 1 extra multiplier and 1 extra
adder are necessary to perform the batch normalization operation.
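Functionally, the pointwise module computes a vector-matrix product followed by ReLU at every pixel. The sketch below is a behavioral model only; the weight layout `weights[i][o]` (input channel first) is an assumption made for the example.

```python
def pointwise_conv_relu(vec, weights):
    """1x1 convolution at one pixel followed by ReLU.

    vec:     values of the input channels at one pixel (length 32 here)
    weights: matrix indexed as weights[input_channel][output_channel]
    """
    n_out = len(weights[0])
    out = []
    for o in range(n_out):
        # multiplier array + adder tree: one dot product per output channel
        acc = sum(vec[i] * weights[i][o] for i in range(len(vec)))
        out.append(max(0, acc))      # ReLU
    return out
```

With a 32×32 identity matrix as weights, the function simply clips the negative entries of the input vector to zero.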
Fig. 11: Block diagram of pointwise convolution module
C. ARM Module and FFM Module
Both the ARM and FFM modules require operations with
totally different computation patterns. Global average pooling
calculates the average value of one entire channel. Therefore,
an accumulator plus one multiplier has been implemented for
each channel. The following 1 × 1 convolution is
mathematically a vector-matrix multiplication, which can either be
routed to the pointwise convolution module or implemented
with extra resources, given that the resource consumption of this
operation is small. The sigmoid function is approximated by a
piecewise function and implemented using a look-up table.
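Both operations can be modeled in a few lines. The sketch below accumulates a channel and multiplies once by the precomputed reciprocal of the pixel count, and approximates the sigmoid with a piecewise-constant look-up table. The table depth (64 entries) and input range ([-8, 8)) are assumptions chosen for illustration; the paper does not give the actual table parameters.

```python
import math


def global_average_pool(channel):
    """Accumulate all pixels, then multiply once by the precomputed 1/N."""
    acc = 0.0
    count = 0
    for row in channel:
        for v in row:
            acc += v
            count += 1
    return acc * (1.0 / count)


LUT_SIZE = 64                       # assumed table depth
X_RANGE = 8.0                       # assumed input range [-8, 8)
STEP = 2 * X_RANGE / LUT_SIZE
# each entry stores the sigmoid at the center of its input interval
SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(-X_RANGE + (i + 0.5) * STEP)))
               for i in range(LUT_SIZE)]


def sigmoid_lut(x: float) -> float:
    """Piecewise-constant sigmoid: index the table, saturating at both ends."""
    idx = math.floor((x + X_RANGE) / STEP)
    idx = max(0, min(LUT_SIZE - 1, idx))
    return SIGMOID_LUT[idx]
```

Because the slope of the sigmoid never exceeds 0.25, the error of this table stays below half a step times 0.25, i.e. about 0.031; a deeper table or linear interpolation would reduce it further.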
D. Buffers
The on-chip memory is divided into buffers for feature
maps, weights and global pooling results respectively. In this
design, 1) there are no biases, so no extra buffer is needed
for bias storage, and 2) since the weights occupy only a small
portion of the on-chip memory, they can be hard-coded
into on-chip memory.
To boost the processing speed, one effective way is to
reduce the number of data transfers between the FPGA
and DDR memory. Multiple feature map buffers of size
22×152×32 have been implemented as ping-pong buffers to
reduce data swapping as much as possible.
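A rough storage estimate for these buffers follows from their dimensions. The sketch below computes the raw bit count of one 22×152×32 buffer and of eight such buffers, the number reported for the 16-bit version in Sec. VI. It ignores how data are packed into physical block RAMs, so the figures are lower bounds rather than the BRAM usage of Tab. VII.

```python
def buffer_bits(height: int, width: int, channels: int, word_bits: int) -> int:
    """Raw storage of one feature map buffer in bits."""
    return height * width * channels * word_bits


ONE_BUFFER_16BIT = buffer_bits(22, 152, 32, 16)
EIGHT_BUFFERS_16BIT = 8 * ONE_BUFFER_16BIT
print(ONE_BUFFER_16BIT, EIGHT_BUFFERS_16BIT / 2**20)
```

Note that 22 = 176/8 and 152 = 1216/8, which would be consistent with buffering feature maps at 1/8 of the 1216×176 input resolution, although the paper does not state this explicitly.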
E. Tasks on ARM Processor
Referring to Fig. 9, the entire CNN is implemented on the
FPGA side. The remaining task is image resizing at the input and
output of the CNN. Two threads of the ARM processor
are used for input image resizing and output image resizing
respectively.
Fig. 12: Strategy for large convolutional layer replacement
VI. RESULTS AND DISCUSSION
The implementation tools used in this paper are Xilinx
Vivado HLS and the MATLAB HDL Coder Toolbox. The whole
system has been implemented on the ZCU102 development kit
with the PYNQ system installed (the system setup is shown in
Fig. 13). There are 548,160 flip-flops (FFs), 274,080 look-up
tables (LUTs), 1,824 block RAMs (BRAMs, 32.1 Mb)
and 2,520 DSPs on the board. The FPGA resource consumption
of this accelerator for both the 16-bit and 8-bit quantization
formats is shown in Tab. VII.
Fig. 13: Setup of road segmentation system
TABLE VII: FPGA on-chip resource usage of the road seg-
mentation network
bitwidth FF LUT DSP BRAM
8-bit 113067 257204 1560 1057
16-bit 115158 260616 1560 1222
Since each DSP48E2 slice can handle two 8-bit × 8-bit
multiplications but only one 16-bit multiplication,
the 8-bit accelerator consumes almost the same number of DSP
slices and BRAMs as the 16-bit one but processes twice the
number of input images. When running at 200 MHz, the
16-bit accelerator is capable of processing 327.9 fps. In
the 8-bit case, the processing speed is doubled to 655.8 fps.
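The reported figures can be related to each other with simple arithmetic. The sketch below derives the implied number of operations per frame from 83.05 GOPS at 327.9 fps and the speedup over the 8 ms GPU inference time; it assumes that the GOPS figure refers to the 16-bit design at 327.9 fps, as the contribution list suggests, and it does not know how the paper counts a multiply-accumulate.

```python
def ops_per_frame(gops: float, fps: float) -> float:
    """Operations per frame implied by a throughput in GOPS and a frame rate."""
    return gops * 1e9 / fps


def speedup_over_gpu(fpga_fps: float, gpu_ms_per_frame: float) -> float:
    """Ratio of FPGA frame rate to the GPU frame rate."""
    gpu_fps = 1000.0 / gpu_ms_per_frame
    return fpga_fps / gpu_fps


mega_ops = ops_per_frame(83.05, 327.9) / 1e6
speedup_16bit = speedup_over_gpu(327.9, 8.0)
frame_time_ms = 1000.0 / 327.9
print(round(mega_ops, 1), round(speedup_16bit, 2), round(frame_time_ms, 2))
```

This gives roughly 253 million operations per frame, a frame time of about 3 ms, and a speedup of about 2.6 over the GPU.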
In Tab. VIII, all the image-based road segmentation solutions
on the KITTI leaderboard are summarized and compared to
our solution on GPU and FPGA. Most of the existing methods
take 100 ms or longer. FCN-LC [48], one of the only two
real-time solutions, runs on a TITAN X GPU, which requires a
600-650 W PC power supply. Therefore, our solution
offers a well-balanced and practical way to run the road
segmentation task on embedded devices.
In this accelerator (the 16-bit version, for instance), 8
feature map buffers are allocated. This number may vary
according to the balance between the resources available on the
target platform and the required processing speed. More feature
map buffers can store more intermediate feature maps on chip and
consequently increase the processing speed, while fewer feature
map buffers require more temporary data to be stored in external
memory rather than on chip, which leads to longer
processing time.
VII. CONCLUSION
This paper presents a real-time, high-throughput convolutional
neural network architecture for road segmentation.
Several optimization techniques are applied to reduce the
number of operations while preserving accuracy. The network
achieves a 90.33% F1 score at 125 fps on a GTX 1080 GPU
(for image size 1216 × 176). More importantly, using
RoadNet-RT as an example, we present a systematic approach
to network optimization for hardware implementation.
Following it as a guideline, one can easily convert any
existing network structure into a computation-efficient,
high-throughput architecture for FPGA with little or no
accuracy loss. Several experiments have been conducted to
support the proposed approach. Finally, an SoC design has
been successfully demonstrated on the ZCU102 development
kit, which speeds up processing by a factor of 2.6 compared
with its GPU implementation.
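The headline figures can be cross-checked against each other with simple arithmetic. The sketch below assumes that the 83.05 GOPS quoted in the abstract corresponds to the 16-bit accelerator running at 327.9 fps; the paper states the two numbers together but does not spell out this pairing, so the derived operations-per-frame value is indicative only. All names are hypothetical.

```python
FPGA_FPS_16BIT = 327.9   # 16-bit accelerator at 200 MHz
FPGA_FPS_8BIT = 655.8    # 8-bit accelerator at 200 MHz
GPU_FPS = 125.0          # GTX 1080, i.e. 8 ms per frame
FPGA_GOPS = 83.05        # computation capability quoted in the abstract


def frame_latency_ms(fps: float) -> float:
    """Per-frame processing time in milliseconds for a given frame rate."""
    return 1000.0 / fps


def speedup_over_gpu(fpga_fps: float, gpu_fps: float = GPU_FPS) -> float:
    """Throughput ratio of the FPGA accelerator to the GPU implementation."""
    return fpga_fps / gpu_fps


# Implied workload per frame, under the pairing assumption stated above.
giga_ops_per_frame = FPGA_GOPS / FPGA_FPS_16BIT

if __name__ == "__main__":
    print(f"16-bit: {frame_latency_ms(FPGA_FPS_16BIT):.2f} ms/frame, "
          f"{speedup_over_gpu(FPGA_FPS_16BIT):.1f}x over GPU")
    print(f"8-bit:  {frame_latency_ms(FPGA_FPS_8BIT):.2f} ms/frame, "
          f"{speedup_over_gpu(FPGA_FPS_8BIT):.1f}x over GPU")
    print(f"implied workload: {giga_ops_per_frame:.3f} GOP per frame")
```

The 16-bit throughput ratio rounds to the factor of 2.6 stated above.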
REFERENCES
[1] X. Du, M. H. Ang, S. Karaman, and D. Rus, “A general pipeline for
3d detection of vehicles,” in 2018 IEEE International Conference on
Robotics and Automation (ICRA). IEEE, 2018, pp. 3194–3200.
[2] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparse-to-dense 3d
object detector for point cloud,” in Proceedings of the IEEE International
Conference on Computer Vision, 2019, pp. 1951–1960.
TABLE VIII: Performance comparison of all the image-based road segmentation solutions in the KITTI leaderboard (blank means it is not mentioned in the original paper)

| Name | CNN-based | Input shape | Devices | Accuracy (MaxF) | Processing speed |
|---|---|---|---|---|---|
| RBANet [4] | ✓ | 360 × 720 | TITAN XP | 96.30% | 160 ms |
| SSLGAN [9] | ✓ | 375 × 1242 | TITAN X | 95.53% | 700 ms |
| RBNet [5] | ✓ | 300 × 900 | Tesla K20c | 94.97% | 180 ms |
| StixelNet-II [10] | ✓ | 800 × 370 | Quadro M6000 | 94.88% | 1200 ms |
| MultiNet [11] | ✓ | 1248 × 384 | | 94.88% | 170 ms |
| RoadNet3 [15] | ✓ | 600 × 160 × 5 | GTX 950M | 94.44% | 300 ms |
| DEEP-DIG [13] | ✓ | | Titan X | 93.98% | 140 ms |
| Up-Conv-Poly [12] | ✓ | 500 × 500 | TITAN X | 93.83% | 83 ms |
| OFA-Net [49] | ✓ | | | 93.74% | 40 ms |
| Up-Conv [12] | ✓ | 300 × 300 | GTX TITAN X | 92.39% | 52.2 ms |
| ALO-AVG-MM [50] | ✓ | 624 × 192 | GTX 1080 | 92.03% | 29.6 ms |
| FTP [14] | ✓ | | | 91.61% | 280 ms |
| PT-ResNet [51] | ✓ | | GTX 1080 Ti | 91.61% | 300 ms |
| FCN-LC [48] | ✓ | 621 × 187 | TITAN X | 90.79% | 30 ms |
| StixelNet [52] | ✓ | 24 × 370 | | 89.12% | 1000 ms |
| MAP [14] | ✓ | | | 87.80% | 280 ms |
| SPRAY [53] | ✓ | 800 × 600 | GTX 580 | 87.09% | 45 ms |
| multi-task CNN [54] | ✓ | 375 × 1242 | unknown type GPU | 86.81% | 25.1 ms |
| PGM-ARS [55] | ✓ | ∼75 × 248 | Intel i7-4700MQ processor | 85.69% | 50 ms |
| SRF [56] | ✗ | 500 × 250 | | 82.44% | 200 ms |
| ARSL-AMI [57] | ✗ | | | 80.36% | 50 ms |
| CN [58] | ✗ | | | 79.02% | 2000 ms |
| Ours (floating point) | ✓ | 176 × 1216 | GTX 1080 | 90.33% | 8 ms |
[3] X. Cheng, P. Wang, C. Guan, and R. Yang, “Cspn++: Learning context
and resource aware convolutional spatial propagation networks for depth
completion,” arXiv preprint arXiv:1911.05377, 2019.
[4] J.-Y. Sun, S.-W. Kim, S.-W. Lee, Y.-W. Kim, and S.-J. Ko, “Reverse and
boundary attention network for road segmentation,” in Proceedings of the
IEEE International Conference on Computer Vision Workshops, 2019, pp.
0–0.
[5] Z. Chen and Z. Chen, “Rbnet: A deep neural network for unified road
and road boundary detection,” in International Conference on Neural
Information Processing. Springer, 2017, pp. 677–687.
[6] W. Choi, “Near-online multi-target tracking with aggregated local flow
descriptor,” in Proceedings of the IEEE international conference on
computer vision, 2015, pp. 3029–3037.
[7] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” in International Conference on
Medical image computing and computer-assisted intervention. Springer,
2015, pp. 234–241.
[8] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep con-
volutional encoder-decoder architecture for image segmentation,” IEEE
transactions on pattern analysis and machine intelligence, vol. 39, no. 12,
pp. 2481–2495, 2017.
[9] X. Han, J. Lu, C. Zhao, S. You, and H. Li, “Semisupervised and weakly
supervised road detection based on generative adversarial networks,”
IEEE Signal Processing Letters, vol. 25, no. 4, pp. 551–555, 2018.
[10] N. Garnett, S. Silberstein, S. Oron, E. Fetaya, U. Verner, A. Ayash,
V. Goldner, R. Cohen, K. Horn, and D. Levi, “Real-time category-based
and general obstacle detection for autonomous driving,” in Proceedings
of the IEEE International Conference on Computer Vision, 2017, pp.
198–205.
[11] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun,
“Multinet: Real-time joint semantic reasoning for autonomous driving,”
in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp.
1013–1020.
[12] O. G. Leivas, W. Burgard, and T. Brox, “Efficient deep methods for
monocular road segmentation,” in IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2016), 2016.
[13] J. Munoz-Bulnes, C. Fernandez, I. Parra, D. Fernández-Llorca, and
M. A. Sotelo, “Deep fully convolutional networks with random data
augmentation for enhanced generalization in road detection,” in 2017
IEEE 20th International Conference on Intelligent Transportation Systems
(ITSC). IEEE, 2017, pp. 366–371.
[14] A. Laddha, M. K. Kocamaz, L. E. Navarro-Serment, and M. Hebert,
“Map-supervised road detection,” in 2016 IEEE Intelligent Vehicles
Symposium (IV). IEEE, 2016, pp. 118–123.
[15] Y. Lyu, L. Bai, and X. Huang, “Road segmentation using cnn and
distributed lstm,” in 2019 IEEE International Symposium on Circuits and
Systems (ISCAS). IEEE, 2019, pp. 1–5.
[16] M. Liu and H. Yin, “Feature pyramid encoding network for real-time
semantic segmentation,” in British Machine Vision Conference 2018,
BMVC, 2019.
[17] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi,
“Espnet: Efficient spatial pyramid of dilated convolutions for semantic
segmentation,” in Proceedings of the european conference on computer
vision (ECCV), 2018, pp. 552–568.
[18] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi, “Espnetv2: A
light-weight, power efficient, and general purpose convolutional neural
network,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2019, pp. 9190–9200.
[19] G. Li and J. Kim, “Dabnet: Depth-wise asymmetric bottleneck for real-
time semantic segmentation,” in British Machine Vision Conference 2018,
BMVC, 2019.
[20] H. Li, P. Xiong, H. Fan, and J. Sun, “Dfanet: Deep feature aggregation
for real-time semantic segmentation,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2019, pp. 9522–9531.
[21] R. P. K. Poudel, U. Bonde, S. Liwicki, and C. Zach, “Contextnet:
Exploring context and detail for semantic segmentation in real-time,” in
British Machine Vision Conference 2018, BMVC, 2018, p. 146.
[22] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet:
Bilateral segmentation network for real-time semantic segmentation,” in
Proceedings of the European conference on computer vision (ECCV),
2018, pp. 325–341.
[23] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic
segmentation on high-resolution images,” in Proceedings of the European
Conference on Computer Vision (ECCV), 2018, pp. 405–420.
[24] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
network,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2017, pp. 2881–2890.
[25] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “Bisenet
v2: Bilateral network with guided aggregation for real-time semantic
segmentation,” arXiv preprint arXiv:2004.02147, 2020.
[26] G. Dong, Y. Yan, C. Shen, and H. Wang, “Real-time high-performance
semantic image segmentation of urban street scenes,” IEEE Transactions
on Intelligent Transportation Systems, pp. 1–17, 2020.
[27] Q. Tang, F. Liu, J. Jiang, and Y. Zhang, “Attention-guided
chained context aggregation for semantic segmentation,” arXiv preprint
arXiv:2002.12041, 2020.
[28] Z. Zhang and K. Zhang, “Farsee-net: Real-time semantic segmentation
by efficient multi-scale context aggregation and feature space super-
resolution,” arXiv, pp. arXiv–2003, 2020.
[29] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2015, pp. 3431–3440.
[30] S. Liu, H. Fan, X. Niu, H.-c. Ng, Y. Chu, and W. Luk, “Optimizing
cnn-based segmentation with deeply customized convolutional and decon-
volutional architectures on fpga,” ACM Transactions on Reconfigurable
Technology and Systems (TRETS), vol. 11, no. 3, pp. 1–22, 2018.
[31] Y. Lyu, L. Bai, and X. Huang, “Real-time road segmentation using lidar
data processing on an fpga,” in 2018 IEEE International Symposium on
Circuits and Systems (ISCAS). IEEE, 2018, pp. 1–5.
[32] ——, “Chipnet: Real-time lidar processing for drivable region segmen-
tation on an fpga,” IEEE Transactions on Circuits and Systems I: Regular
Papers, vol. 66, no. 5, pp. 1769–1779, 2018.
[33] S. Liu and W. Luk, “Towards an efficient accelerator for dnn-based
remote sensing image segmentation on fpgas,” in 2019 29th Interna-
tional Conference on Field Programmable Logic and Applications (FPL).
IEEE, 2019, pp. 187–193.
[34] L. Bai, Y. Lyu, and X. Huang, “A unified hardware architecture for
convolutions and deconvolutions in cnn,” in 2020 IEEE International
Symposium on Circuits and Systems (ISCAS). IEEE, 2020, pp. 1–5.
[35] J. Shen, D. Wang, Y. Huang, M. Wen, and C. Zhang, “Scale-out
acceleration for 3d cnn-based lung nodule segmentation on a multi-
fpga system,” in Proceedings of the 56th Annual Design Automation
Conference 2019, 2019, pp. 1–6.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[37] X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proceedings of the thirteenth
international conference on artificial intelligence and statistics, 2010, pp.
249–256.
[38] L. Sifre and S. Mallat, “Rigid-motion scattering for image classification,”
Ph.D. dissertation, 2014.
[39] F. Chollet, “Xception: Deep learning with depthwise separable convo-
lutions,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2017, pp. 1251–1258.
[40] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convo-
lutional neural networks for mobile vision applications,” arXiv preprint
arXiv:1704.04861, 2017.
[41] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
2018, pp. 4510–4520.
[42] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated
convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[43] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on
point sets for 3d classification and segmentation,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017, pp.
652–660.
[44] BatchNorm after ReLU, 2016 (accessed May 3, 2020). [Online].
Available: https://github.com/gcr/torch-residual-networks/issues/5
[45] “Qkeras: a quantization deep learning library for keras,”
https://github.com/google/qkeras, accessed: 2019-12-06.
[46] Y. Fu, E. Wu, A. Sirasao, A. Attia, K. Khan, and R. Wittig, “Deep
learning with int8 optimization on xilinx devices,” White Paper WP486,
Xilinx, 2017.
[47] OpenCV, “Open source computer vision library,” 2015.
[48] C. C. T. Mendes, V. Frémont, and D. F. Wolf, “Exploiting fully
convolutional neural networks for fast road detection,” in 2016 IEEE
International Conference on Robotics and Automation (ICRA). IEEE,
2016, pp. 3174–3179.
[49] S. Zhang, Z. Zhang, L. Sun, and W. Qin, “One for all: A mutual
enhancement method for object detection and semantic segmentation,”
Applied Sciences, vol. 10, no. 1, p. 13, 2020.
[50] F. A. Reis, R. Almeida, E. Kijak, S. Malinowski, S. J. F. Guimarães,
and Z. K. do Patrocínio, “Combining convolutional side-outputs for road
image segmentation,” in 2019 International Joint Conference on Neural
Networks (IJCNN). IEEE, 2019, pp. 1–8.
[51] R. Fan, Y. Wang, L. Qiao, R. Yao, P. Han, W. Zhang, I. Pitas, and
M. Liu, “Pt-resnet: Perspective transformation-based residual network for
semantic road image segmentation,” arXiv preprint arXiv:1910.13055,
2019.
[52] D. Levi, N. Garnett, E. Fetaya, and I. Herzlyia, “Stixelnet: A deep
convolutional network for obstacle detection and road segmentation.” in
BMVC, 2015, pp. 109–1.
[53] T. Kühnl, F. Kummert, and J. Fritsch, “Spatial ray features for real-
time ego-lane extraction,” in 2012 15th International IEEE Conference
on Intelligent Transportation Systems. IEEE, 2012, pp. 288–293.
[54] M. Oeljeklaus, F. Hoffmann, and T. Bertram, “A fast multi-task cnn
for spatial understanding of traffic scenes,” in 2018 21st International
Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018,
pp. 2825–2830.
[55] M. Passani, J. J. Yebes, and L. M. Bergasa, “Fast pixelwise road
inference based on uniformly reweighted belief propagation,” in 2015
IEEE Intelligent Vehicles Symposium (IV). IEEE, 2015, pp. 519–524.
[56] L. Xiao, B. Dai, D. Liu, D. Zhao, and T. Wu, “Monocular road detec-
tion using structured random forest,” International Journal of Advanced
Robotic Systems, vol. 13, no. 3, p. 101, 2016.
[57] M. Passani, J. J. Yebes, and L. M. Bergasa, “Crf-based semantic labeling
in miniaturized road scenes,” in 17th International IEEE Conference on
Intelligent Transportation Systems (ITSC). IEEE, 2014, pp. 1902–1903.
[58] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, “Road scene
segmentation from a single image,” in European Conference on Computer
Vision. Springer, 2012, pp. 376–389.
