PointNet on FPGA for Real-Time LiDAR Point Cloud Processing by Bai, Lin et al.
Lin Bai, Yecheng Lyu, Xin Xu and Xinming Huang
Worcester Polytechnic Institute
Worcester, MA 01609, USA
{lbai2, ylyu, xxu10, xhuang}@wpi.edu
Abstract—LiDAR sensors have been widely used in many
autonomous vehicle modalities, such as perception, mapping,
and localization. This paper presents an FPGA-based deep
learning platform for real-time point cloud processing targeting
autonomous vehicles. The software driver for the Velodyne
LiDAR sensor is modified and moved into the on-chip processor
system, while the programmable logic is designed as a customized
hardware accelerator. As a state-of-the-art deep learning
algorithm for point cloud processing, PointNet is successfully
implemented on the proposed FPGA platform. Targeting a
Xilinx Zynq UltraScale+ MPSoC ZCU104 development board,
the FPGA implementations of PointNet achieve a computing
performance of 182.1 GOPS and 280.0 GOPS for classification
and segmentation respectively. The proposed design can support
an input up to 4096 points per frame. The processing time is
19.8 ms for classification and 34.6 ms for segmentation, which
meets the real-time requirement for most of the existing LiDAR
sensors.
I. INTRODUCTION
LiDAR plays an important role in autonomous vehicle
systems because of advantages such as its ability to capture
3D information and its independence from ambient light.
One or more LiDAR sensors are often installed on an
autonomous vehicle for perception [1], mapping [2], and
localization [3]. One major challenge for a LiDAR system
is real-time point cloud processing.
In general, point cloud neural networks can be divided
into three categories: pixel-based approaches, voxel-based
approaches and 3D point-based approaches. Pixel-based
methods project the 3D point cloud into 2D, either a Bird's
Eye View (BEV) [4] or a front view [5], so that deep neural
networks for 2D images can be applied directly. Voxel-based
methods partition the 3D space into a voxel grid and use
neural networks to extract features from this grid. Both
of these methods may lead to information loss. The 3D
point-based method, in contrast, takes the raw point cloud
directly as input in the form of (X, Y, Z, I). It does not
need the complicated statistical operations of the voxel-based
method, and it loses less information than the pixel-based
method. 3D point-based methods such as PointNet [6] can
produce higher accuracy and have therefore attracted
considerable research attention.
In this paper, we propose an FPGA platform for PointNet
implementations. Because of its heterogeneous architecture,
the Xilinx Zynq SoC can run the software driver for the LiDAR
interface on its Processing System (PS) side and host the
customized hardware accelerator on the Programmable Logic
(PL) side. Data are transferred between PS and PL via Direct
Memory Access (DMA).
II. RELATED WORK
Several previous works [7][8] focused on accelerating
matrix multiplication using one-dimensional systolic arrays
on FPGA, which achieved efficient resource usage and low
bandwidth. More recent architectures [9][10] for neural
networks take advantage of a Single Instruction Multiple
Data (SIMD) structure for matrix multiplication. Continental
AG released the Assisted & Automated Driving Control Unit
(ADCU) [11], which carries a Zynq UltraScale+ MPSoC chip.
It supports LiDAR processing, but no technical details
were revealed. NVIDIA proposed the DRIVE AGX self-driving
computer platforms built on the Xavier SoC, which is capable
of processing point cloud data received from a LiDAR. In [12]
and [13], the LiDAR was connected to a PC via Ethernet, and
after pre-processing on the PC, feature maps were fed to a
neural network accelerator in an FPGA.
The contributions of this paper are summarized as follows:
1) To our knowledge, this is one of the first end-to-end
FPGA-based platforms for point cloud deep learning.
A LiDAR is connected to the PS side directly via Ethernet.
After pre-processing by the LiDAR driver, the point cloud is
stored in DDR memory that is accessible to the hardware
accelerator on the PL side.
2) More specifically, PointNet has been implemented on
this platform as an example of a point cloud deep learning
algorithm. As a state-of-the-art deep neural network
for point cloud processing, PointNet is the backbone
of many recent works on 3D classification and
segmentation. Based on this, one can easily extend our
implementation of the PointNet accelerator to other neural
networks.
3) A scalable SIMD matrix multiplication architecture is
proposed, which is capable of processing matrices of
arbitrary size. This accelerator is able to process point
clouds with an arbitrary number of points and generate
output in row order or column order. For an input of
4096 points per frame, the accelerator achieves a speed
of 50.5 and 28.9 frames per second for classification
and segmentation, respectively. Considering that most
LiDARs scan at 10 Hz, this accelerator fulfills the
real-time processing requirement.
Fig. 1. PointNet for point cloud classification and segmentation
arXiv:2006.00049v1 [eess.SP] 29 May 2020
The rest of the paper is organized as follows. The function
of the LiDAR driver is described in Section III. After that, the
structure of PointNet is introduced in Section IV. Sections V
and VI describe the hardware optimization techniques and
architecture. The evaluation results and analysis are given in
Section VII. Finally, we conclude the paper in Section VIII.
III. POINT CLOUD PRE-PROCESSING
Before being fed into PointNet, the raw data from the LiDAR need
pre-processing. We modified the driver for the embedded ARM
processor on our FPGA platform. Pre-processing includes the
following operations:
1) Coordinate Transformation: The LiDAR scans the physical
world in spherical coordinates, while PointNet requires
input data in Cartesian coordinates.
2) Time Offset: The distance to an object is measured
by the time difference between emitting a laser pulse and
receiving its reflection from the object. However, during this
short round-trip time, the LiDAR rotates by a small angle. This
leads to a time offset, which the LiDAR driver should
compensate.
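The two pre-processing steps above can be sketched in a few lines. This is a minimal illustration, not the modified Velodyne driver: the angle convention (azimuth measured clockwise from the +y axis, elevation measured upward from the x-y plane) and the function names are assumptions, and the time-offset model simply advances the azimuth by the angle swept during a given delay.

```python
import math

def spherical_to_cartesian(distance_m, azimuth_deg, elevation_deg):
    """Convert one LiDAR return to (x, y, z).

    Assumed convention (not stated in the text): azimuth is measured
    clockwise from the +y axis, elevation upward from the x-y plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    horizontal = distance_m * math.cos(el)  # projection onto the x-y plane
    x = horizontal * math.sin(az)
    y = horizontal * math.cos(az)
    z = distance_m * math.sin(el)
    return x, y, z

def compensate_azimuth(azimuth_deg, rotation_hz, time_offset_s):
    """Advance the azimuth by the angle the sensor sweeps in time_offset_s."""
    swept_deg = 360.0 * rotation_hz * time_offset_s
    return (azimuth_deg + swept_deg) % 360.0

# A return at 10 m, azimuth 90 degrees, elevation 0 lies on the +x axis.
print(spherical_to_cartesian(10.0, 90.0, 0.0))
# At 10 Hz the sensor sweeps 3.6 degrees per millisecond.
print(compensate_azimuth(359.0, 10.0, 0.001))
```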
IV. POINTNET
PointNet [6] is a state-of-the-art deep neural network algorithm
that was developed for point cloud classification and
segmentation. Unlike ordinary neural networks that take tensors
as input, PointNet’s input is an n × 3 matrix, where n is the
number of points and 3 represents the position (X, Y, Z)
of one point in Cartesian coordinates. The PointNet architecture
is shown in Fig. 1, where the shared Multi-Layer Perceptron
(MLP) is mathematically a 1 × 1 convolution. PointNet can
therefore be realized by fully connected layers with branches.
The transformation structures are illustrated in Fig. 2 with an
input matrix n × M, where M = 3 for the input transform and
M = 64 for the feature transform.
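To make the equivalence concrete, the sketch below applies one shared MLP layer as a plain matrix multiplication on an n × 3 input and then takes a column-wise maximum, as the max pooling in PointNet does. The tiny weight matrix and the helper names are made up for illustration; real PointNet layers are much wider.

```python
def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for a_row in a]

def shared_mlp(points, weights, bias):
    """Apply the same linear layer plus ReLU to every point (a 1 x 1 convolution)."""
    out = matmul(points, weights)
    return [[max(0.0, v + b) for v, b in zip(row, bias)] for row in out]

def global_max_pool(features):
    """Take the maximum of each feature column over all points."""
    return [max(col) for col in zip(*features)]

points = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, -3.0]]  # n = 3 points
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                 # 3 inputs, 2 features
bias = [0.0, 0.5]

features = shared_mlp(points, weights, bias)
print(global_max_pool(features))
# Reordering the points does not change the pooled result.
print(global_max_pool(shared_mlp(points[::-1], weights, bias)))
```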
As one of the most well-known deep learning algorithms
for point cloud processing, PointNet is widely used as the
backbone of many state-of-the-art neural networks, not only for
classification and segmentation but also for object detection.
For instance, PointNet is used in PointFusion [14] and
Attentional PointNet [15] to extract point-wise features and
global features for the object detection task. It is also used in
STD [16] and L3-Net [17].
Fig. 2. Transform structures in PointNet, where n × 3 is for input transform
and n × 64 for feature transform
V. OPTIMIZATION STRATEGY
A. Loop optimization
Fig. 3. Loop optimization for matrix multiplication
From a mathematical point of view, 1 × 1 convolution is
equivalent to matrix multiplication as shown in Fig. 3, which
consists of three nested loops. To fully utilize the parallel
processing capability of an FPGA, these loops need to be
optimized [10][18] to balance processing time and resource
usage.
Loop-1: It depends on how the matrix is stored. In PetaLinux
or in C code, the matrix is stored in row-major order. Because
the data are fetched using DMA, it is not advisable to unroll this loop.
Loop-2: Unrolling this loop determines how many times the
accelerator has to read the input feature map. Together with
loop-3, it is limited by the on-chip computation resources,
i.e. the DSP slices on the FPGA in our case.
Loop-3: Unrolling this loop increases the throughput of the
accelerator. However, it is restricted by the communication
bandwidth of the HP (high performance) interface between PS
and PL in Zynq. Partial unrolling of this loop leads to partial
sums, so an intermediate buffer becomes necessary. This buffer
can be merged into the output buffer at the cost of higher power
consumption owing to on-chip memory access.
As for the weight matrix obtained from training, it can be
pre-loaded into block RAM, so its storing and loading are
flexible.
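As a software analogy for the loop discussion above, the following sketch computes a matrix product tile by tile. The tile sizes stand in for unrolling factors: the inner tile body is what a hardware design would execute in parallel, and splitting the reduction dimension produces partial sums that are accumulated into the output. How these three tile dimensions map onto Loop-1, Loop-2 and Loop-3 of Fig. 3 is not assumed here, and the function name is illustrative.

```python
def blocked_matmul(a, b, tile_rows, tile_inner, tile_cols):
    """Compute a (n x k) times b (k x m) one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for r0 in range(0, n, tile_rows):            # rows of the input matrix
        for c0 in range(0, m, tile_cols):        # columns of the output matrix
            for k0 in range(0, k, tile_inner):   # slices of the reduction dimension
                # Tile body: the part that unrolling would run in parallel.
                for i in range(r0, min(r0 + tile_rows, n)):
                    for j in range(c0, min(c0 + tile_cols, m)):
                        partial = 0
                        for kk in range(k0, min(k0 + tile_inner, k)):
                            partial += a[i][kk] * b[kk][j]
                        c[i][j] += partial       # accumulate the partial sum
    return c

a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
b = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]]
# Tile sizes need not divide the matrix dimensions.
print(blocked_matmul(a, b, tile_rows=2, tile_inner=2, tile_cols=1))
```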
B. Quantization
The quantization-during-training method described in [13] is
adopted in this design. It provides a convenient solution for
quantization that avoids modifying the TensorFlow source
code. In this study, we quantized the PointNet parameters to
8-bit and 16-bit, respectively.
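For readers unfamiliar with fixed-point parameters, the sketch below shows a generic symmetric linear quantizer for 8-bit and 16-bit values. It is not the training-time method of [13], which is not reproduced here; the per-tensor scale and the function names are assumptions chosen only to illustrate the effect of the bit width.

```python
def quantize(values, bits):
    """Map floats to signed integers of the given width with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

weights = [-1.0, 0.3, 0.7, 1.0]
for bits in (8, 16):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(bits, q, worst)
```

The rounding error of each value is bounded by half a quantization step, so the 16-bit variant reproduces the parameters more closely at the cost of wider storage.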
VI. SYSTEM ARCHITECTURE OF POINTNET HARDWARE
ACCELERATOR
Based on the description in Section IV, all PointNet
operations can be categorized as either matrix multiplication
or max pooling. Therefore, the computing blocks involved
in the PointNet accelerator (Fig. 4) are a Processing Element (PE)
array for matrix multiplication, an adder array for partial sums,
and a comparator array for max pooling and ReLU (Rectified
Linear Unit). During inference, Batch Normalization (BN) is
absorbed into the PE. For feature map storage, double
buffering is applied to both the input buffer and the
weight buffer to boost throughput. The output buffer is
also designed as a two-stage buffer.
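One common way to absorb BN at inference time is to fold its per-channel scale and shift into the weights and bias of the preceding linear layer. The text does not say how the absorption is done in the PE, so the sketch below is an assumption that only shows why no separate BN block is needed; the function name and the eps default are illustrative.

```python
import math

def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into a linear layer.

    weights is k x m (inputs x outputs); bias, gamma, beta, mean and var
    each hold one value per output channel.
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    folded_w = [[w * s for w, s in zip(row, scale)] for row in weights]
    folded_b = [(b - mu) * s + bt
                for b, mu, s, bt in zip(bias, mean, scale, beta)]
    return folded_w, folded_b

w = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -0.5]
gamma, beta, mean, var = [2.0, 0.5], [0.1, 0.2], [1.0, -1.0], [4.0, 0.25]
folded_w, folded_b = fold_batch_norm(w, b, gamma, beta, mean, var)
print(folded_w, folded_b)
```

After folding, a single multiply-accumulate pass with the folded parameters gives the same result as the linear layer followed by BN.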
Fig. 4. Hardware architecture of PointNet accelerator
A. PE Array and Buffers
The PE consists of a multiplier array, a pipelined adder
tree and an adder array. According to our loop unrolling
method, loops 2 and 3 are both partially unrolled. The partial
unrolling of loop 2 determines the number of multipliers and the
size of the adder tree in each PE. The partial unrolling factor of
loop 3 is related to the size of the weight buffer and the number
of PEs in the array.
As indicated by ① and ② in Fig. 5, the PE array supports
both row-oriented output and column-oriented output by
applying different reading patterns to the input and weight buffers.
Fig. 5. Matrix multiplication by block
Double buffering is used for the input buffer and
the weight buffer, so that the imbalance between processing
throughput and HP port bandwidth is alleviated. In addition, a
2-stage output buffer is deployed right after the PE array. The
first stage holds partial sums during matrix multiplication. It has
a wider bitwidth than the second stage. The second stage
stores the final results and transfers data to DDR via DMA.
This structure is designed for two reasons. One is to avoid
the precision reduction introduced by matrix partitioning. The
other is to avoid frequent reading of the second-stage output
buffer, so that partial sum accumulation and data transfer to DDR
can proceed simultaneously.
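The precision argument for the wider first stage can be illustrated with a small numerical example. The sketch assumes that narrowing to the output width is done by saturation, which the text does not specify; the function names and the sample partial sums are hypothetical.

```python
def saturate(value, bits):
    """Clamp an integer to the signed range of the given bit width."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(lo, min(hi, value))

def accumulate_wide(partial_sums, out_bits):
    """Keep full precision while accumulating, then narrow once at the end."""
    return saturate(sum(partial_sums), out_bits)

def accumulate_narrow(partial_sums, out_bits):
    """Narrow after every partial sum (shown only for contrast)."""
    acc = 0
    for p in partial_sums:
        acc = saturate(acc + p, out_bits)
    return acc

partials = [100, 100, -150]            # partial sums from three matrix blocks
print(accumulate_wide(partials, 8))    # 50: the exact sum fits in 8 bits
print(accumulate_narrow(partials, 8))  # -23: an intermediate value was clipped
```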
Fig. 6. Structure of the processing element and the buffers
B. Max-pooling and ReLU
In this PointNet accelerator, max-pooling and ReLU share
one comparator array. Max-pooling finds the largest
value of each feature, that is, of each column of the matrix. In order
to merge max-pooling into the matrix multiplication pipeline, the
output pattern of the PE array is changed from row-by-row to
column-by-column. The ReLU function compares the
results with 0 to filter out the negative values. Besides ReLU,
similar functions such as ReLU6 are also supported.
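The sharing of one comparator between the two functions can be mimicked in software: both max-pooling and ReLU reduce to comparing two values and keeping one of them. The sketch below is a behavioral illustration under that reading, with hypothetical function names, and it processes the matrix column by column as the text describes.

```python
def compare_max(a, b):
    """One comparator: return the larger operand."""
    return a if a >= b else b

def relu(x):
    """ReLU is a comparison against 0."""
    return compare_max(x, 0)

def relu6(x):
    """ReLU6 additionally caps the result at 6."""
    y = relu(x)
    return 6 if y >= 6 else y

def max_pool_columns(matrix):
    """Running maximum of each column, consumed column by column."""
    pooled = []
    for column in zip(*matrix):
        best = column[0]
        for value in column[1:]:
            best = compare_max(best, value)
        pooled.append(best)
    return pooled

features = [[1, -2], [0, 7], [3, 4]]   # 3 points, 2 features
print(max_pool_columns(features))
print([relu(v) for v in (-3, 5)], [relu6(v) for v in (-1, 4, 9)])
```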
C. Operation Control
Prior to a run, the ARM core sends configurations to the register
file block, including the number of points and the destination
buffer. The matrix multiplication patterns are also pre-defined
and loaded into the register file. They determine how the result
comes out (row-oriented or column-oriented) and whether
the result is sent to DDR or to the input buffer for the next
operation. According to the configurations read from the register
file, an FSM (finite state machine) sends control signals to each
block. Double buffering enables this accelerator to accept new
weights or inputs during processing. The FSM also handles the
assignment of the two buffers, one for receiving and the other for
sending. The use of the register file speeds up the processing:
by pre-loading all needed parameters into the accelerator, no
interrupt-based configuration mechanism is necessary, which
avoids the slowdown due to interrupt handling in PetaLinux.
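The control scheme can be pictured as a list of pre-loaded operation descriptors that a state machine walks through without software intervention. The sketch below is purely illustrative: the field names, the two-buffer numbering and the trace format are hypothetical and do not describe the actual register map.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperationConfig:
    """One pre-loaded entry (hypothetical fields, not the real register map)."""
    num_points: int
    output_order: str   # "row" or "column"
    destination: str    # "ddr" or "input_buffer"

def run_schedule(configs):
    """Step through the configurations and swap the two buffers each step."""
    trace = []
    receiving = 0                      # buffer currently being filled
    for step, cfg in enumerate(configs):
        if cfg.output_order not in ("row", "column"):
            raise ValueError("unknown output order: " + cfg.output_order)
        if cfg.destination not in ("ddr", "input_buffer"):
            raise ValueError("unknown destination: " + cfg.destination)
        processing = 1 - receiving     # the other buffer feeds the PE array
        trace.append((step, processing, receiving,
                      cfg.output_order, cfg.destination))
        receiving = processing         # roles swap for the next operation
    return trace

schedule = [
    OperationConfig(4096, "row", "input_buffer"),
    OperationConfig(4096, "column", "ddr"),
]
for entry in run_schedule(schedule):
    print(entry)
```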
VII. IMPLEMENTATION RESULTS
The accelerator is designed using Simulink and the HDL
Coder toolbox. The evaluation platform is the Xilinx Zynq
UltraScale+ MPSoC ZCU104 Development Kit. When operating in
64-bit mode, the maximum bandwidth of the DDR is 102.4 Gbps
at 800 MHz. The LiDAR mounted on this platform is a Velodyne
VLP-16.
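The 102.4 Gbps figure is consistent with a 64-bit interface clocked at 800 MHz if data are transferred on both clock edges. The factor of two is an assumption based on how double data rate memory normally works and is not stated in the text; the function name is illustrative.

```python
def ddr_bandwidth_gbps(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Peak bandwidth in Gbps; transfers_per_clock=2 assumes double data rate."""
    return bus_width_bits * clock_mhz * transfers_per_clock / 1000.0

print(ddr_bandwidth_gbps(64, 800))  # 102.4
```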
Fig. 7 presents the test setup of the LiDAR processing
framework. The PetaLinux operating system runs on the
ARM core on the PS side. The point cloud is received via the
Ethernet interface using UDP. After being processed by
the Velodyne driver, the point cloud within the ROI (region of
interest), in Cartesian coordinates, is written to DDR memory. Then
the ARM core loads the setting parameters for the hardware accelerator
through the General Purpose (GP) port based on the AXI-Lite protocol.
During execution, the accelerator loads or stores the point cloud
(or intermediate data) in DDR by DMA via the High Performance
(HP) port according to the AXI Stream protocol.
Fig. 7. Overview of the LiDAR processing framework
As described in the previous sections, the matrix
multiplication pattern is pre-loaded into configuration memory.
Therefore, this design is able to implement the full PointNet
or PointNet-vanilla, which is a simplified PointNet without
transforms, for the classification or segmentation task. The
maximum number of points supported in this design is 4096. A larger
point cloud can be fed into this design after partitioning. For
autonomous driving applications, a Velodyne VLP-16 LiDAR
running at 10 Hz supplies around 360/0.2 × 16 = 28.8K points
in each point cloud. Considering that the commonly used Region Of
Interest (ROI) is a 20 m × 60 m area in front of the vehicle
(less than 1/6), the number of points in the ROI is less than 4096.
For a high-resolution LiDAR such as the Velodyne HDL-64E, the total
number of points in the ROI is much larger than 4096. To be fed into
this accelerator, the point cloud can be sub-sampled or partitioned.
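The point-count arithmetic and the two ways of fitting a larger cloud into the 4096-point limit can be sketched as follows. The 0.2 degree azimuth step and the 16 channels come from the text; the helper names and the evenly spaced sub-sampling rule are assumptions, since the text does not say how sub-sampling or partitioning is performed.

```python
MAX_POINTS = 4096  # largest input supported by the accelerator

def points_per_scan(azimuth_step_deg, channels):
    """Number of returns in one full rotation."""
    return round(360.0 / azimuth_step_deg) * channels

def partition(points, max_points=MAX_POINTS):
    """Split a point list into consecutive chunks of at most max_points."""
    return [points[i:i + max_points] for i in range(0, len(points), max_points)]

def subsample(points, max_points=MAX_POINTS):
    """Keep at most max_points evenly spaced points (one possible rule)."""
    if len(points) <= max_points:
        return list(points)
    step = len(points) / max_points
    return [points[int(i * step)] for i in range(max_points)]

print(points_per_scan(0.2, 16))         # 28800 points for a VLP-16 rotation
cloud = list(range(10000))              # stand-in for a 10000-point cloud
print([len(chunk) for chunk in partition(cloud)])
print(len(subsample(cloud)))
```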
Table I summarizes the on-chip resource consumption
when choosing matrix dimension sizes of M = 32 and N = 32
in Fig. 5. In practical applications, users can choose a suitable
bitwidth based on the available FPGA resources and the
processing speed requirement.
TABLE I
FPGA RESOURCE CONSUMPTION OF POINTNET
Width   LUT           FF            DSP          BRAM        URAM
INT8    19530 (8%)    36010 (8%)    1026 (60%)   114 (37%)   48 (50%)
INT16   30933 (13%)   60412 (13%)   1026 (60%)   123 (39%)   96 (100%)
Table II compares the throughput and processing time
for different quantization bit widths. The PointNet
accelerator takes 19.8 ms and 34.6 ms to classify and segment
a point cloud with 4096 points, respectively, when using INT8
quantization. Considering that most LiDARs scan at 10 Hz,
this PointNet accelerator is able to work in real time.
TABLE II
COMPARISON OF PERFORMANCE
                                  Throughput (GOPS)    Processing time (ms)
Networks                          int8      int16      int8      int16
PointNet-vanilla classification   112.5     64.9       10.9      18.9
Point-classification              182.1     130.0      19.8      27.8
Point-segmentation                280.0     227.4      34.6      42.6
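The frame rates quoted in the contributions follow directly from the INT8 processing times, and multiplying throughput by processing time gives the implied operation count per frame. The sketch below only repeats this arithmetic on the reported numbers; the function names are illustrative.

```python
def frames_per_second(processing_time_ms):
    """Frame rate implied by a per-frame processing time."""
    return 1000.0 / processing_time_ms

def operations_per_frame_gop(throughput_gops, processing_time_ms):
    """Operations per frame (in giga-operations) implied by the two figures."""
    return throughput_gops * processing_time_ms / 1000.0

for name, gops, time_ms in (("classification", 182.1, 19.8),
                            ("segmentation", 280.0, 34.6)):
    print(name,
          round(frames_per_second(time_ms), 1),
          round(operations_per_frame_gop(gops, time_ms), 2))
```

Both frame rates are above the 10 Hz scan rate mentioned in the text, which is the basis of the real-time claim.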
VIII. CONCLUSIONS
In this paper, an FPGA-based LiDAR processing platform is
proposed to accelerate point cloud deep learning algorithms.
More specifically, a scalable PointNet hardware accelerator
has been implemented on the FPGA SoC platform. For
classification of an input frame with 4096 points, it takes only
19.8 ms, reaching an estimated performance of about 182.1
GOPS. For the segmentation task, it takes 34.6 ms per frame at a
performance of about 280 GOPS. In addition, the design leaves
some resource margin, so one can easily extend it to more
advanced detection neural networks such as PointFusion [14]
and Attentional PointNet [15], or other neural
networks like STD [16] and L3-Net [17].
ACKNOWLEDGMENT
This work was supported by The MathWorks, Inc.
REFERENCES
[1] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud
based 3d object detection,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 4490–4499, 2018.
[2] D. Droeschel and S. Behnke, “Efficient continuous-time slam for 3d
lidar-based online mapping,” in 2018 IEEE International Conference on
Robotics and Automation (ICRA), pp. 1–9, IEEE, 2018.
[3] H. Yin, Y. Wang, X. Ding, L. Tang, S. Huang, and R. Xiong, “3d
lidar-based global localization using siamese neural network,” IEEE
Transactions on Intelligent Transportation Systems, 2019.
[4] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint
3d proposal generation and object detection from view aggregation,”
in 2018 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pp. 1–8, IEEE, 2018.
[5] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object
detection network for autonomous driving,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1907–
1915, 2017.
[6] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning
on point sets for 3d classification and segmentation,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 652–660, 2017.
[7] J. Shen, Y. Qiao, Y. Huang, M. Wen, and C. Zhang, “Towards a multi-
array architecture for accelerating large-scale matrix multiplication on
fpgas,” in 2018 IEEE International Symposium on Circuits and Systems
(ISCAS), pp. 1–5, 2018.
[8] G. Wu, Y. Dou, and M. Wang, “High performance and memory efficient
implementation of matrix multiplication on fpgas,” in 2010 International
Conference on Field-Programmable Technology, pp. 134–137, 2010.
[9] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
fpga-based accelerator design for deep convolutional neural networks,”
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, pp. 161–170, 2015.
[10] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang,
N. Xu, S. Song, et al., “Going deeper with embedded fpga platform for
convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pp. 26–
35, 2016.
[11] Continental AG, “Assisted & automated driving control unit.”
https://www.continental-automotive.com/en-gl/Landing-Pages/CAD/Automated-Driving/Enablers/Assisted-Automated-Driving-Control-Unit.
[12] Y. Lyu, L. Bai, and X. Huang, “Real-time road segmentation using lidar
data processing on an fpga,” in 2018 IEEE International Symposium on
Circuits and Systems (ISCAS), pp. 1–5, IEEE, 2018.
[13] Y. Lyu, L. Bai, and X. Huang, “Chipnet: Real-time lidar processing for
drivable region segmentation on an fpga,” IEEE Transactions on Circuits
and Systems I: Regular Papers, vol. 66, no. 5, pp. 1769–1779, 2018.
[14] D. Xu, D. Anguelov, and A. Jain, “Pointfusion: Deep sensor fusion for
3d bounding box estimation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 244–253, 2018.
[15] A. Paigwar, O. Erkent, C. Wolf, and C. Laugier, “Attentional pointnet
for 3d-object detection in point clouds,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops,
2019.
[16] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparse-to-dense 3d
object detector for point cloud,” in Proceedings of the IEEE International
Conference on Computer Vision, 2019.
[17] W. Lu, Y. Zhou, G. Wan, S. Hou, and S. Song, “L3-net: Towards
learning based lidar localization for autonomous driving,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 6389–6398, 2019.
[18] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing loop oper-
ation and dataflow in fpga acceleration of deep convolutional neural
networks,” in Proceedings of the 2017 ACM/SIGDA International Sym-
posium on Field-Programmable Gate Arrays, pp. 45–54, 2017.
