Multi-Modal 3D Object Detection in Autonomous Driving: a Survey
In the past few years, we have witnessed rapid development of autonomous
driving. However, achieving full autonomy remains a daunting task due to the
complex and dynamic driving environment. As a result, self-driving cars are
equipped with a suite of sensors to conduct robust and accurate environment
perception. As the number and type of sensors keep increasing, combining them
for better perception is becoming a natural trend. So far, however, there has
been no in-depth review focused on multi-sensor-fusion-based perception. To
bridge this gap and motivate future research, this survey is devoted to
reviewing recent deep learning models for fusion-based 3D detection that
leverage multiple sensor data sources, especially cameras and LiDARs. In this survey, we first introduce
the background of popular sensors for autonomous cars, including their common
data representations as well as object detection networks developed for each
type of sensor data. Next, we discuss some popular datasets for multi-modal 3D
object detection, with a special focus on the sensor data included in each
dataset. Then we present in-depth reviews of recent multi-modal 3D detection
networks by considering the following three aspects of the fusion: fusion
location, fusion data representation, and fusion granularity. After a detailed
review, we discuss open challenges and point out possible solutions. We hope
that our detailed review can help researchers embark on investigations in the
area of multi-modal 3D object detection.
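The survey's taxonomy distinguishes where in the pipeline fusion happens. A toy illustration of the two extremes, early (feature-level) versus late (decision-level) fusion (all names and numbers here are illustrative, not taken from any surveyed model):

```python
# Toy illustration of fusion location: early (feature-level) vs late
# (decision-level) fusion of camera and LiDAR information. All names
# and values are illustrative, not from any surveyed model.

def early_fusion(cam_feat, lidar_feat):
    """Concatenate per-object features before a shared detection head."""
    return cam_feat + lidar_feat  # one fused feature vector

def late_fusion(cam_score, lidar_score, w=0.5):
    """Blend per-modality detection confidences after separate heads."""
    return w * cam_score + (1 - w) * lidar_score

fused = early_fusion([0.2, 0.4], [0.1, 0.3])  # fused feature vector
score = late_fusion(0.9, 0.7)                 # blended confidence
```

Fusion data representation and granularity vary independently of this choice, which is why the survey treats them as separate axes.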
PI-RCNN: An Efficient Multi-sensor 3D Object Detector with Point-based Attentive Cont-conv Fusion Module
LiDAR point clouds and RGB images are both essential for 3D object detection,
so many state-of-the-art 3D detection algorithms are dedicated to fusing these
two types of data effectively. However, fusion methods based on the Bird's Eye
View (BEV) or voxel format are not accurate. In this paper, we propose a novel
fusion approach named the Point-based Attentive Cont-conv Fusion (PACF) module,
which fuses multi-sensor features directly on 3D points.
In addition to continuous convolution, we add Point-Pooling and Attentive
Aggregation operations to make the fused features more expressive. Moreover,
building on the PACF module, we propose a 3D multi-sensor multi-task network
called Pointcloud-Image RCNN (PI-RCNN for short), which handles both image
segmentation and 3D object detection. PI-RCNN employs a segmentation
sub-network to extract full-resolution semantic feature maps from images and
then fuses the multi-sensor features via the PACF module. Benefiting from the
effectiveness of the PACF module and the expressive semantic features from the
segmentation module, PI-RCNN achieves substantial improvements in 3D object
detection. We demonstrate the effectiveness of the PACF module and PI-RCNN on
the KITTI 3D Detection benchmark, where our method achieves state-of-the-art
results on the 3D AP metric. Comment: 8 pages, 5 figures
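The core idea of point-based fusion can be sketched in a few lines: each 3D point is projected into the image to gather a semantic feature, and an attention weight decides how much of it to mix into the point feature. The projection model and weighting scheme below are simplified placeholders, not the paper's actual PACF module:

```python
import math

# Hedged sketch of point-wise attentive fusion in the spirit of PACF:
# a 3D point gathers an image feature via pinhole projection, then a
# sigmoid attention weight mixes it into the point feature. These
# camera intrinsics and the weighting rule are illustrative
# assumptions, not the paper's module.

def project(point, fx=700.0, fy=700.0, cx=320.0, cy=240.0):
    """Pinhole projection of a 3D point (x, y, z) to pixel coords."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy

def attentive_fuse(point_feat, image_feat):
    """Mix point and image features with a sigmoid attention weight."""
    score = sum(p * i for p, i in zip(point_feat, image_feat))
    alpha = 1.0 / (1.0 + math.exp(-score))  # attention weight in (0, 1)
    return [p + alpha * i for p, i in zip(point_feat, image_feat)]

u, v = project((1.0, 0.5, 10.0))          # pixel location of the point
fused = attentive_fuse([0.5, -0.2], [1.0, 0.4])
```

Operating directly on points like this avoids the quantization loss that BEV- or voxel-based fusion introduces, which is the paper's motivation.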
MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving
3D object detection is a significant task for autonomous driving. Recently,
with the progress of vision transformers, the 2D object detection problem has
been treated with a set-to-set loss. Inspired by these 2D detection approaches
and by DETR3D, an approach for multi-view 3D object detection, we propose
MSF3DDETR, a Multi-Sensor Fusion 3D Detection Transformer architecture that
fuses image and LiDAR features to improve detection accuracy. Our end-to-end
single-stage, anchor-free and NMS-free network takes in multi-view images and
LiDAR point clouds and predicts 3D bounding boxes. Firstly, we link the object
queries learnt from data to the image and LiDAR features using a novel
MSF3DDETR cross-attention block. Secondly, the object queries interact with
each other in a multi-head self-attention block. Finally, the MSF3DDETR block
is repeated a number of times to refine the object queries. The MSF3DDETR
network is trained end-to-end on the nuScenes dataset using
Hungarian-algorithm-based bipartite matching and a set-to-set loss inspired by
DETR. We present both quantitative and qualitative results that are
competitive with state-of-the-art approaches. Comment: Accepted at the ICPR 2022 Workshop DLVDR202
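The set-to-set training mentioned above hinges on a one-to-one assignment of predictions to ground-truth boxes that minimizes total matching cost. Real implementations use the Hungarian algorithm (e.g. SciPy's linear_sum_assignment); the brute-force sketch below conveys the idea for tiny sets:

```python
from itertools import permutations

# Minimal sketch of DETR-style set-to-set matching: find the one-to-one
# assignment of predictions to ground-truth boxes that minimizes total
# cost. Production code uses the Hungarian algorithm; this brute-force
# version (factorial time) is only for illustration on tiny sets.

def match(cost):
    """cost[i][j]: cost of matching prediction i to ground truth j."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)  # best[i] = ground-truth index for prediction i

cost = [[0.1, 0.9],
        [0.8, 0.2]]
assignment = match(cost)  # prediction 0 -> gt 0, prediction 1 -> gt 1
```

The matched pairs then receive the regression and classification losses, while unmatched queries are trained to predict "no object", which is what makes the detector NMS-free.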
Sensor Fusion for Object Detection and Tracking in Autonomous Vehicles
Autonomous driving vehicles depend on their perception system to understand the environment and identify all static and dynamic obstacles surrounding the vehicle. The perception system in an autonomous vehicle uses the sensory data obtained from different sensor modalities to understand the environment and perform a variety of tasks such as object detection and object tracking. Combining the outputs of different sensors to obtain a more reliable and robust outcome is called sensor fusion. This dissertation studies the problem of sensor fusion for object detection and object tracking in autonomous driving vehicles and explores different approaches for utilizing deep neural networks to accurately and efficiently fuse sensory data from different sensing modalities.
In particular, this dissertation focuses on fusing radar and camera data for 2D and 3D object detection and object tracking tasks. First, the effectiveness of radar and camera fusion for 2D object detection is investigated by introducing a radar region proposal algorithm for generating object proposals in a two-stage object detection network. The evaluation results show significant improvement in speed and accuracy compared to a vision-based proposal generation method. Next, radar and camera fusion is used for the task of joint object detection and depth estimation, where the radar data is used in conjunction with image features not only to generate object proposals but also to provide accurate depth estimates for the detected objects in the scene. A fusion algorithm is also proposed for 3D object detection, where the depth and velocity data obtained from the radar is fused with the camera images to detect objects in 3D and accurately estimate their velocities without requiring any temporal information. Finally, radar and camera sensor fusion is used for 3D multi-object tracking by introducing an end-to-end trainable and online network capable of tracking objects in real time.
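The radar-region-proposal idea can be sketched simply: each radar return is projected into the image, and a 2D anchor is centered there with a size inversely proportional to range, since closer objects appear larger. The projection model and size heuristic below are illustrative assumptions, not the dissertation's exact algorithm:

```python
# Hedged sketch of radar-based region proposals: project each radar
# return into the image and center an anchor box there, sized inversely
# with range. The pinhole intrinsics and the size heuristic are
# illustrative assumptions, not the dissertation's exact method.

def radar_proposals(radar_points, fx=700.0, cx=320.0, cy=240.0, base=2000.0):
    """radar_points: list of (x, z) ground-plane returns, z = range in m.

    Returns (left, top, width, height) boxes in pixel coordinates.
    """
    proposals = []
    for x, z in radar_points:
        u = fx * x / z + cx          # image column of the radar return
        size = base / z              # anchor side shrinks with range
        proposals.append((u - size / 2, cy - size / 2, size, size))
    return proposals

boxes = radar_proposals([(0.0, 10.0), (2.0, 20.0)])  # two returns -> two anchors
```

Because radar supplies range directly, such proposals come with a depth estimate for free, which is what the joint detection-and-depth-estimation stage described above exploits.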
UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation
Jointly processing information from multiple sensors is crucial to achieving
accurate and robust perception for reliable autonomous driving systems.
However, current 3D perception research follows a modality-specific paradigm,
leading to additional computation overheads and inefficient collaboration
between different sensor data. In this paper, we present an efficient
multi-modal backbone for outdoor 3D perception named UniTR, which processes a
variety of modalities with unified modeling and shared parameters. Unlike
previous works, UniTR introduces a modality-agnostic transformer encoder to
handle these view-discrepant sensor data for parallel modal-wise representation
learning and automatic cross-modal interaction without additional fusion steps.
More importantly, to make full use of these complementary sensor types, we
present a novel multi-modal integration strategy by both considering
semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood
relations. UniTR is also a fundamentally task-agnostic backbone that naturally
supports different 3D perception tasks. It sets a new state-of-the-art
performance on the nuScenes benchmark, achieving +1.1 NDS higher for 3D object
detection and +12.0 higher mIoU for BEV map segmentation with lower inference
latency. Code will be available at https://github.com/Haiyang-W/UniTR. Comment: Accepted by ICCV202
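The key property claimed above, one encoder with shared parameters for all modalities, can be caricatured in a few lines: tokens from any sensor pass through the same weights in a single joint sequence, so cross-modal interaction needs no separate fusion step. The toy "encoder" below (a single shared linear weight per feature) merely stands in for UniTR's shared transformer blocks:

```python
# Hedged sketch of the modality-agnostic idea behind UniTR: tokens from
# every sensor run through one set of shared parameters in a single
# joint sequence. A real model uses shared transformer blocks; this toy
# encoder applies one shared weight per feature dimension instead.

SHARED_W = [0.5, 2.0]  # one parameter set reused for every modality

def encode(tokens):
    """Apply the shared weights to each token regardless of modality."""
    return [[w * x for w, x in zip(SHARED_W, t)] for t in tokens]

camera_tokens = [[1.0, 1.0]]
lidar_tokens = [[2.0, 0.5]]
# Same parameters, one joint sequence -> cross-modal interaction comes
# from processing both token sets together, with no extra fusion step.
joint = encode(camera_tokens + lidar_tokens)
```

Sharing parameters this way is what removes the per-modality backbones and their duplicated computation that the abstract identifies as the overhead of modality-specific pipelines.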
Complexer-YOLO: Real-Time 3D Object Detection and Tracking on Semantic Point Clouds
Accurate detection of 3D objects is a fundamental problem in computer vision
and has an enormous impact on autonomous cars, augmented/virtual reality and
many applications in robotics. In this work we present a novel fusion of neural
network based state-of-the-art 3D detector and visual semantic segmentation in
the context of autonomous driving. Additionally, we introduce
Scale-Rotation-Translation score (SRTs), a fast and highly parameterizable
evaluation metric for comparison of object detections, which speeds up our
inference time by up to 20% and halves training time. On top of that, we apply
state-of-the-art online multi-target feature tracking on the object
measurements to further increase accuracy and robustness by utilizing temporal
information. Our experiments on KITTI show that we achieve the same results as
the state of the art in all related categories while maintaining the
performance-accuracy trade-off and still running in real time. Furthermore,
our model is the first to fuse visual semantics with 3D object detection.
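A Scale-Rotation-Translation style score can be illustrated by comparing two boxes on translation distance, scale ratio, and heading difference, then blending the three terms with tunable weights. The exact terms and weights of the paper's SRTs metric may differ; this sketch only conveys the "fast, parameterizable combination" idea:

```python
import math

# Hedged sketch of a Scale-Rotation-Translation style detection score:
# three similarity terms (translation, scale, rotation) blended with
# tunable weights. The specific terms and defaults here are our own
# illustrative choices, not the paper's exact SRTs definition.

def srt_score(a, b, wt=0.4, ws=0.3, wr=0.3, max_dist=5.0):
    """a, b: dicts with 'center' (x, y), scalar 'size', 'yaw' in radians."""
    dist = math.hypot(a['center'][0] - b['center'][0],
                      a['center'][1] - b['center'][1])
    t = max(0.0, 1.0 - dist / max_dist)                        # translation term
    s = min(a['size'], b['size']) / max(a['size'], b['size'])  # scale term
    dyaw = abs(a['yaw'] - b['yaw']) % (2 * math.pi)
    r = 1.0 - min(dyaw, 2 * math.pi - dyaw) / math.pi          # rotation term
    return wt * t + ws * s + wr * r

same = srt_score({'center': (0, 0), 'size': 2.0, 'yaw': 0.0},
                 {'center': (0, 0), 'size': 2.0, 'yaw': 0.0})  # identical boxes
```

Scoring closed-form terms like these avoids the 3D box-overlap computation that makes IoU-based matching expensive, which is plausibly where the claimed speedup comes from.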