Optimized Gated Deep Learning Architectures for Sensor Fusion
Sensor fusion is a key technology that integrates various sensory inputs to
allow for robust decision making in many applications such as autonomous
driving and robot control. Deep neural networks have been adopted for sensor
fusion in a body of recent studies. Among these, the so-called netgated
architecture was proposed, which has demonstrated improved performance over
conventional convolutional neural networks (CNNs). In this paper, we address
several limitations of the baseline netgated architecture by proposing two
further optimized architectures: a coarser-grained gated architecture employing
(feature) group-level fusion weights and a two-stage gated architecture
leveraging both group-level and feature-level fusion weights. Using driving
mode prediction and human activity recognition datasets, we demonstrate the
significant performance improvements brought by the proposed gated
architectures and also their robustness in the presence of sensor noise and
failures.
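To make the gating idea concrete, here is a minimal sketch of group-level gated fusion in PyTorch: a small gating network scores each sensor group, and the softmaxed scores weight the groups before they are summed. Module names, dimensions, and the two-branch setup are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class GroupGatedFusion(nn.Module):
        """Illustrative group-level gated fusion (not the paper's code)."""
        def __init__(self, num_groups=2, feat_dim=64):
            super().__init__()
            # Gating network: concatenated group features -> one weight per group.
            self.gate = nn.Linear(num_groups * feat_dim, num_groups)

        def forward(self, group_feats):
            # group_feats: list of (batch, feat_dim) tensors, one per sensor group.
            stacked = torch.stack(group_feats, dim=1)                        # (B, G, D)
            weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)   # (B, G)
            # Weight each group's features and sum into a fused representation.
            return (stacked * weights.unsqueeze(-1)).sum(dim=1)              # (B, D)

A two-stage variant, as described above, would additionally apply a feature-level gate within each group before this group-level combination.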
HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection
Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera and lidar or camera and radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we focus on 2D object detection, a fundamental high-level task which is defined on the 2D image domain, and propose HRFuser, a multi-resolution sensor fusion architecture that scales straightforwardly to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. Even though cameras alone provide very informative features for 2D detection, we demonstrate via extensive experiments on the nuScenes and Seeing Through Fog datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art fusion methods for 2D detection both in normal and adverse conditions. The source code will be made publicly available.
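As a rough illustration of cross-attention-based fusion, the sketch below lets camera tokens query tokens from an additional modality with standard multi-head attention; HRFuser's multi-window partitioning and multi-resolution structure are omitted, and all names and dimensions are assumptions.

    import torch
    import torch.nn as nn

    class CrossAttentionFusion(nn.Module):
        """Simplified cross-attention fusion (windowing omitted)."""
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, cam_tokens, aux_tokens):
            # cam_tokens: (B, N, D) camera tokens used as queries.
            # aux_tokens: (B, M, D) tokens from e.g. a lidar or radar branch.
            fused, _ = self.attn(cam_tokens, aux_tokens, aux_tokens)
            # Residual connection keeps the camera stream as the backbone.
            return self.norm(cam_tokens + fused)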
Robust Deep Multi-Modal Sensor Fusion using Fusion Weight Regularization and Target Learning
Sensor fusion has wide applications in many domains including health care and
autonomous systems. While the advent of deep learning has enabled promising
multi-modal fusion of high-level features and end-to-end sensor fusion
solutions, existing deep learning based sensor fusion techniques including deep
gating architectures are not always resilient, leading to the issue of fusion
weight inconsistency. We propose deep multi-modal sensor fusion architectures
with enhanced robustness particularly under the presence of sensor failures. At
the core of our gating architectures are fusion weight regularization and
fusion target learning operating on auxiliary unimodal sensing networks
appended to the main fusion model. The proposed regularized gating
architectures outperform the existing deep learning architectures with and
without gating under both clean and corrupted sensory inputs resulting from
sensor failures. The demonstrated improvements are particularly pronounced when
one or more sensory modalities are corrupted.
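One way to picture fusion weight regularization is a penalty that pulls the gate's per-modality weights toward targets derived from the auxiliary unimodal networks, so that a failing sensor (high unimodal loss) receives a low fusion weight. The target construction and all names below are illustrative assumptions, not the paper's formulation.

    import torch

    def fusion_weight_regularizer(gate_weights, unimodal_losses, beta=0.1):
        # gate_weights: (B, M) softmax outputs of the fusion gate.
        # unimodal_losses: (M,) losses of the auxiliary unimodal networks.
        # Lower unimodal loss -> higher target fusion weight.
        target = torch.softmax(-unimodal_losses.detach(), dim=0)
        # Penalize deviation of the gate from the loss-derived target.
        return beta * ((gate_weights - target.unsqueeze(0)) ** 2).mean()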
3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection
In this paper, we propose a new deep architecture for fusing camera and LiDAR
sensors for 3D object detection. Because the camera and LiDAR sensor signals
have different characteristics and distributions, fusing these two modalities
is expected to improve both the accuracy and robustness of 3D object detection.
One of the challenges presented by the fusion of cameras and LiDAR is that the
spatial feature maps obtained from each modality are represented by
significantly different views in the camera and world coordinates; hence, it is
not an easy task to combine two heterogeneous feature maps without loss of
information. To address this problem, we propose a method called 3D-CVF that
combines the camera and LiDAR features using the cross-view spatial feature
fusion strategy. First, the method employs auto-calibrated projection to
transform the 2D camera features to a smooth spatial feature map with the
highest correspondence to the LiDAR features in the bird's eye view (BEV)
domain. Then, a gated feature fusion network applies spatial attention maps
to mix the camera and LiDAR features appropriately according to
the region. Next, camera-LiDAR feature fusion is also achieved in the
subsequent proposal refinement stage. The camera features are fetched from the
2D camera-view domain via 3D RoI grid pooling and fused with the BEV features
for proposal refinement. Our evaluations, conducted on the KITTI and nuScenes
3D object detection datasets, demonstrate that camera-LiDAR fusion offers a
significant performance gain over a single modality and that the proposed
3D-CVF achieves state-of-the-art performance on the KITTI benchmark.
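The gated fusion step can be pictured as a spatially varying blend of the two BEV feature maps: a 1x1 convolution predicts a per-location gate deciding how much camera versus LiDAR evidence to keep. Channel sizes and names are illustrative, and the actual 3D-CVF gating network may differ.

    import torch
    import torch.nn as nn

    class GatedBEVFusion(nn.Module):
        """Illustrative per-location gated fusion of BEV feature maps."""
        def __init__(self, channels=128):
            super().__init__()
            # 1x1 conv produces one gate value per BEV location.
            self.gate = nn.Conv2d(2 * channels, 1, kernel_size=1)

        def forward(self, cam_bev, lidar_bev):
            # cam_bev, lidar_bev: (B, C, H, W) feature maps in the BEV domain.
            g = torch.sigmoid(self.gate(torch.cat([cam_bev, lidar_bev], dim=1)))
            # Spatially varying convex combination of the two modalities.
            return g * cam_bev + (1 - g) * lidar_bev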
3D Dual-Fusion: Dual-Domain Dual-Query Camera-LiDAR Fusion for 3D Object Detection
Fusing data from cameras and LiDAR sensors is an essential technique to
achieve robust 3D object detection. One key challenge in camera-LiDAR fusion
involves mitigating the large domain gap between the two sensors in terms of
coordinates and data distribution when fusing their features. In this paper, we
propose a novel camera-LiDAR fusion architecture called 3D Dual-Fusion, which
is designed to mitigate the gap between the feature representations of camera
and LiDAR data. The proposed method fuses the features of the camera-view and
3D voxel-view domain and models their interactions through deformable
attention. We redesign the transformer fusion encoder to aggregate the
information from the two domains. Two major changes include 1) dual query-based
deformable attention to fuse the dual-domain features interactively and 2) 3D
local self-attention to encode the voxel-domain queries prior to dual-query
decoding. The results of an experimental evaluation show that the proposed
camera-LiDAR fusion architecture achieves competitive performance on the KITTI
and nuScenes datasets, with state-of-the-art performance in some 3D object
detection benchmark categories.
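A highly simplified reading of the dual-query design: one set of queries lives in the voxel domain and attends to camera features, a paired set lives in the camera-view domain and attends to voxel features, and the two updated streams are merged. In the sketch below, standard multi-head attention stands in for the paper's deformable attention, the 3D local self-attention step is omitted, and all names are assumptions.

    import torch
    import torch.nn as nn

    class DualQueryFusion(nn.Module):
        """Simplified dual-query fusion; standard attention replaces
        deformable attention for brevity."""
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.vox_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cam_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.merge = nn.Linear(2 * dim, dim)

        def forward(self, vox_q, cam_q, vox_feats, cam_feats):
            # vox_q, cam_q: (B, N, D) paired queries (same N in both domains).
            v, _ = self.vox_attn(vox_q, cam_feats, cam_feats)  # voxel queries read camera
            c, _ = self.cam_attn(cam_q, vox_feats, vox_feats)  # camera queries read voxels
            # Merge the two interactively updated query streams.
            return self.merge(torch.cat([v, c], dim=-1))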
Multimodal Panoptic Segmentation of 3D Point Clouds
The understanding and interpretation of complex 3D environments is a key challenge of autonomous driving. Lidar sensors and their recorded point clouds are particularly interesting for this challenge since they provide accurate 3D information about the environment. This work presents a multimodal approach based on deep learning for panoptic segmentation of 3D point clouds. It builds upon and combines three key aspects: a multi-view architecture, temporal feature fusion, and deep sensor fusion.
Radar Voxel Fusion for 3D Object Detection
Automotive traffic scenes are complex due to the variety of possible
scenarios, objects, and weather conditions that need to be handled. In contrast
to more constrained environments, such as automated underground trains,
automotive perception systems cannot be tailored to a narrow field of specific
tasks but must handle an ever-changing environment with unforeseen events. As
currently no single sensor is able to reliably perceive all relevant activity
in the surroundings, sensor data fusion is applied to perceive as much
information as possible. Data fusion of different sensors and sensor modalities
on a low abstraction level enables the compensation of sensor weaknesses and
misdetections among the sensors before the information-rich sensor data are
compressed, and information is thereby lost, in a sensor-individual object
detection stage. This paper develops a low-level sensor fusion network for 3D object
detection, which fuses lidar, camera, and radar data. The fusion network is
trained and evaluated on the nuScenes data set. On the test set, fusion of
radar data increases the resulting AP (Average Precision) detection score by
about 5.1% in comparison to the baseline lidar network. The radar sensor fusion
proves especially beneficial in inclement conditions such as rain and night
scenes. Fusing additional camera data contributes positively only in
conjunction with the radar fusion, which shows that interdependencies of the
sensors are important for the detection result. Additionally, the paper
proposes a novel loss to handle the discontinuity of a simple yaw
representation for object detection. Our updated loss increases the detection
and orientation estimation performance for all sensor input configurations. The
code for this research has been made available on GitHub.
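The yaw discontinuity the paper addresses can be illustrated with a common remedy: comparing angles through their sine and cosine rather than their raw values, so the loss stays continuous across the ±pi wrap-around. The sketch below shows this standard trick, not necessarily the paper's exact loss.

    import torch

    def yaw_loss(pred_yaw, gt_yaw):
        # Raw angle differences jump at the ±pi boundary; the (sin, cos)
        # embedding of the angle is continuous there.
        return (torch.abs(torch.sin(pred_yaw) - torch.sin(gt_yaw))
                + torch.abs(torch.cos(pred_yaw) - torch.cos(gt_yaw))).mean()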