485 research outputs found

    A Critical Review of Deep Learning-Based Multi-Sensor Fusion Techniques

    Get PDF
    In this review, we provide a detailed coverage of multi-sensor fusion techniques that use RGB stereo images and a sparse LiDAR-projected depth map as input data to output a dense depth map prediction. We cover state-of-the-art fusion techniques which, in recent years, have been deep learning-based methods that are end-to-end trainable. We then conduct a comparative evaluation of the state-of-the-art techniques and provide a detailed analysis of their strengths and limitations as well as the applications they are best suited for

    Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion

    Full text link
    LiDAR point cloud analysis is a core task for 3D computer vision, especially for autonomous driving. However, due to the severe sparsity and noise interference in the single sweep LiDAR point cloud, the accurate semantic segmentation is non-trivial to achieve. In this paper, we propose a novel sparse LiDAR point cloud semantic segmentation framework assisted by learned contextual shape priors. In practice, an initial semantic segmentation (SS) of a single sweep point cloud can be achieved by any appealing network and then flows into the semantic scene completion (SSC) module as the input. By merging multiple frames in the LiDAR sequence as supervision, the optimized SSC module has learned the contextual shape priors from sequential LiDAR data, completing the sparse single sweep point cloud to the dense one. Thus, it inherently improves SS optimization through fully end-to-end training. Besides, a Point-Voxel Interaction (PVI) module is proposed to further enhance the knowledge fusion between SS and SSC tasks, i.e., promoting the interaction of incomplete local geometry of point cloud and complete voxel-wise global structure. Furthermore, the auxiliary SSC and PVI modules can be discarded during inference without extra burden for SS. Extensive experiments confirm that our JS3C-Net achieves superior performance on both SemanticKITTI and SemanticPOSS benchmarks, i.e., 4% and 3% improvement correspondingly.Comment: To appear in AAAI 2021. Codes are available at https://github.com/yanx27/JS3C-Ne

    Deep Learning for 2D and 3D Scene Understanding

    Get PDF
    This thesis comprises a body of work that investigates the use of deep learning for 2D and 3D scene understanding. Although there has been significant progress made in computer vision using deep learning, a lot of that progress has been relative to performance benchmarks, and for static images; it is common to find that good performance on one benchmark does not necessarily mean good generalization to the kind of viewing conditions that might be encountered by an autonomous robot or agent. In this thesis, we address a variety of problems motivated by the desire to see deep learning algorithms generalize better to robotic vision scenarios. Specifically, we span topics of multi-object detection, unsupervised domain adaptation for semantic segmentation, video object segmentation, and semantic scene completion. First, most modern object detectors use a final post-processing step known as Non-maximum suppression (GreedyNMS). This suffers an inevitable trade-off between precision and recall in crowded scenes. To overcome this limitation, we propose a Pairwise-NMS to cure GreedyNMS. Specifically, a pairwise-relationship network that is based on deep learning is learned to predict if two overlapping proposal boxes contain two objects or zero/one object, which can handle multiple overlapping objects effectively. A common issue in training deep neural networks is the need for large training sets. One approach to this is to use simulated image and video data, but this suffers from a domain gap wherein the performance on real-world data is poor relative to performance on the simulation data. We target a few approaches to addressing so-called domain adaptation for semantic segmentation: (1) Single and multi-exemplars are employed for each class in order to cluster the per-pixel features in the embedding space; (2) Class-balanced self-training strategy is utilized for generating pseudo labels in the target domain; (3) Moreover, a convolutional adaptor is adopted to enforce the features in the source domain and target domain are closed with each other. Next, we tackle the video object segmentation by formulating it as a meta-learning problem, where the base learner aims to learn semantic scene understanding for general objects, and the meta learner quickly adapts the appearance of the target object with a few examples. Our proposed meta-learning method uses a closed-form optimizer, the so-called \ridge regression", which is conducive to fast and better training convergence. One-shot video object segmentation (OSVOS) has the limitation to \overemphasize" the generic semantic object information while \diluting" the instance cues of the object(s), which largely block the whole training process. Through adding a common module, video loss, which we formulate with various forms of constraints (including weighted BCE loss, high-dimensional triplet loss, as well as a novel mixed instance-aware video loss), to train the parent network, the network is then better prepared for the online fine-tuning. Next, we introduce a light-weight Dimensional Decomposition Residual network (DDR) for 3D dense prediction tasks. The novel factorized convolution layer is effective for reducing the network parameters, and the proposed multi-scale fusion mechanism for depth and color image can improve the completion and segmentation accuracy simultaneously. Moreover, we propose PALNet, a novel hybrid network for Semantic Scene Completion(SSC) based on single depth. PALNet utilizes a two-stream network to extract both 2D and 3D features from multi-stages using fine-grained depth information to eficiently capture the context, as well as the geometric cues of the scene. Position Aware Loss (PA-Loss) considers Local Geometric Anisotropy to determine the importance of different positions within the scene. It is beneficial for recovering key details like the boundaries of objects and the corners of the scene. Finally, we propose a 3D gated recurrent fusion network (GRFNet), which learns to adaptively select and fuse the relevant information from depth and RGB by making use of the gate and memory modules. Based on the single-stage fusion, we further propose a multi-stage fusion strategy, which could model the correlations among different stages within the network.Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202

    FlowLens: Seeing Beyond the FoV via Flow-guided Clip-Recurrent Transformer

    Full text link
    Limited by hardware cost and system size, camera's Field-of-View (FoV) is not always satisfactory. However, from a spatio-temporal perspective, information beyond the camera's physical FoV is off-the-shelf and can actually be obtained "for free" from the past. In this paper, we propose a novel task termed Beyond-FoV Estimation, aiming to exploit past visual cues and bidirectional break through the physical FoV of a camera. We put forward a FlowLens architecture to expand the FoV by achieving feature propagation explicitly by optical flow and implicitly by a novel clip-recurrent transformer, which has two appealing features: 1) FlowLens comprises a newly proposed Clip-Recurrent Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global information accumulated in the temporal dimension. 2) A multi-branch Mix Fusion Feed Forward Network (MixF3N) is integrated to enhance the spatially-precise flow of local features. To foster training and evaluation, we establish KITTI360-EX, a dataset for outer- and inner FoV expansion. Extensive experiments on both video inpainting and beyond-FoV estimation tasks show that FlowLens achieves state-of-the-art performance. Code will be made publicly available at https://github.com/MasterHow/FlowLens.Comment: Code will be made publicly available at https://github.com/MasterHow/FlowLen
    • …
    corecore