4 research outputs found
Depth Based Semantic Scene Completion with Position Importance Aware Loss
Semantic Scene Completion (SSC) refers to the task of inferring the 3D
semantic segmentation of a scene while simultaneously completing the 3D shapes.
We propose PALNet, a novel hybrid network for SSC based on a single depth map.
PALNet utilizes a two-stream network to extract both 2D and 3D features at
multiple stages, using fine-grained depth information to efficiently capture
the context as well as the geometric cues of the scene. Current methods for SSC
treat all parts of the scene equally, paying unnecessary attention to the
interior of objects. To address this problem, we propose a Position Aware
Loss (PA-Loss), which weights voxels by their positional importance during training.
Specifically, PA-Loss considers Local Geometric Anisotropy to determine the
importance of different positions within the scene. It is beneficial for
recovering key details like the boundaries of objects and the corners of the
scene. Comprehensive experiments on two benchmark datasets demonstrate the
effectiveness of the proposed method and its superior performance. Models and a
video demo can be found at: https://github.com/UniLauX/PALNet.
Comment: ICRA 2020, in conjunction with RA
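To make the position-importance idea concrete, here is a minimal PyTorch sketch of a PA-Loss-style weighted cross-entropy. Local Geometric Anisotropy is approximated as the number of 6-connected neighbors whose label differs from the center voxel; the 1 + alpha * LGA weighting and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def local_geometric_anisotropy(labels: torch.Tensor) -> torch.Tensor:
    """For each voxel, count the 6-connected neighbors whose semantic label
    differs. Boundary and corner voxels get high counts; object interiors
    get zero. `labels` is a (D, H, W) LongTensor of class indices."""
    D, H, W = labels.shape
    padded = F.pad(labels, (1, 1, 1, 1, 1, 1), value=-1)  # -1 marks outside the volume
    lga = torch.zeros(labels.shape, dtype=torch.float)
    for dz, dy, dx in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                       (0, -1, 0), (0, 0, 1), (0, 0, -1)]:
        nb = padded[1 + dz:1 + dz + D, 1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
        lga += ((nb != labels) & (nb >= 0)).float()  # ignore out-of-volume neighbors
    return lga

def pa_loss(logits: torch.Tensor, labels: torch.Tensor, alpha: float = 0.5):
    """Position-aware cross-entropy: voxels with higher LGA contribute more.
    `logits` is (C, D, H, W). The 1 + alpha * LGA weighting is an
    illustrative choice, not the paper's exact scheme."""
    weights = 1.0 + alpha * local_geometric_anisotropy(labels)
    ce = F.cross_entropy(logits.unsqueeze(0), labels.unsqueeze(0),
                         reduction="none").squeeze(0)
    return (weights * ce).mean()
```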
Anisotropic Convolutional Networks for 3D Semantic Scene Completion
As a voxel-wise labeling task, semantic scene completion (SSC) tries to
simultaneously infer the occupancy and semantic labels for a scene from a
single depth and/or RGB image. The key challenge for SSC is how to effectively
take advantage of the 3D context to model various objects and stuff with severe
variations in shape, layout, and visibility. To handle such variations, we
propose a novel module called anisotropic convolution, which offers a
flexibility and modeling power that competing methods, such as standard 3D
convolution and some of its variants, cannot match. In contrast to standard 3D
convolution, which is limited to a fixed 3D receptive field, our module is
capable of modeling dimensional anisotropy voxel by voxel. The basic idea is
to enable an anisotropic 3D receptive field by decomposing a 3D convolution
into three consecutive 1D convolutions, with the kernel size of each 1D
convolution adaptively determined on the fly. By stacking multiple such
anisotropic convolution modules, the voxel-wise modeling capability can be
further enhanced while keeping the number of model parameters under control.
Extensive experiments on two SSC benchmarks, NYU-Depth-v2 and NYUCAD, show the
superior performance of the proposed method. Our code is available at
https://waterljwant.github.io/SSC
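As a rough illustration of the decomposition idea, the sketch below builds a 3D convolution from three consecutive 1D convolutions, one per spatial axis, and blends several candidate kernel sizes per voxel with learned softmax weights. All module and parameter names are hypothetical; this is a minimal sketch of the idea under those assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AxisAnisotropicConv(nn.Module):
    """One axis of a hypothetical anisotropic convolution: run 1D
    convolutions of several candidate kernel sizes along `dim` and blend
    them per voxel with softmax weights predicted from the input."""

    def __init__(self, channels: int, dim: int, sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList()
        for k in sizes:
            kernel, pad = [1, 1, 1], [0, 0, 0]
            kernel[dim - 2] = k      # dim: 2, 3, or 4 -> D, H, or W of (N, C, D, H, W)
            pad[dim - 2] = k // 2    # keep spatial size unchanged
            self.convs.append(nn.Conv3d(channels, channels, kernel, padding=pad))
        # Per-voxel selection weights, one per candidate kernel size.
        self.select = nn.Conv3d(channels, len(sizes), kernel_size=1)

    def forward(self, x):
        w = torch.softmax(self.select(x), dim=1)                    # (N, K, D, H, W)
        branches = torch.stack([c(x) for c in self.convs], dim=1)   # (N, K, C, D, H, W)
        return (w.unsqueeze(2) * branches).sum(dim=1)               # (N, C, D, H, W)

class AnisotropicConv3d(nn.Module):
    """Three consecutive 1D passes, one per spatial axis."""

    def __init__(self, channels: int):
        super().__init__()
        self.axes = nn.Sequential(
            AxisAnisotropicConv(channels, dim=2),
            AxisAnisotropicConv(channels, dim=3),
            AxisAnisotropicConv(channels, dim=4),
        )

    def forward(self, x):
        return self.axes(x)
```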
S3CNet: A Sparse Semantic Scene Completion Network for LiDAR Point Clouds
With the increasing reliance of self-driving and similar robotic systems on
robust 3D vision, the processing of LiDAR scans with deep convolutional neural
networks has become a trend in academia and industry alike. Prior attempts on
the challenging Semantic Scene Completion task - which entails the inference of
dense 3D structure and associated semantic labels from "sparse" representations
- have been, to a degree, successful in small indoor scenes when provided with
dense point clouds or dense depth maps often fused with semantic segmentation
maps from RGB images. However, the performance of these systems drops
drastically when applied to large outdoor scenes characterized by dynamic and
exponentially sparser conditions. Likewise, processing the entire sparse
volume becomes infeasible due to memory limitations, and workarounds introduce
computational inefficiency as practitioners are forced to divide the overall
volume into multiple equal segments and infer on each individually, rendering
real-time performance impossible. In this work, we formulate a method that
exploits the sparsity of large-scale environments and present S3CNet, a
sparse-convolution-based neural network that predicts the semantically completed scene
from a single, unified LiDAR point cloud. We show that our proposed method
outperforms all counterparts on the 3D task, achieving state-of-the-art results
on the SemanticKITTI benchmark. Furthermore, we propose a 2D variant of S3CNet
with a multi-view fusion strategy to complement our 3D network, providing
robustness to occlusions and extreme sparsity in distant regions. We conduct
experiments for the 2D semantic scene completion task and compare the results
of our sparse 2D network against several leading LiDAR segmentation models
adapted for bird's eye view segmentation on two open-source datasets.
Comment: 14 pages
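For context on the sparse representation such a network consumes, here is a generic voxelization sketch: a LiDAR cloud is reduced to the coordinates of occupied voxels plus averaged per-voxel features, the (coordinates, features) pair expected by sparse-convolution libraries such as MinkowskiEngine or spconv. The function name and voxel size are illustrative, and this is generic preprocessing, not S3CNet's exact pipeline.

```python
import torch

def voxelize_sparse(points: torch.Tensor, voxel_size: float = 0.2):
    """Convert an (N, 4) LiDAR cloud (x, y, z, intensity) into sparse
    (coordinates, features). Only occupied voxels are stored, which is
    what makes large outdoor volumes tractable."""
    coords = torch.floor(points[:, :3] / voxel_size).long()
    # Deduplicate: average the features of all points falling in one voxel.
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)
    feats = torch.zeros(uniq.shape[0], points.shape[1], dtype=points.dtype)
    feats.index_add_(0, inverse, points)
    counts = torch.bincount(inverse, minlength=uniq.shape[0]).unsqueeze(1)
    return uniq, feats / counts.clamp(min=1)

# Example: 100k random points reduce to far fewer occupied voxels.
cloud = torch.rand(100_000, 4) * torch.tensor([80.0, 80.0, 4.0, 1.0])
coords, feats = voxelize_sparse(cloud)
print(coords.shape, feats.shape)
```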
Deep Learning for 2D and 3D Scene Understanding
This thesis comprises a body of work that investigates the use of deep learning for 2D and 3D scene understanding. Although deep learning has driven significant progress in computer vision, much of that progress has been measured against performance benchmarks on static images; good performance on one benchmark does not necessarily mean good generalization to the viewing conditions an autonomous robot or agent might encounter. In this thesis, we address a variety of problems motivated by the desire to see deep learning algorithms generalize better to robotic vision scenarios. Specifically, we span the topics of multi-object detection, unsupervised domain adaptation for semantic segmentation, video object segmentation, and semantic scene completion.

First, most modern object detectors use a final post-processing step known as non-maximum suppression (GreedyNMS), which suffers from an inevitable trade-off between precision and recall in crowded scenes. To overcome this limitation, we propose Pairwise-NMS to remedy GreedyNMS. Specifically, a deep pairwise-relationship network is trained to predict whether two overlapping proposal boxes contain two objects or zero/one object, which handles multiple overlapping objects effectively.

A common issue in training deep neural networks is the need for large training sets. One approach is to use simulated image and video data, but this suffers from a domain gap: performance on real-world data is poor relative to performance on the simulation data. We pursue several approaches to this so-called domain adaptation for semantic segmentation: (1) single and multiple exemplars are employed for each class in order to cluster the per-pixel features in the embedding space; (2) a class-balanced self-training strategy is used to generate pseudo labels in the target domain; and (3) a convolutional adaptor is adopted to encourage the features of the source and target domains to be close to each other.

Next, we tackle video object segmentation by formulating it as a meta-learning problem, where the base learner aims to learn semantic scene understanding for general objects, and the meta-learner quickly adapts to the appearance of the target object from a few examples. Our proposed meta-learning method uses a closed-form optimizer, so-called "ridge regression", which is conducive to fast and better training convergence. One-shot video object segmentation (OSVOS) tends to "overemphasize" generic semantic object information while "diluting" the instance cues of the object(s), which largely hampers the training process. By adding a common module, a video loss, which we formulate with various forms of constraints (including a weighted BCE loss, a high-dimensional triplet loss, and a novel mixed instance-aware video loss), to the training of the parent network, the network is better prepared for online fine-tuning.

Next, we introduce a lightweight Dimensional Decomposition Residual (DDR) network for 3D dense prediction tasks. The novel factorized convolution layer is effective at reducing the network parameters, and the proposed multi-scale fusion mechanism for depth and color images improves completion and segmentation accuracy simultaneously. Moreover, we propose PALNet, a novel hybrid network for Semantic Scene Completion (SSC) based on a single depth map. PALNet utilizes a two-stream network to extract both 2D and 3D features at multiple stages, using fine-grained depth information to efficiently capture the context as well as the geometric cues of the scene. A Position Aware Loss (PA-Loss) considers Local Geometric Anisotropy to determine the importance of different positions within the scene, which is beneficial for recovering key details like the boundaries of objects and the corners of the scene. Finally, we propose a 3D gated recurrent fusion network (GRFNet), which learns to adaptively select and fuse the relevant information from depth and RGB by making use of gate and memory modules. Building on this single-stage fusion, we further propose a multi-stage fusion strategy that models the correlations among different stages within the network.

Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
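As a worked illustration of the closed-form "ridge regression" optimizer mentioned in the video object segmentation part of the thesis abstract above: the base learner can be fit in one step as W = (XᵀX + λI)⁻¹XᵀY, which is fully differentiable and therefore suitable for meta-learning. The shapes and names below are illustrative assumptions, not the thesis implementation.

```python
import torch

def ridge_fit(X: torch.Tensor, Y: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T Y.
    Differentiable end to end, so a meta-learner can backpropagate
    through this one-step adaptation. Shapes: X is (n, d), Y is (n, k)."""
    d = X.shape[1]
    A = X.t() @ X + lam * torch.eye(d, dtype=X.dtype)
    return torch.linalg.solve(A, X.t() @ Y)

# Hypothetical few-shot adaptation: 16 support-pixel features (d = 64)
# with binary foreground/background targets.
X = torch.randn(16, 64)
Y = torch.randint(0, 2, (16, 1)).float()
W = ridge_fit(X, Y)
scores = torch.randn(100, 64) @ W  # predictions for 100 query features
```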