    Adversarial Semantic Scene Completion from a Single Depth Image

    We propose a method to reconstruct, complete and semantically label a 3D scene from a single input depth image. We improve the accuracy of the regressed semantic 3D maps with a novel architecture based on adversarial learning. In particular, we suggest using multiple adversarial loss terms that not only enforce realistic outputs with respect to the ground truth, but also encourage an effective embedding of the internal features. This is done by correlating the latent features of the encoder working on partial 2.5D data with the latent features extracted from a variational 3D auto-encoder trained to reconstruct the complete semantic scene. In addition, differently from other approaches that operate entirely through 3D convolutions, at test time we retain the original 2.5D structure of the input during downsampling to improve the effectiveness of the internal representation of our model. We test our approach on the main benchmark datasets for semantic scene completion to qualitatively and quantitatively assess the effectiveness of our proposal. Comment: 2018 International Conference on 3D Vision (3DV)
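    To make the loss design above concrete, the following is a minimal PyTorch-style sketch of a generator objective combining a supervised voxel-wise term with adversarial terms on both the completed scene and the encoder's latent code. All module names (enc25d, dec3d, d_out, d_latent) and the loss weights are hypothetical placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def generator_loss(enc25d, dec3d, d_out, d_latent, depth, gt_labels):
    """depth: (B, 1, H, W) input 2.5D view; gt_labels: (B, D, H, W) voxel class ids."""
    z = enc25d(depth)                       # latent code from the partial 2.5D observation
    pred = dec3d(z)                         # (B, C, D, H, W) semantic voxel logits

    # 1) supervised voxel-wise cross-entropy against the complete ground-truth scene
    l_sup = F.cross_entropy(pred, gt_labels)

    # 2) adversarial term on the completed scene ("look like a real complete scene")
    d_fake = d_out(pred)
    l_adv_out = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

    # 3) adversarial term on the latent code: a second discriminator (trained elsewhere
    #    against latents of a 3D VAE fit to complete semantic scenes, not shown here)
    #    pushes the 2.5D embedding towards that latent distribution
    d_lat = d_latent(z)
    l_adv_lat = F.binary_cross_entropy_with_logits(d_lat, torch.ones_like(d_lat))

    return l_sup + 0.1 * l_adv_out + 0.1 * l_adv_lat   # weights are placeholders
```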

    Im2Pano3D: Extrapolating 360 Structure and Semantics Beyond the Field of View

    We present Im2Pano3D, a convolutional neural network that generates a dense prediction of 3D structure and a probability distribution of semantic labels for a full 360° panoramic view of an indoor scene when given only a partial observation (<= 50%) in the form of an RGB-D image. To make this possible, Im2Pano3D leverages strong contextual priors learned from large-scale synthetic and real-world indoor scenes. To ease the prediction of 3D structure, we propose to parameterize 3D surfaces with their plane equations and train the model to predict these parameters directly. To provide meaningful training supervision, we use multiple loss functions that consider both pixel-level accuracy and global context consistency. Experiments demonstrate that Im2Pano3D is able to predict the semantics and 3D structure of the unobserved scene with more than 56% pixel accuracy and less than 0.52m average distance error, which is significantly better than alternative approaches. Comment: Video summary: https://youtu.be/Au3GmktK-S
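    As a rough illustration of the plane parameterization mentioned above, the sketch below recovers per-pixel depth by intersecting each viewing ray with the plane predicted for that pixel. The exact parameterization and conventions used by Im2Pano3D may differ; the function and tensor names here are assumptions.

```python
import torch

def depth_from_planes(normals, offsets, rays, eps=1e-6):
    """normals: (B, 3, H, W) predicted unit plane normals
       offsets: (B, 1, H, W) predicted plane offsets d, for planes n·x = d
       rays:    (B, 3, H, W) unit viewing-ray directions per pixel
       returns: (B, 1, H, W) depth along each ray"""
    # A point on the ray is x = t * ray; substituting into n·x = d gives t = d / (n·ray).
    denom = (normals * rays).sum(dim=1, keepdim=True)
    # guard against near-grazing planes where n·ray is close to zero
    denom = torch.where(denom.abs() < eps, torch.full_like(denom, eps), denom)
    return offsets / denom
```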

    Dense 3D Object Reconstruction from a Single Depth View

    In this paper, we propose a novel approach, 3D-RecGAN++, which reconstructs the complete 3D structure of a given object from a single arbitrary depth view using generative adversarial networks. Unlike existing work, which typically requires multiple views of the same object or class labels to recover the full 3D geometry, the proposed 3D-RecGAN++ only takes the voxel grid representation of a depth view of the object as input, and is able to generate the complete 3D occupancy grid at a high resolution of 256^3 by recovering the occluded/missing regions. The key idea is to combine the generative capabilities of autoencoders and the conditional Generative Adversarial Networks (GAN) framework to infer accurate and fine-grained 3D structures of objects in high-dimensional voxel space. Extensive experiments on large synthetic datasets and real-world Kinect datasets show that the proposed 3D-RecGAN++ significantly outperforms the state of the art in single-view 3D object reconstruction, and is able to reconstruct unseen types of objects. Comment: TPAMI 2018. Code and data are available at: https://github.com/Yang7879/3D-RecGAN-extended. This article extends from arXiv:1708.0796
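    The following is a minimal sketch of the kind of objective described above, combining a voxel-wise reconstruction loss with a conditional adversarial term in which the discriminator judges (input, completion) pairs. Module names, tensor shapes and the loss weighting are placeholders rather than the released 3D-RecGAN++ code.

```python
import torch
import torch.nn.functional as F

def recgan_generator_loss(generator, discriminator, partial_voxels, gt_voxels, lam=0.99):
    """partial_voxels, gt_voxels: (B, 1, D, H, W) occupancy grids with values in [0, 1]."""
    pred = generator(partial_voxels)                 # completed occupancy logits

    # reconstruction term: voxel-wise BCE against the complete ground-truth grid
    l_rec = F.binary_cross_entropy_with_logits(pred, gt_voxels)

    # conditional adversarial term: the discriminator judges (input, completion) pairs,
    # so the completion must stay consistent with the observed partial view
    fake_pair = torch.cat([partial_voxels, torch.sigmoid(pred)], dim=1)
    d_fake = discriminator(fake_pair)
    l_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

    return lam * l_rec + (1.0 - lam) * l_adv         # heavy weight on reconstruction is a guess
```

    Weighting the reconstruction term heavily keeps training stable while the adversarial term sharpens fine structure; the discriminator's own update (classifying real versus completed pairs) is omitted here.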

    Learning Shape Priors for Single-View 3D Completion and Reconstruction

    The problem of single-view 3D shape completion or reconstruction is challenging, because among the many possible shapes that explain an observation, most are implausible and do not correspond to natural objects. Recent research in the field has tackled this problem by exploiting the expressiveness of deep convolutional networks. However, there is another level of ambiguity that is often overlooked: among plausible shapes, there are still multiple shapes that fit the 2D image equally well; i.e., the ground-truth shape is non-deterministic given a single-view input. Existing fully supervised approaches fail to address this issue, and often produce blurry mean shapes with smooth surfaces but no fine details. In this paper, we propose ShapeHD, pushing the limit of single-view shape completion and reconstruction by integrating deep generative models with adversarially learned shape priors. The learned priors serve as a regularizer, penalizing the model only if its output is unrealistic, not if it deviates from the ground truth. Our design thus overcomes both levels of ambiguity mentioned above. Experiments demonstrate that ShapeHD outperforms the state of the art by a large margin in both shape completion and shape reconstruction on multiple real datasets. Comment: ECCV 2018. The first two authors contributed equally to this work. Project page: http://shapehd.csail.mit.edu
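    The regularizing effect described above can be pictured as a realism penalty from a frozen, adversarially pretrained shape critic added to the usual supervised loss: the penalty depends only on how realistic the prediction looks, not on its distance to the ground truth. The sketch below is one plausible reading; all names and weights are assumptions, not the ShapeHD implementation.

```python
import torch
import torch.nn.functional as F

def shape_prior_loss(pred_logits, gt_voxels, shape_critic, w_prior=0.2):
    """pred_logits, gt_voxels: (B, 1, D, H, W); shape_critic is a frozen, pretrained network."""
    # supervised completion/reconstruction term
    l_sup = F.binary_cross_entropy_with_logits(pred_logits, gt_voxels)

    # naturalness penalty: the frozen critic scores the realism of the predicted shape;
    # gradients flow back into the generator while the critic's weights stay fixed,
    # so the model is penalized only for looking unrealistic, not for deviating from GT
    realism = shape_critic(torch.sigmoid(pred_logits))     # higher score = more realistic
    l_prior = F.softplus(-realism).mean()                  # non-saturating "look real" loss

    return l_sup + w_prior * l_prior
```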

    Semantic Segmentation and Completion of 2D and 3D Scenes

    Semantic segmentation is one of the fundamental problems in computer vision. This thesis addresses various tasks, all related to the fine-grained, i.e. pixel-wise or voxel-wise, semantic understanding of a scene. In recent years, semantic segmentation with 2D convolutional neural networks has become a de facto pre-processing step for many other computer vision tasks, since it outputs rich, spatially resolved feature maps and semantic labels that are useful for many higher-level recognition tasks. In this thesis, we make several contributions to the field of semantic scene understanding, using as input an image or a depth measurement recorded by different types of laser sensors.
Firstly, we propose a new approach to 2D semantic segmentation of images. It adapts an existing approach to real-time operation under the constrained hardware of a real-life drone, and is based on a highly optimized implementation of random forests combined with a label propagation strategy.
Next, we shift our focus to what we believe is one of the important next frontiers in computer vision: giving machines the ability to anticipate and extrapolate beyond what is captured in a single frame by a camera or depth sensor. This anticipation capability is what allows humans to interact efficiently with their environment. The need for it is most prominently displayed in the behaviour of today's autonomous cars. One of their shortcomings is that they only interpret the current sensor state, which prevents them from anticipating events that would require an adaptation of their driving policy. The result is frequent sudden braking and non-human-like driving behaviour, which can provoke accidents or negatively impact traffic flow. We therefore first propose a task of spatially anticipating semantic labels outside the field of view of an image. The task is based on the Cityscapes dataset, where each image has been center cropped; the goal is to train an algorithm that predicts the semantic segmentation map in the area outside the cropped input region (a minimal sketch of this setup is given after this abstract). Along with the task itself, we propose an efficient iterative approach based on 2D convolutional neural networks with a task-adapted loss function.
Afterwards, we switch to the 3D domain. In three dimensions, the goal shifts from assigning pixel-wise labels towards reconstructing the full 3D scene as a grid of labeled voxels, which requires anticipating the semantics and geometry of the space that is occluded by the objects themselves from the viewpoint of an image or laser sensor. This task is known as 3D semantic scene completion and has recently attracted a lot of attention. We propose two new approaches that advance the performance of existing 3D semantic scene completion baselines. The first is a two-stream approach that leverages a multi-modal input, consisting of images and Kinect depth measurements, in an early fusion scheme; we additionally propose a more memory-efficient input embedding. The second leverages the recently introduced generative adversarial networks (GANs): we construct a network architecture that follows the GAN principles and uses a discriminator network as an additional regularizer during 3D-CNN training. With our proposed approaches we achieve new state-of-the-art performance on two benchmark datasets.
Finally, we observe that one of the shortcomings in semantic scene completion is the lack of a realistic, large-scale dataset. We therefore introduce the first real-world dataset for semantic scene completion, based on the KITTI odometry benchmark. By semantically annotating all scans of a 10 Hz Velodyne laser scanner driving through urban and countryside areas, we obtain data that is valuable for many tasks, including semantic scene completion. Along with the data, we explore the performance of current semantic scene completion models as well as models for semantic point cloud segmentation and motion segmentation. The results show that there is still considerable room for improvement on all of these tasks, making our dataset a valuable contribution to future research in these directions.
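    As a concrete illustration of the outside-the-field-of-view anticipation task mentioned in the abstract above, the sketch below builds the training signal by showing the network only a center crop of each Cityscapes image and supervising the prediction on the hidden border region. The crop fraction, the ignore-index convention and all names are assumptions, not the thesis code.

```python
import torch
import torch.nn.functional as F

def outpainting_loss(model, image, label_map, keep_frac=0.5, ignore_index=255):
    """image: (B, 3, H, W) RGB input; label_map: (B, H, W) integer class ids."""
    B, _, H, W = image.shape
    h0, w0 = int(H * (1 - keep_frac) / 2), int(W * (1 - keep_frac) / 2)

    # the model only observes the center crop, zero-padded back to the full frame
    visible = torch.zeros_like(image)
    visible[:, :, h0:H - h0, w0:W - w0] = image[:, :, h0:H - h0, w0:W - w0]

    logits = model(visible)                      # (B, num_classes, H, W) full-frame prediction

    # supervise only the border region the model could not see
    target = label_map.clone()
    target[:, h0:H - h0, w0:W - w0] = ignore_index
    return F.cross_entropy(logits, target, ignore_index=ignore_index)
```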