
    RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling

    Beyond their success in classification, neural networks have recently shown strong results on pixel-wise prediction tasks such as semantic segmentation of RGBD images. However, the deconvolutional layers commonly used to upsample intermediate representations to the full-resolution output still exhibit several failure modes, such as imprecise segmentation boundaries and label mistakes, particularly on large, weakly textured objects (e.g. fridge, whiteboard, door). We attribute these errors in part to the rigid way current networks aggregate information, which can be either too local (missing context) or too global (inaccurate boundaries). We therefore propose a data-driven pooling layer that integrates with fully convolutional architectures and utilizes boundary detection from RGBD image segmentation approaches. We extend our approach to leverage region-level correspondences across images with an additional temporal pooling stage. We evaluate our approach on the NYU-Depth-V2 dataset of indoor RGBD video sequences and compare it to various state-of-the-art baselines. Besides a general improvement over the state of the art, our approach shows particularly good results in terms of the accuracy of the predicted boundaries and in segmenting previously problematic classes.
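The core idea of region-level, data-driven pooling can be sketched as averaging feature vectors within regions obtained from a boundary-aware segmentation and broadcasting each mean back to the region's pixels. This is a hypothetical illustration only, not the paper's implementation; the function and array names are ours:

```python
import numpy as np

def region_pooling(features, regions):
    """Average the feature vector of every pixel inside a region and
    broadcast the mean back to all pixels of that region.
    features: (H, W, C) float array; regions: (H, W) int label map."""
    pooled = np.empty_like(features)
    for r in np.unique(regions):
        mask = regions == r
        pooled[mask] = features[mask].mean(axis=0)  # region mean
    return pooled

# toy example: a 4x4 feature map with two regions split down the middle
feats = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
regions = np.zeros((4, 4), dtype=int)
regions[:, 2:] = 1
out = region_pooling(feats, regions)
```

Pooling over data-driven regions rather than a fixed grid is what lets the output follow object boundaries instead of smearing across them.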

    Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras

    Visual scene understanding is an important capability that enables robots to act purposefully in their environment. In this paper, we propose a novel approach to object-class segmentation from multiple RGB-D views using deep learning. We train a deep neural network to predict object-class semantics that are consistent across several viewpoints in a semi-supervised way. At test time, the semantic predictions of our network can be fused into semantic keyframe maps more consistently than the predictions of a network trained on individual views. We base our network architecture on a recent single-view deep learning approach to RGB and depth fusion for semantic object-class segmentation and enhance it with multi-scale loss minimization. We obtain the camera trajectory using RGB-D SLAM and warp the predictions for RGB-D images into ground-truth annotated frames in order to enforce multi-view consistency during training. At test time, predictions from multiple views are fused into keyframes. We propose and analyze several methods for enforcing multi-view consistency during training and testing. We evaluate the benefit of multi-view consistency training and demonstrate that pooling of deep features and fusion over multiple views outperform single-view baselines on the NYUDv2 benchmark for semantic segmentation. Our end-to-end trained network achieves state-of-the-art performance on the NYUDv2 dataset in single-view segmentation as well as in multi-view semantic fusion. Comment: the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2017)
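The final fusion step can be sketched very simply: once the per-view class-probability maps have been warped into a common keyframe, they are combined per pixel. The snippet below shows the most basic variant, a per-pixel average followed by an argmax; it is a hypothetical sketch of the idea, not the paper's fusion method, and the names are ours:

```python
import numpy as np

def fuse_views(prob_maps):
    """Fuse per-pixel class probabilities from several views that were
    already warped into one keyframe. prob_maps: list of (H, W, K)
    arrays; returns an (H, W) label map."""
    stacked = np.stack(prob_maps)   # (views, H, W, K)
    fused = stacked.mean(axis=0)    # average the views per pixel
    return fused.argmax(axis=-1)    # pick the most likely class

# toy example: two 1x2 views with 3 classes
v1 = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]])
v2 = np.array([[[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]])
labels = fuse_views([v1, v2])  # -> [[0, 1]]
```

Averaging lets confident, agreeing views outvote a single noisy prediction, which is why fused keyframe maps are more consistent than single-view outputs.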

    Three for one and one for three: Flow, Segmentation, and Surface Normals

    Optical flow, semantic segmentation, and surface normals represent different information modalities, yet together they provide stronger cues for scene understanding problems. In this paper, we study the influence among the three modalities: how each impacts the others and how effective they are in combination. We employ a modular approach using a convolutional refinement network that is trained with supervision but isolated from RGB images, so as to enforce joint modality features. To assist the training process, we create a large-scale synthetic outdoor dataset that supports dense annotation of semantic segmentation, optical flow, and surface normals. The experimental results show positive influence among the three modalities, especially for objects' boundaries, region consistency, and scene structures. Comment: BMVC 201
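The joint-modality input that such a refinement network consumes can be sketched as a channel-wise stack of the three per-pixel maps. This is a hypothetical illustration of the data layout only (the paper's actual architecture and channel counts may differ, and the names are ours):

```python
import numpy as np

def joint_modality_input(flow, seg_logits, normals):
    """Concatenate the three modalities channel-wise into one tensor,
    the kind of RGB-free input a joint refinement network might take.
    flow: (H, W, 2), seg_logits: (H, W, K), normals: (H, W, 3)."""
    assert flow.shape[:2] == seg_logits.shape[:2] == normals.shape[:2]
    return np.concatenate([flow, seg_logits, normals], axis=-1)

h, w = 8, 8
x = joint_modality_input(
    np.zeros((h, w, 2)),   # optical flow (u, v)
    np.zeros((h, w, 10)),  # class logits, K = 10 assumed here
    np.zeros((h, w, 3)),   # surface normals (nx, ny, nz)
)
```

Feeding the refinement network only these predicted maps, with the RGB image deliberately withheld, is what forces it to learn the cross-modality structure the abstract describes.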