
    Semantic Scene Segmentation with Minimal Labeling Effort

    Semantic scene segmentation - the process of assigning a semantic label to every pixel in an input image - is an important task in computer vision, where an autonomous system or robot needs to differentiate between the different parts of a scene and recognize the class of each one for adequate physical interaction. The most successful methods for this problem are fully-supervised approaches based on Convolutional Neural Networks (CNNs). Unfortunately, these methods require large amounts of training images with pixel-level annotations, which are expensive and time-consuming to obtain. In this thesis, we aim to alleviate the manual effort of annotating real images by designing either weakly-supervised learning strategies that can leverage image-level annotations, such as image tags, which are cheaper to obtain, or effective ways to exploit synthetic data, which can be labeled automatically. In particular, we make several contributions to the literature on semantic scene segmentation with minimal labeling effort. Firstly, we introduce a novel weakly-supervised semantic segmentation technique that relies on one of the lowest levels of human supervision: image-level tags, which simply indicate which classes are present or absent in an image. The proposed method extracts remarkably accurate foreground/background masks from the pre-trained network itself, without resorting to external objectness modules or pixel-level/bounding-box annotations, and uses them as priors in an appropriate loss function. Secondly, we improve the performance of this framework by extracting class-specific foreground masks instead of a single generic foreground mask, at virtually no additional annotation cost. Thirdly, we observe that a general limitation of existing tag-based semantic segmentation techniques is the assumption of a single background class in the scene; by relying on networks pre-trained for object recognition or on objectness modules, these methods are restricted to segmenting foreground objects only. In practical applications such as autonomous navigation, however, it is often crucial to reason about multiple background classes. We therefore introduce a weakly-supervised video semantic segmentation method that handles multiple foreground and multiple background classes by making use of classifier heatmaps: we develop a two-stream deep architecture that jointly leverages appearance and motion, and design a loss based on the heatmaps to train this network. As our last contribution, we propose a novel technique for exploiting synthetic data that lets us perform semantic segmentation without any manual annotation, not even image-level tags. Unlike existing approaches that utilize synthetic data, our method does not require seeing any real images during training. It builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently. All the methods introduced in this thesis are evaluated on standard semantic segmentation datasets comprising both single-background and multiple-background scenes. The experiments in each chapter provide compelling evidence that all of our approaches outperform contemporary baselines.
All in all, semantic scene segmentation methods with minimal labeling effort, such as those presented in this thesis, are crucial for making the annotation process less expensive in terms of both time and money. They also make large-scale semantic segmentation far more practical than current models relying on full supervision, and lead to solutions that generalize much better than existing ones, thanks to the use of images depicting a great diversity of scenes.
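    The first contribution above can be made concrete with a small sketch. Below is a minimal, illustrative tag-driven loss in PyTorch, assuming a max-pooled image-level class score, background at channel 0, and a binary cross-entropy term against the extracted foreground prior; these choices are assumptions for illustration, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def tag_driven_loss(logits, tags, fg_prior=None):
    """logits: (B, C, H, W) per-pixel class scores (channel 0 assumed background);
    tags: (B, C) float 0/1 image-level labels;
    fg_prior: optional (B, 1, H, W) foreground mask from a pre-trained network."""
    probs = F.softmax(logits, dim=1)
    # Image-level class score = most confident pixel: a present class should
    # fire confidently somewhere, an absent class should fire nowhere.
    scores = probs.amax(dim=(2, 3)).clamp(1e-6, 1 - 1e-6)  # (B, C)
    cls_loss = -(tags * scores.log() + (1 - tags) * (1 - scores).log()).mean()
    if fg_prior is None:
        return cls_loss
    # Encourage total foreground probability mass to agree with the mask prior.
    fg_prob = probs[:, 1:].sum(dim=1, keepdim=True).clamp(1e-6, 1 - 1e-6)
    return cls_loss + F.binary_cross_entropy(fg_prob, fg_prior)
```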

    A Weakly Supervised Approach for Estimating Spatial Density Functions from High-Resolution Satellite Imagery

    We propose a neural network component, the regional aggregation layer, that makes it possible to train a pixel-level density estimator using only coarse-grained density aggregates, which reflect the number of objects in an image region. Our approach is simple to use and does not require domain-specific assumptions about the nature of the density function. We evaluate our approach on several synthetic datasets. In addition, we use this approach to learn to estimate high-resolution population and housing density from satellite imagery. In all cases, we find that our approach results in better density estimates than a commonly used baseline. We also show how our housing density estimator can be used to classify buildings as residential or non-residential. (10 pages, 8 figures; ACM SIGSPATIAL 2018, Seattle, US.)
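    A minimal sketch of the regional-aggregation idea, assuming a hypothetical interface (a non-negative per-pixel density map, an integer region-id map, and per-region object counts); the paper's actual layer may be implemented differently:

```python
import torch
import torch.nn.functional as F

def regional_aggregation_loss(density, region_ids, region_counts):
    """density: (B, 1, H, W) non-negative per-pixel density predictions;
    region_ids: (B, H, W) long tensor, region id of each pixel;
    region_counts: (B, R) ground-truth object count per region."""
    B, R = region_counts.shape
    flat = density.view(B, -1)                  # (B, H*W)
    ids = region_ids.view(B, -1)                # (B, H*W)
    # Aggregate the predicted density into per-region sums.
    sums = torch.zeros(B, R, device=density.device)
    sums.scatter_add_(1, ids, flat)             # sums[b, r] = total density in region r
    # The coarse counts are the only supervision; no pixel labels are used.
    return F.mse_loss(sums, region_counts)
```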

    A Cross-Season Correspondence Dataset for Robust Semantic Segmentation

    In this paper, we present a method to utilize 2D-2D point matches between images taken under different imaging conditions to train a convolutional neural network for semantic segmentation. Enforcing label consistency across the matches makes the final segmentation algorithm robust to seasonal changes. We describe how these 2D-2D matches can be generated with little human interaction by geometrically matching points from 3D models built from images. Two cross-season correspondence datasets are created, providing 2D-2D matches across seasonal changes as well as from day to night. The datasets are made publicly available to facilitate further research. We show that adding the correspondences as extra supervision during training improves the segmentation performance of the convolutional neural network, making it more robust to seasonal changes and weather conditions. (In Proc. CVPR 2019.)
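    The label-consistency supervision can be sketched as follows; the symmetric-KL form and the (row, col) match interface are assumptions for illustration, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def correspondence_loss(logits_a, logits_b, pts_a, pts_b):
    """logits_*: (C, H, W) segmentation scores for two images of the same scene;
    pts_*: (N, 2) long tensors of matched (row, col) pixel coordinates."""
    # Gather per-point class log-distributions at the matched locations.
    log_p_a = F.log_softmax(logits_a[:, pts_a[:, 0], pts_a[:, 1]], dim=0)  # (C, N)
    log_p_b = F.log_softmax(logits_b[:, pts_b[:, 0], pts_b[:, 1]], dim=0)
    n = pts_a.shape[0]
    # Symmetric KL between matched class distributions, averaged over matches,
    # penalizes inconsistent predictions across seasons/conditions.
    kl_ab = F.kl_div(log_p_a, log_p_b, log_target=True, reduction="sum") / n
    kl_ba = F.kl_div(log_p_b, log_p_a, log_target=True, reduction="sum") / n
    return 0.5 * (kl_ab + kl_ba)
```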

    Pseudo Mask Augmented Object Detection

    In this work, we present a novel and effective framework to facilitate object detection with instance-level segmentation information that is supervised only by bounding-box annotations. Starting from a joint object detection and instance segmentation network, we propose to recursively estimate pseudo ground-truth object masks from the instance-level segmentation network during training, and then enhance the detection network with top-down segmentation feedback. The pseudo ground-truth masks and network parameters are optimized alternately so as to mutually benefit each other. To obtain promising pseudo masks in each iteration, we embed a graphical inference step that incorporates low-level image appearance consistency and the bounding-box annotations to refine the segmentation masks predicted by the segmentation network. Our approach progressively improves object detection performance by incorporating the detailed pixel-wise information learned from the weakly-supervised segmentation network. Extensive evaluation on the detection task of PASCAL VOC 2007 and 2012 [12] verifies that the proposed approach is effective.
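    The box-constrained part of the per-iteration mask refinement can be sketched as below; simple thresholding inside the box is an illustrative simplification of the paper's graphical inference, which additionally exploits low-level appearance consistency. In the full scheme, this refinement would alternate with retraining the joint detection and segmentation network.

```python
import numpy as np

def refine_pseudo_mask(fg_prob, box, thresh=0.5):
    """fg_prob: (H, W) foreground probability from the segmentation branch;
    box: (y0, x0, y1, x1) annotated bounding box. Returns a binary pseudo mask."""
    mask = np.zeros_like(fg_prob, dtype=np.uint8)
    y0, x0, y1, x1 = box
    # Box constraint: the object lies inside its annotated box, so everything
    # outside is background; inside, keep only the confident foreground pixels.
    mask[y0:y1, x0:x1] = (fg_prob[y0:y1, x0:x1] > thresh).astype(np.uint8)
    return mask
```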