
    Semantic Scene Segmentation with Minimal Labeling Effort

    Semantic scene segmentation, the process of assigning a semantic label to every pixel of an input image, is an important task in computer vision: an autonomous system or robot must differentiate between the different parts of a scene, recognize the class of each, and use this understanding for adequate physical interaction. The most successful methods addressing this problem are fully-supervised approaches based on Convolutional Neural Networks (CNNs). Unfortunately, these methods require large amounts of training images with pixel-level annotations, which are expensive and time-consuming to obtain. In this thesis, we aim to alleviate the manual effort of annotating real images by designing either weakly-supervised learning strategies that can leverage image-level annotations, such as image tags, which are much cheaper to obtain, or effective ways to exploit synthetic data, which can be labeled automatically. In particular, we make several contributions to the literature on semantic scene segmentation with minimal labeling effort.

    Firstly, we introduce a novel weakly-supervised semantic segmentation technique that relies on one of the weakest levels of human supervision, image-level tags, which simply indicate which classes are present or absent in an image. The proposed method extracts markedly accurate foreground/background masks from the pre-trained network itself, forgoing external objectness modules and pixel-level or bounding-box annotations, and uses these masks as priors in an appropriate loss function. Secondly, we improve the performance of this framework by extracting class-specific foreground masks instead of a single generic foreground mask, at virtually no additional annotation cost.

    Thirdly, we identify a general limitation of existing tag-based semantic segmentation techniques: by relying on object-recognition pre-trained networks or objectness modules, they assume a single background class in the scene, which restricts them to segmenting foreground objects only. In practical applications such as autonomous navigation, however, it is often crucial to reason about multiple background classes. We therefore introduce a weakly-supervised video semantic segmentation method that handles multiple foreground and multiple background classes. To this end, we propose an approach based on classifier heatmaps: we develop a two-stream deep architecture that jointly leverages appearance and motion, and design a loss based on these heatmaps to train the network.

    In our last contribution, we propose a novel technique for exploiting synthetic data that lets us perform semantic segmentation without any manual annotation, not even image-level tags. Although approaches that utilize synthetic data exist, ours handles synthetic images in a drastically different way and does not require seeing any real images at training time. This approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and should therefore be treated differently.

    All the methods introduced in this thesis are evaluated on standard semantic segmentation datasets comprising both single-background and multiple-background scenes. The experiments in each chapter provide compelling evidence that our approaches outperform contemporary baselines.

    All in all, semantic scene segmentation methods with minimal labeling effort, such as those introduced in this thesis, are crucial for reducing the time and monetary cost of the annotation process. They also make large-scale semantic segmentation far more practical than current fully-supervised models, and lead to solutions that generalize better than existing ones, thanks to the use of images depicting a great diversity of scenes.
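    To make the classifier-heatmap idea above more concrete, here is a minimal sketch, not the thesis' exact architecture, of how class-specific heatmaps can be obtained from a network trained with image tags only (a CAM-style design: a 1x1 scoring layer followed by global average pooling). The backbone choice and layer sizes are illustrative assumptions; such per-class maps can then act as pixel-level cues for both foreground and background classes when training a segmentation network.

    # Hedged sketch, not the thesis' exact method: class-specific heatmaps
    # obtained CAM-style from a classifier trained with image tags only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HeatmapClassifier(nn.Module):
        def __init__(self, backbone, num_classes, feat_channels):
            super().__init__()
            self.backbone = backbone                   # any fully convolutional feature extractor
            self.score = nn.Conv2d(feat_channels, num_classes, kernel_size=1)

        def forward(self, x):
            feats = self.backbone(x)                   # (B, feat_channels, h, w)
            heatmaps = self.score(feats)               # (B, num_classes, h, w) per-class maps
            logits = F.adaptive_avg_pool2d(heatmaps, 1).flatten(1)  # image-level class scores
            return logits, heatmaps

    # Illustrative usage with a truncated ResNet-18 backbone (an assumption):
    #   import torchvision
    #   res = torchvision.models.resnet18(pretrained=True)
    #   backbone = nn.Sequential(*list(res.children())[:-2])   # 512-channel conv features
    #   model = HeatmapClassifier(backbone, num_classes=21, feat_channels=512)
    # The classifier is trained with a multi-label tag loss, e.g.
    #   loss = F.binary_cross_entropy_with_logits(logits, tags)
    # and the upsampled heatmaps then serve as pixel-level cues for segmentation.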

    Built-in Foreground/Background Prior for Weakly-Supervised Semantic Segmentation

    Pixel-level annotations are expensive and time-consuming to obtain. Hence, weak supervision using only image tags could have a significant impact on semantic segmentation. Recently, CNN-based methods have proposed to fine-tune pre-trained networks using image tags. Without additional information, this leads to poor localization accuracy. This problem, however, was alleviated by making use of objectness priors to generate foreground/background masks. Unfortunately, these priors either require training with pixel-level annotations or bounding boxes, or still yield inaccurate object boundaries. Here, we propose a novel method to extract markedly more accurate masks from the pre-trained network itself, forgoing external objectness modules. This is accomplished using the activations of the higher-level convolutional layers, smoothed by a dense CRF. We demonstrate that our method, based on these masks and a weakly-supervised loss, outperforms the state-of-the-art tag-based weakly-supervised semantic segmentation techniques. Furthermore, we introduce a new form of inexpensive weak supervision yielding an additional accuracy boost.
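    The following is a minimal sketch, assuming an ImageNet-pretrained VGG-16 from torchvision and the pydensecrf package, of how a foreground/background mask might be derived from high-level convolutional activations and then smoothed by a dense CRF, in the spirit of the method above. The layer choice, channel averaging, and CRF parameters are illustrative assumptions rather than the paper's exact settings.

    # Hedged sketch (not the authors' exact pipeline): derive a foreground
    # probability map from high-level convolutional activations of an
    # ImageNet-pretrained VGG-16, then refine it with a dense CRF.
    import numpy as np
    import torch
    import torchvision.transforms as T
    from torchvision.models import vgg16
    import pydensecrf.densecrf as dcrf
    from pydensecrf.utils import unary_from_softmax

    def foreground_mask(image_np):                       # image_np: HxWx3 uint8 RGB
        model = vgg16(pretrained=True).features.eval()
        x = T.Compose([T.ToTensor(),
                       T.Normalize([0.485, 0.456, 0.406],
                                   [0.229, 0.224, 0.225])])(image_np).unsqueeze(0)
        with torch.no_grad():
            act = model[:24](x)                          # activations of a high-level conv block (assumed layer)
        heat = act.mean(dim=1, keepdim=True)             # average over channels -> coarse saliency map
        heat = torch.nn.functional.interpolate(
            heat, size=image_np.shape[:2], mode='bilinear', align_corners=False)[0, 0]
        fg = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]
        probs = np.stack([1.0 - fg.numpy(), fg.numpy()]).astype(np.float32)  # bg/fg "softmax"

        # Smooth the coarse mask with a fully connected CRF (pydensecrf).
        h, w = image_np.shape[:2]
        crf = dcrf.DenseCRF2D(w, h, 2)
        crf.setUnaryEnergy(unary_from_softmax(probs))
        crf.addPairwiseGaussian(sxy=3, compat=3)
        crf.addPairwiseBilateral(sxy=60, srgb=10, rgbim=np.ascontiguousarray(image_np), compat=5)
        q = crf.inference(5)
        return np.argmax(q, axis=0).reshape(h, w)        # 1 = foreground, 0 = background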

    Incorporating Network Built-in Priors in Weakly-Supervised Semantic Segmentation

    Pixel-level annotations are expensive and time-consuming to obtain. Hence, weak supervision using only image tags could have a significant impact on semantic segmentation. Recently, CNN-based methods have proposed to fine-tune pre-trained networks using image tags. Without additional information, this leads to poor localization accuracy. This problem, however, was alleviated by making use of objectness priors to generate foreground/background masks. Unfortunately, these priors either require pixel-level annotations or bounding boxes, or still yield inaccurate object boundaries. Here, we propose a novel method to extract accurate masks from networks pre-trained for the task of object recognition, thus forgoing external objectness modules. We first show how foreground/background masks can be obtained from the activations of the higher-level convolutional layers of such a network. We then show how to obtain multi-class masks by fusing these foreground/background masks with information extracted from a weakly-supervised localization network. Our experiments evidence that exploiting these masks in conjunction with a weakly-supervised training loss yields state-of-the-art tag-based weakly-supervised semantic segmentation results.
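    As an illustration of how such masks can be exploited in conjunction with a weakly-supervised training loss, the sketch below, assuming a PyTorch segmentation network, combines three hedged terms: background pixels (per the mask prior) are pulled towards the background class, foreground pixels are pushed towards the classes tagged as present, and classes tagged as absent are suppressed everywhere. This is an illustrative formulation, not the paper's exact loss.

    # Hedged sketch of a mask-guided, tag-based training loss, not the paper's
    # exact formulation. fg_mask is the built-in foreground/background prior
    # (1 = foreground), tags are the image-level labels; class 0 is background.
    import torch
    import torch.nn.functional as F

    def weak_seg_loss(logits, fg_mask, tags):
        # logits: (B, C, H, W) raw scores; fg_mask: (B, H, W) in {0, 1}; tags: (B, C) in {0, 1}
        log_probs = F.log_softmax(logits, dim=1)

        # Background pixels (per the prior) should be labelled as the background class.
        bg_loss = -(log_probs[:, 0] * (1 - fg_mask)).sum() / ((1 - fg_mask).sum() + 1e-6)

        # Foreground pixels may take any class tagged as present; maximise the
        # summed probability of present foreground classes inside the mask.
        present = tags.clone()
        present[:, 0] = 0                                        # background handled above
        fg_prob = (log_probs.exp() * present[:, :, None, None]).sum(dim=1).clamp_min(1e-6)
        fg_loss = -(fg_prob.log() * fg_mask).sum() / (fg_mask.sum() + 1e-6)

        # Suppress classes whose tag says they are absent, everywhere in the image.
        absent = 1 - tags
        absent_prob = (log_probs.exp() * absent[:, :, None, None]).sum(dim=1)
        suppress_loss = absent_prob.mean()

        return bg_loss + fg_loss + suppress_loss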