Semantic Scene Segmentation with Minimal Labeling Effort
Semantic scene segmentation - the process of assigning a semantic label to every pixel in an input image - is an important task in computer vision wherever an autonomous system or a robot needs to differentiate between the different parts of a scene, and to recognize the class of each, for adequate physical interaction. The most successful methods for this problem are fully-supervised approaches based on Convolutional Neural Networks (CNNs). Unfortunately, these methods require large amounts of training images with pixel-level annotations, which are expensive and time-consuming to obtain. In this thesis, we aim to alleviate the manual effort of annotating real images by designing either weakly-supervised learning strategies that can leverage cheaper image-level annotations, such as image tags, or effective ways to exploit synthetic data, which can be labeled automatically.
In particular, we make several contributions to the literature of semantic scene segmentation with minimal labeling effort. Firstly, we introduce a novel weakly-supervised semantic segmentation technique that addresses the problem with one of the lowest levels of human supervision: image-level tags, which simply indicate which classes are present or absent in an image. The proposed method extracts markedly accurate foreground/background masks from the pre-trained network itself, forgoing external objectness modules and pixel-level or bounding-box annotations, and uses them as priors in an appropriate loss function. Secondly, we improve the performance of this framework by extracting class-specific foreground masks instead of a single generic foreground mask, at virtually no additional annotation cost. Thirdly, we observe that a general limitation of existing tag-based semantic segmentation techniques is the assumption of a single background class in the scene; by relying on object recognition pre-trained networks or objectness modules, these methods are restricted to segmenting foreground objects only. In practical applications, however, such as autonomous navigation, it is often crucial to reason about multiple background classes. We therefore introduce a weakly-supervised video semantic segmentation method that handles scenes with multiple foreground and multiple background classes, by making use of classifier heatmaps. We then develop a two-stream deep architecture that jointly leverages appearance and motion, and design a loss based on the heatmaps to train this network. As our last contribution, we propose a novel technique for exploiting synthetic data that lets us perform semantic segmentation without any manual annotation, not even image-level tags.
Although approaches that utilize synthetic data already exist, we handle synthetic images in a drastically different way that does not require seeing any real images at training time. Our approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently.
All the methods introduced in this thesis are evaluated on standard semantic segmentation datasets comprising both single-background and multiple-background scenes. The experiments in each chapter provide compelling evidence that our approaches outperform the contemporary baselines.
All in all, semantic scene segmentation methods with minimal labeling effort, such as those introduced in this thesis, are crucial for making the annotation process less expensive in terms of both time and money. Moreover, they make large-scale semantic segmentation much more practical than current models relying on full supervision, and lead to solutions that generalize much better than existing ones, thanks to the use of images depicting a great diversity of scenes.
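The classifier-heatmap idea underlying several of these contributions can be sketched in a CAM-like fashion: the last convolutional features of a pre-trained image classifier are combined with its classification weights to obtain per-class heatmaps, which are then normalized and thresholded into a foreground mask for the classes tagged as present. The NumPy sketch below is illustrative only; the function names and the thresholding scheme are assumptions, not the thesis's actual implementation:

```python
import numpy as np

def class_heatmaps(features, weights):
    """CAM-style heatmaps from the last conv features of a classifier.

    features: (C, H, W) feature maps
    weights:  (K, C) classifier weights for K classes
    returns:  (K, H, W) per-class heatmaps
    """
    C, H, W = features.shape
    return (weights @ features.reshape(C, H * W)).reshape(-1, H, W)

def foreground_mask(heatmaps, present, thresh=0.5):
    """Combine the heatmaps of the tagged-present classes into a
    binary foreground mask via per-map normalization and thresholding."""
    maps = heatmaps[present]
    # normalize each heatmap to [0, 1]
    mn = maps.min(axis=(1, 2), keepdims=True)
    mx = maps.max(axis=(1, 2), keepdims=True)
    maps = (maps - mn) / (mx - mn + 1e-8)
    # a pixel is foreground if any present class activates strongly there
    return (maps.max(axis=0) > thresh).astype(np.uint8)
```

Such a mask can then act as a prior in a loss function, encouraging the segmentation network to agree with it where the heatmaps are confident.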
A Weakly Supervised Approach for Estimating Spatial Density Functions from High-Resolution Satellite Imagery
We propose a neural network component, the regional aggregation layer, that
makes it possible to train a pixel-level density estimator using only
coarse-grained density aggregates, which reflect the number of objects in an
image region. Our approach is simple to use and does not require
domain-specific assumptions about the nature of the density function. We
evaluate our approach on several synthetic datasets. In addition, we use this
approach to learn to estimate high-resolution population and housing density
from satellite imagery. In all cases, we find that our approach results in
better density estimates than a commonly used baseline. We also show how our
housing density estimator can be used to classify buildings as residential or
non-residential.
Comment: 10 pages, 8 figures. ACM SIGSPATIAL 2018, Seattle, US
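A minimal sketch of what such a regional aggregation layer could look like: per-pixel density predictions are summed within each labeled region, and the training loss compares these aggregates to the known coarse-grained counts, which are the only supervision assumed. The names and the squared-error loss below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def regional_aggregation(density, regions, num_regions):
    """Sum a predicted per-pixel density map over labeled regions.

    density: (H, W) non-negative per-pixel density predictions
    regions: (H, W) integer region id per pixel
    returns: (num_regions,) aggregated count per region
    """
    return np.bincount(regions.ravel(), weights=density.ravel(),
                       minlength=num_regions)

def aggregate_loss(density, regions, targets):
    """Mean squared error between the region aggregates and the known
    coarse-grained counts (e.g., objects per image region)."""
    agg = regional_aggregation(density, regions, len(targets))
    return np.mean((agg - targets) ** 2)
```

Because the aggregation is a simple sum, its gradient distributes the coarse supervision back to every pixel in a region, which is what allows a pixel-level estimator to be trained from region-level counts.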
A Cross-Season Correspondence Dataset for Robust Semantic Segmentation
In this paper, we present a method to utilize 2D-2D point matches between
images taken during different image conditions to train a convolutional neural
network for semantic segmentation. Enforcing label consistency across the
matches makes the final segmentation algorithm robust to seasonal changes. We
describe how these 2D-2D matches can be generated with little human interaction
by geometrically matching points from 3D models built from images. Two
cross-season correspondence datasets are created providing 2D-2D matches across
seasonal changes as well as from day to night. The datasets are made publicly
available to facilitate further research. We show that adding the
correspondences as extra supervision during training improves the segmentation
performance of the convolutional neural network, making it more robust to
seasonal changes and weather conditions.
Comment: In Proc. CVPR 201
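The label-consistency idea can be sketched as a loss over the matched pixel pairs: the class distributions a segmentation network predicts at two corresponding pixels, taken under different conditions, are pushed toward agreement. The NumPy sketch below uses a symmetrized cross-entropy purely for illustration; the paper's actual loss may differ:

```python
import numpy as np

def correspondence_loss(probs_a, probs_b, matches):
    """Label-consistency loss over 2D-2D matches between two images.

    probs_a, probs_b: (K, H, W) per-pixel class probabilities for the
                      two images (e.g., summer vs. winter)
    matches: list of ((ya, xa), (yb, xb)) matched pixel coordinates
    Penalizes, for each match, the symmetrized cross-entropy between
    the two predicted class distributions.
    """
    eps = 1e-8
    total = 0.0
    for (ya, xa), (yb, xb) in matches:
        pa = probs_a[:, ya, xa]
        pb = probs_b[:, yb, xb]
        total += -np.sum(pa * np.log(pb + eps)) \
                 - np.sum(pb * np.log(pa + eps))
    return total / max(len(matches), 1)
```

Added as an extra term to the usual supervised loss, such a penalty encourages the network to predict the same label for the same 3D point regardless of season or lighting.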
Pseudo Mask Augmented Object Detection
In this work, we present a novel and effective framework that facilitates object
detection with instance-level segmentation information supervised only by
bounding-box annotations. Starting from a joint object detection
and instance segmentation network, we propose to recursively estimate the
pseudo ground-truth object masks from the instance-level object segmentation
network training, and then enhance the detection network with top-down
segmentation feedback. The pseudo ground-truth masks and network parameters are
optimized alternately so as to mutually benefit each other. To obtain promising
pseudo masks in each iteration, we embed a graphical inference step that
incorporates the low-level image appearance consistency and the bounding box
annotations to refine the segmentation masks predicted by the segmentation
network. Our approach progressively improves the object detection performance
by incorporating the detailed pixel-wise information learned from the
weakly-supervised segmentation network. Extensive evaluation on the detection
task on PASCAL VOC 2007 and 2012 [12] verifies that the proposed approach is
effective.
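The alternating optimization can be illustrated with a toy refinement loop: appearance models are estimated from the current pseudo mask, and the pixels inside the bounding box are then reassigned to the closer model. The NumPy sketch below uses mean intensities as a deliberately simplistic stand-in for the paper's graphical inference over low-level appearance consistency; all names are illustrative:

```python
import numpy as np

def refine_pseudo_mask(image, box, iters=5):
    """Toy alternating refinement of a pseudo mask inside a bounding box.

    Alternates between (1) estimating foreground/background appearance
    models (here, just mean intensities) and (2) reassigning box pixels
    to the closer model; pixels outside the box stay background.

    image: (H, W) grayscale intensities
    box:   (y0, x0, y1, x1) bounding-box annotation (exclusive end)
    """
    y0, x0, y1, x1 = box
    inside = np.zeros(image.shape, dtype=bool)
    inside[y0:y1, x0:x1] = True
    mask = inside.copy()               # initialize: everything in the box
    for _ in range(iters):
        fg_mean = image[mask].mean()
        bg = ~mask
        bg_mean = image[bg].mean() if bg.any() else 0.0
        # reassign box pixels to the nearer appearance model
        closer_fg = np.abs(image - fg_mean) < np.abs(image - bg_mean)
        mask = inside & closer_fg
        if not mask.any():             # avoid a degenerate empty mask
            mask = inside.copy()
    return mask.astype(np.uint8)
```

In the full framework, the analogous step refines the segmentation network's predictions into pseudo ground truth, which in turn supervises the next round of network training.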