19,366 research outputs found
Emergence of Object Segmentation in Perturbed Generative Models
We introduce a novel framework to build a model that can learn how to segment
objects from a collection of images without any human annotation. Our method
builds on the observation that the location of object segments can be perturbed
locally relative to a given background without affecting the realism of a
scene. Our approach is to first train a generative model of a layered scene.
The layered representation consists of a background image, a foreground image
and the mask of the foreground. A composite image is then obtained by
overlaying the masked foreground image onto the background. The generative
model is trained in an adversarial fashion against a discriminator, which
forces the generative model to produce realistic composite images. To force the
generator to learn a representation where the foreground layer corresponds to
an object, we perturb the output of the generative model by introducing a
random shift of both the foreground image and mask relative to the background.
Because the generator is unaware of the shift before computing its output, it
must produce layered representations that are realistic for any such random
perturbation. Finally, we learn to segment an image by defining an autoencoder
consisting of an encoder, which we train, and the pre-trained generator as the
decoder, which we freeze. The encoder maps an image to a feature vector, which
is fed as input to the generator to give a composite image matching the
original input image. Because the generator outputs an explicit layered
representation of the scene, the encoder learns to detect and segment objects.
We demonstrate this framework on real images of several object categories.Comment: 33rd Conference on Neural Information Processing Systems (NeurIPS
2019), Spotlight presentatio
A robust motion estimation and segmentation approach to represent moving images with layers
The paper provides a robust representation of moving images based on layers. To that goal, we have designed efficient motion estimation and segmentation techniques by affine model fitting suitable for the construction of layers. Layered representations, originally introduced by Wang and Adelson (see IEEE Transactions on Image Processing, vol.3, no.5, p.625-38, 1994) are important in several applications. In particular they are very appropriate for object tracking, object manipulation and content-based scalability which are among the main functionalities of the future MPEG-4 standard. In addition a variety of examples are provided that give a deep insight into the performance bounds of the representation of moving images using layers.Peer ReviewedPostprint (published version
Unsupervised Discovery of Parts, Structure, and Dynamics
Humans easily recognize object parts and their hierarchical structure by
watching how they move; they can then predict how each part moves in the
future. In this paper, we propose a novel formulation that simultaneously
learns a hierarchical, disentangled object representation and a dynamics model
for object parts from unlabeled videos. Our Parts, Structure, and Dynamics
(PSD) model learns to, first, recognize the object parts via a layered image
representation; second, predict hierarchy via a structural descriptor that
composes low-level concepts into a hierarchical structure; and third, model the
system dynamics by predicting the future. Experiments on multiple real and
synthetic datasets demonstrate that our PSD model works well on all three
tasks: segmenting object parts, building their hierarchical structure, and
capturing their motion distributions.Comment: ICLR 2019. The first two authors contributed equally to this wor
Cross Modal Distillation for Supervision Transfer
In this work we propose a technique that transfers supervision between images
from different modalities. We use learned representations from a large labeled
modality as a supervisory signal for training representations for a new
unlabeled paired modality. Our method enables learning of rich representations
for unlabeled modalities and can be used as a pre-training procedure for new
modalities with limited labeled data. We show experimental results where we
transfer supervision from labeled RGB images to unlabeled depth and optical
flow images and demonstrate large improvements for both these cross modal
supervision transfers. Code, data and pre-trained models are available at
https://github.com/s-gupta/fast-rcnn/tree/distillationComment: Updated version (v2) contains additional experiments and result
Layered Interpretation of Street View Images
We propose a layered street view model to encode both depth and semantic
information on street view images for autonomous driving. Recently, stixels,
stix-mantics, and tiered scene labeling methods have been proposed to model
street view images. We propose a 4-layer street view model, a compact
representation over the recently proposed stix-mantics model. Our layers encode
semantic classes like ground, pedestrians, vehicles, buildings, and sky in
addition to the depths. The only input to our algorithm is a pair of stereo
images. We use a deep neural network to extract the appearance features for
semantic classes. We use a simple and an efficient inference algorithm to
jointly estimate both semantic classes and layered depth values. Our method
outperforms other competing approaches in Daimler urban scene segmentation
dataset. Our algorithm is massively parallelizable, allowing a GPU
implementation with a processing speed about 9 fps.Comment: The paper will be presented in the 2015 Robotics: Science and Systems
Conference (RSS
- …