43,572 research outputs found
Learning Features by Watching Objects Move
This paper presents a novel yet intuitive approach to unsupervised feature
learning. Inspired by the human visual system, we explore whether low-level
motion-based grouping cues can be used to learn an effective visual
representation. Specifically, we use unsupervised motion-based segmentation on
videos to obtain segments, which we use as 'pseudo ground truth' to train a
convolutional network to segment objects from a single frame. Given the
extensive evidence that motion plays a key role in the development of the human
visual system, we hope that this straightforward approach to unsupervised
learning will be more effective than cleverly designed 'pretext' tasks studied
in the literature. Indeed, our extensive experiments show that this is the
case. When used for transfer learning on object detection, our representation
significantly outperforms previous unsupervised approaches across multiple
settings, especially when training data for the target task is scarce.Comment: CVPR 201
Self-Supervised Relative Depth Learning for Urban Scene Understanding
As an agent moves through the world, the apparent motion of scene elements is
(usually) inversely proportional to their depth. It is natural for a learning
agent to associate image patterns with the magnitude of their displacement over
time: as the agent moves, faraway mountains don't move much; nearby trees move
a lot. This natural relationship between the appearance of objects and their
motion is a rich source of information about the world. In this work, we start
by training a deep network, using fully automatic supervision, to predict
relative scene depth from single images. The relative depth training images are
automatically derived from simple videos of cars moving through a scene, using
recent motion segmentation techniques, and no human-provided labels. This proxy
task of predicting relative depth from a single image induces features in the
network that result in large improvements in a set of downstream tasks
including semantic segmentation, joint road segmentation and car detection, and
monocular (absolute) depth estimation, over a network trained from scratch. The
improvement on the semantic segmentation task is greater than those produced by
any other automatically supervised methods. Moreover, for monocular depth
estimation, our unsupervised pre-training method even outperforms supervised
pre-training with ImageNet. In addition, we demonstrate benefits from learning
to predict (unsupervised) relative depth in the specific videos associated with
various downstream tasks. We adapt to the specific scenes in those tasks in an
unsupervised manner to improve performance. In summary, for semantic
segmentation, we present state-of-the-art results among methods that do not use
supervised pre-training, and we even exceed the performance of supervised
ImageNet pre-trained models for monocular depth estimation, achieving results
that are comparable with state-of-the-art methods
Cross Pixel Optical Flow Similarity for Self-Supervised Learning
We propose a novel method for learning convolutional neural image
representations without manual supervision. We use motion cues in the form of
optical flow, to supervise representations of static images. The obvious
approach of training a network to predict flow from a single image can be
needlessly difficult due to intrinsic ambiguities in this prediction task. We
instead propose a much simpler learning goal: embed pixels such that the
similarity between their embeddings matches that between their optical flow
vectors. At test time, the learned deep network can be used without access to
video or flow information and transferred to tasks such as image
classification, detection, and segmentation. Our method, which significantly
simplifies previous attempts at using motion for self-supervision, achieves
state-of-the-art results in self-supervision using motion cues, competitive
results for self-supervision in general, and is overall state of the art in
self-supervised pretraining for semantic image segmentation, as demonstrated on
standard benchmarks
Multi-Task Video Captioning with Video and Entailment Generation
Video captioning, the task of describing the content of a video, has seen
some promising improvements in recent years with sequence-to-sequence models,
but accurately learning the temporal and logical dynamics involved in the task
still remains a challenge, especially given the lack of sufficient annotated
data. We improve video captioning by sharing knowledge with two related
directed-generation tasks: a temporally-directed unsupervised video prediction
task to learn richer context-aware video encoder representations, and a
logically-directed language entailment generation task to learn better
video-entailed caption decoder representations. For this, we present a
many-to-many multi-task learning model that shares parameters across the
encoders and decoders of the three tasks. We achieve significant improvements
and the new state-of-the-art on several standard video captioning datasets
using diverse automatic and human evaluations. We also show mutual multi-task
improvements on the entailment generation task.Comment: ACL 2017 (14 pages w/ supplementary
- …