Cross Pixel Optical Flow Similarity for Self-Supervised Learning
We propose a novel method for learning convolutional neural image
representations without manual supervision. We use motion cues, in the form of
optical flow, to supervise representations of static images. The obvious
approach of training a network to predict flow from a single image can be
needlessly difficult due to intrinsic ambiguities in this prediction task. We
instead propose a much simpler learning goal: embed pixels such that the
similarity between their embeddings matches that between their optical flow
vectors. At test time, the learned deep network can be used without access to
video or flow information and transferred to tasks such as image
classification, detection, and segmentation. Our method, which significantly
simplifies previous attempts at using motion for self-supervision, achieves
state-of-the-art results in self-supervision using motion cues, competitive
results for self-supervision in general, and is overall state of the art in
self-supervised pretraining for semantic image segmentation, as demonstrated on
standard benchmarks.
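The learning goal described in this abstract (make the similarity between pixel embeddings match the similarity between their optical flow vectors) can be illustrated with a small sketch. The abstract does not state which similarity kernels or loss the authors use, so everything below is an assumption chosen for illustration: cosine similarity for embeddings, negative squared distance for flow, a softmax over the other pixels, and a cross-entropy between the two resulting distributions. The function names are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    # Cosine similarity (assumed kernel for embeddings).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def neg_sq_dist(u, v):
    # Negative squared Euclidean distance (assumed kernel for flow).
    return -sum((a - b) ** 2 for a, b in zip(u, v))

def similarity_rows(vectors, kernel):
    # For each pixel i, a distribution over all other pixels j != i.
    n = len(vectors)
    rows = []
    for i in range(n):
        scores = [kernel(vectors[i], vectors[j]) for j in range(n) if j != i]
        rows.append(softmax(scores))
    return rows

def similarity_matching_loss(embeddings, flows):
    # Mean cross-entropy between flow-based and embedding-based similarities.
    p_flow = similarity_rows(flows, neg_sq_dist)   # target, derived from motion
    p_emb = similarity_rows(embeddings, cosine)    # prediction, from the image
    loss = 0.0
    for t_row, e_row in zip(p_flow, p_emb):
        loss -= sum(t * math.log(e + 1e-12) for t, e in zip(t_row, e_row))
    return loss / len(embeddings)

# Four pixels: two move right, two move left.
flows = [(1.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (-1.0, 0.0)]
aligned = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
mismatched = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
print(similarity_matching_loss(aligned, flows) < similarity_matching_loss(mismatched, flows))
```

In this toy setting, embeddings that group pixels the same way the flow does receive the lower loss, and flow is needed only to compute the training target, not to produce the embeddings.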
Learning Features by Watching Objects Move
This paper presents a novel yet intuitive approach to unsupervised feature
learning. Inspired by the human visual system, we explore whether low-level
motion-based grouping cues can be used to learn an effective visual
representation. Specifically, we use unsupervised motion-based segmentation on
videos to obtain segments, which we use as 'pseudo ground truth' to train a
convolutional network to segment objects from a single frame. Given the
extensive evidence that motion plays a key role in the development of the human
visual system, we hope that this straightforward approach to unsupervised
learning will be more effective than cleverly designed 'pretext' tasks studied
in the literature. Indeed, our extensive experiments show that this is the
case. When used for transfer learning on object detection, our representation
significantly outperforms previous unsupervised approaches across multiple
settings, especially when training data for the target task is scarce.
Comment: CVPR 201
Unsupervised Segmentation in Real-World Images via Spelke Object Inference
Self-supervised, category-agnostic segmentation of real-world images is a
challenging open problem in computer vision. Here, we show how to learn static
grouping priors from motion self-supervision by building on the cognitive
science concept of a Spelke Object: a set of physical stuff that moves
together. We introduce the Excitatory-Inhibitory Segment Extraction Network
(EISEN), which learns to extract pairwise affinity graphs for static scenes
from motion-based training signals. EISEN then produces segments from
affinities using a novel graph propagation and competition network. During
training, objects that undergo correlated motion (such as robot arms and the
objects they move) are decoupled by a bootstrapping process: EISEN explains
away the motion of objects it has already learned to segment. We show that
EISEN achieves a substantial improvement in the state of the art for
self-supervised image segmentation on challenging synthetic and real-world
robotics datasets.
Comment: 25 pages, 10 figures
Semi-Supervised First-Person Activity Recognition in Body-Worn Video
Body-worn cameras are now commonly used for logging daily life, sports, and
law enforcement activities, creating a large volume of archived footage. This
paper studies the problem of classifying frames of footage according to the
activity of the camera-wearer with an emphasis on application to real-world
police body-worn video. Real-world datasets pose a different set of challenges
from existing egocentric vision datasets: the amount of footage of different
activities is unbalanced, the data contains personally identifiable
information, and in practice it is difficult to provide substantial training
footage for a supervised approach. We address these challenges by extracting
features based exclusively on motion information and then segmenting the video
footage using a semi-supervised classification algorithm. On publicly available
datasets, our method achieves results comparable to, if not better than,
supervised and/or deep learning methods using a fraction of the training data.
It also shows promising results on real-world police body-worn video.
LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and Bootstrapped Self-training
Learning object segmentation in image and video datasets without human
supervision is a challenging problem. Humans easily identify moving salient
objects in videos using the gestalt principle of common fate, which suggests
that what moves together belongs together. Building upon this idea, we propose
a self-supervised object discovery approach that leverages motion and
appearance information to produce high-quality object segmentation masks.
Specifically, we redesign the traditional graph cut on images to include motion
information in a linear combination with appearance information to produce edge
weights. Remarkably, this step produces object segmentation masks comparable to
the current state of the art on multiple benchmarks. To further improve
performance, we bootstrap a segmentation network trained on these preliminary
masks as pseudo-ground truths to learn from its own outputs via self-training.
We demonstrate the effectiveness of our approach, named LOCATE, on multiple
standard video object segmentation, image saliency detection, and object
segmentation benchmarks, achieving results on par with and, in many cases,
surpassing state-of-the-art methods. We also demonstrate the transferability of
our approach to novel domains through a qualitative study on in-the-wild
images. Additionally, we present extensive ablation analysis to support our
design choices and highlight the contribution of each component of our proposed
method.
Comment: Accepted to the British Machine Vision Conference (BMVC) 202
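The flow-guided graph-cut step in the LOCATE abstract, where edge weights are a linear combination of motion and appearance information, can be sketched on a toy graph. The abstract does not give the similarity measures, the mixing coefficient, or the cut solver, so the Gaussian kernel, the `alpha` parameter, and the brute-force normalized cut below are all assumptions for illustration, and the function names are hypothetical.

```python
import math
from itertools import product

def gaussian_sim(u, v):
    # Similarity in (0, 1] from squared distance (assumed kernel).
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)))

def edge_weights(appearance, flow, alpha=0.5):
    # Edge weight = alpha * appearance similarity + (1 - alpha) * flow similarity.
    n = len(appearance)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = (alpha * gaussian_sim(appearance[i], appearance[j])
                           + (1 - alpha) * gaussian_sim(flow[i], flow[j]))
    return w

def best_bipartition(weights):
    # Brute-force normalized cut; only feasible for a handful of nodes.
    n = len(weights)
    best_labels, best_cost = None, math.inf
    for bits in product([0, 1], repeat=n - 1):
        labels = (0,) + bits          # fix node 0 to skip mirrored partitions
        if sum(labels) == 0:
            continue                  # skip the partition with an empty group
        cut = sum(weights[i][j] for i in range(n) for j in range(n)
                  if labels[i] == 0 and labels[j] == 1)
        assoc_a = sum(sum(weights[i]) for i in range(n) if labels[i] == 0)
        assoc_b = sum(sum(weights[i]) for i in range(n) if labels[i] == 1)
        cost = cut / (assoc_a + 1e-12) + cut / (assoc_b + 1e-12)
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

# Four nodes that look identical; only motion separates them.
appearance = [(0.5,), (0.5,), (0.5,), (0.5,)]
flow = [(1.0, 0.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(best_bipartition(edge_weights(appearance, flow, alpha=0.5)))
```

In this example appearance alone cannot separate the nodes, so the motion term in the edge weights is what makes the moving pair the cheapest group to cut away. A practical system would replace the exhaustive search with a scalable solver.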