Learning to Look Around: Intelligently Exploring Unseen Environments for Unknown Tasks
It is common to implicitly assume access to intelligently captured inputs
(e.g., photos from a human photographer), yet autonomously capturing good
observations is itself a major challenge. We address the problem of learning to
look around: if a visual agent has the ability to voluntarily acquire new views
to observe its environment, how can it learn efficient exploratory behaviors to
acquire informative observations? We propose a reinforcement learning solution,
where the agent is rewarded for actions that reduce its uncertainty about the
unobserved portions of its environment. Based on this principle, we develop a
recurrent neural network-based approach to perform active completion of
panoramic natural scenes and 3D object shapes. Crucially, the learned policies
are not tied to any recognition task nor to the particular semantic content
seen during training. As a result, 1) the learned "look around" behavior is
relevant even for new tasks in unseen environments, and 2) training data
acquisition involves no manual labeling. Through tests in diverse settings, we
demonstrate that our approach learns useful generic policies that transfer to
new unseen tasks and environments. Completion episodes are shown at
https://goo.gl/BgWX3W
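To make the reward principle concrete, here is a minimal Python sketch of an uncertainty-reduction reward, assuming a hypothetical completion_model that predicts the full scene (e.g., the panorama) from the views gathered so far; it illustrates the idea only and is not the paper's exact formulation or architecture.

```python
import numpy as np

def uncertainty_reduction_reward(completion_model, observed_views, true_scene, prev_error):
    """Reward the agent for the drop in reconstruction error of the scene
    after acquiring a new view (hypothetical sketch of the reward principle).

    completion_model: callable mapping the views gathered so far to a
                      prediction of the full scene (e.g., the full panorama).
    """
    predicted_scene = completion_model(observed_views)
    error = np.mean((predicted_scene - true_scene) ** 2)  # per-pixel MSE
    reward = prev_error - error  # positive if the new view reduced uncertainty
    return reward, error
```

In this sketch the reward depends only on how much the new observation improves completion of the unobserved portions, not on any recognition label, which is what makes the learned policy task-agnostic.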
Slow and steady feature analysis: higher order temporal coherence in video
How can unlabeled video augment visual learning? Existing methods perform
"slow" feature analysis, encouraging the representations of temporally close
frames to exhibit only small differences. While this standard approach captures
the fact that high-level visual signals change slowly over time, it fails to
capture *how* the visual content changes. We propose to generalize slow feature
analysis to "steady" feature analysis. The key idea is to impose a prior that
higher order derivatives in the learned feature space must be small. To this
end, we train a convolutional neural network with a regularizer on tuples of
sequential frames from unlabeled video. It encourages feature changes over time
to be smooth, i.e., similar to the most recent changes. Using five diverse
datasets, including unlabeled YouTube and KITTI videos, we demonstrate our
method's impact on object, scene, and action recognition tasks. We further show
that our features learned from unlabeled video can even surpass a standard
heavily supervised pretraining approach.Comment: in Computer Vision and Pattern Recognition (CVPR) 2016, Las Vegas,
NV, June 201
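The "steady" prior admits a short sketch. Below is an illustrative PyTorch regularizer over features of three sequential frames; the paper's actual objective also uses contrastive terms with non-sequential tuples, which are omitted here.

```python
import torch

def slow_and_steady_loss(feat_a, feat_b, feat_c, slow_weight=1.0, steady_weight=1.0):
    """Slowness + steadiness regularizer on CNN features z(t), z(t+1), z(t+2).

    Illustrative sketch only: slowness penalizes the first temporal difference,
    steadiness penalizes the second-order difference (the change of the change).
    """
    d1 = feat_b - feat_a                                   # first difference
    d2 = feat_c - feat_b                                   # next first difference
    slow = (d1.pow(2).sum(dim=1) + d2.pow(2).sum(dim=1)).mean()
    steady = (d2 - d1).pow(2).sum(dim=1).mean()            # higher-order term
    return slow_weight * slow + steady_weight * steady
```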
Zero-Shot Recognition with Unreliable Attributes
In principle, zero-shot learning makes it possible to train a recognition
model simply by specifying the category's attributes. For example, with
classifiers for generic attributes like \emph{striped} and \emph{four-legged},
one can construct a classifier for the zebra category by enumerating which
properties it possesses---even without providing zebra training images. In
practice, however, the standard zero-shot paradigm suffers because attribute
predictions in novel images are hard to get right. We propose a novel random
forest approach to train zero-shot models that explicitly accounts for the
unreliability of attribute predictions. By leveraging statistics about each
attribute's error tendencies, our method obtains more robust discriminative
models for the unseen classes. We further devise extensions to handle the
few-shot scenario and unreliable attribute descriptions. On three datasets, we
demonstrate the benefit for visual category learning with zero or few training
examples, a critical domain for rare categories or categories defined on the
fly.
Comment: NIPS 2014
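The zero-shot construction itself is simple to sketch. The snippet below scores unseen classes from predicted attribute probabilities, with a per-attribute reliability weight standing in for the paper's richer random-forest treatment of attribute error tendencies; the names and the weighting scheme are illustrative assumptions.

```python
import numpy as np

def zero_shot_class_scores(attr_probs, class_signatures, attr_reliability):
    """Score unseen classes from noisy attribute predictions (illustrative).

    attr_probs:       (num_attrs,) predicted P(attribute present) for an image
    class_signatures: (num_classes, num_attrs) binary class-attribute definitions
    attr_reliability: (num_attrs,) in [0, 1], e.g., validation accuracy of each
                      attribute classifier, used to down-weight unreliable ones
    """
    eps = 1e-8
    log_match = (class_signatures * np.log(attr_probs + eps)
                 + (1 - class_signatures) * np.log(1 - attr_probs + eps))
    return (attr_reliability * log_match).sum(axis=1)  # higher = better match
```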
Embodied learning for visual recognition
The field of visual recognition in recent years has come to rely on large, expensively curated and manually labeled "bags of disembodied images". In the wake of this, my focus has been on understanding and exploiting alternate "free" sources of supervision available to visual learning agents that are situated within real environments. For example, even simply moving from orderless image collections to continuous visual observations offers opportunities to understand the dynamics and other physical properties of the visual world. Further, embodied agents may have the ability to move around their environment and/or effect changes within it, in which case these abilities offer new means to acquire useful supervision. In this dissertation, I present my work along this and related directions.
Electrical and Computer Engineering
Causal Confusion in Imitation Learning
Behavioral cloning reduces policy learning to supervised learning by training
a discriminative model to predict expert actions given observations. Such
discriminative models are non-causal: the training procedure is unaware of the
causal structure of the interaction between the expert and the environment. We
point out that ignoring causality is particularly damaging because of the
distributional shift in imitation learning. In particular, it leads to a
counter-intuitive "causal misidentification" phenomenon: access to more
information can yield worse performance. We investigate how this problem
arises, and propose a solution to combat it through targeted
interventions---either environment interaction or expert queries---to determine
the correct causal model. We show that causal misidentification occurs in
several benchmark control domains as well as realistic driving settings, and
validate our solution against DAgger and other baselines and ablations.
Comment: Published at NeurIPS 2019; 9 pages, plus references and appendices
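For reference, behavioral cloning itself is just a supervised update, as in the hedged sketch below (hypothetical names, discrete actions); the point of the paper is that nothing in this objective tells the model which observation features actually cause the expert's action.

```python
import torch
import torch.nn as nn

def behavioral_cloning_step(policy, optimizer, observations, expert_actions):
    """One supervised step: fit pi(a | o) to expert actions (sketch only).

    The model only ever sees (observation, action) pairs, so features that
    merely correlate with the expert's action can be picked up as if they
    were causes -- the root of causal misidentification.
    """
    logits = policy(observations)                       # (batch, num_actions)
    loss = nn.functional.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```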
Can Transformers Capture Spatial Relations between Objects?
Spatial relationships between objects represent key scene information for
humans to understand and interact with the world. To study the capability of
current computer vision systems to recognize physically grounded spatial
relations, we start by proposing precise relation definitions that permit
consistently annotating a benchmark dataset. Despite the apparent simplicity of
this task relative to others in the recognition literature, we observe that
existing approaches perform poorly on this benchmark. We propose new approaches
exploiting the long-range attention capabilities of transformers for this
task, and evaluate key design principles. We identify a simple "RelatiViT"
architecture and demonstrate that it outperforms all current approaches. To our
knowledge, this is the first method to convincingly outperform naive baselines
on spatial relation prediction in in-the-wild settings. The code and datasets
are available at \url{https://sites.google.com/view/spatial-relation}.
Comment: 21 pages, 8 figures, ICLR 2024
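The abstract does not spell out the architecture, but the general recipe it points to can be sketched as follows: pool ViT patch tokens inside the subject and object boxes and classify the relation from the pair. This is an assumption-laden illustration, not the actual RelatiViT design.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Illustrative relation classifier over ViT patch tokens (not RelatiViT)."""

    def __init__(self, token_dim=768, num_relations=9):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * token_dim, 512), nn.ReLU(),
            nn.Linear(512, num_relations))

    def forward(self, patch_tokens, subj_mask, obj_mask):
        # patch_tokens: (batch, num_patches, token_dim) from a ViT backbone
        # subj_mask / obj_mask: (batch, num_patches) booleans marking patches
        # that fall inside the subject / object bounding boxes
        def pool(mask):
            w = mask.float().unsqueeze(-1)
            return (patch_tokens * w).sum(1) / w.sum(1).clamp(min=1.0)
        pair = torch.cat([pool(subj_mask), pool(obj_mask)], dim=-1)
        return self.classifier(pair)  # logits over relation labels
```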
Decorrelating Semantic Visual Attributes by Resisting the Urge to Share
Existing methods to learn visual attributes are prone to learning the wrong thing, namely properties that are correlated with the attribute of interest among training samples. Yet, many proposed applications of attributes rely on being able to learn the correct semantic concept corresponding to each attribute. We propose to resolve such confusions by jointly learning decorrelated, discriminative attribute models. Leveraging side information about semantic relatedness, we develop a multi-task learning approach that uses structured sparsity to encourage feature competition among unrelated attributes and feature sharing among related attributes. On three challenging datasets, we show that accounting for structure in the visual attribute space is key to learning attribute models that preserve semantics, yielding improved generalizability that helps in the recognition and discovery of unseen object categories.
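The structured-sparsity idea can be sketched with a simple group penalty: within a semantic group, attributes may share feature dimensions freely, while across groups each feature is pushed to commit to only a few. The snippet below is an illustrative penalty under that assumption, not the paper's exact multi-task objective.

```python
import numpy as np

def group_structured_sparsity(W, groups):
    """In-group l2, across-group l1 penalty on attribute weights (sketch).

    W:      (num_features, num_attributes) linear attribute models, one column
            per attribute.
    groups: list of lists of attribute indices; attributes in the same semantic
            group may share features, unrelated groups compete for them.
    """
    penalty = 0.0
    for g in groups:
        # l2 over each feature's weights within the group, summed over features:
        # a feature is cheap to reuse inside a group, costly across many groups.
        penalty += np.sqrt((W[:, g] ** 2).sum(axis=1)).sum()
    return penalty
```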