85 research outputs found

    Learning to Look Around: Intelligently Exploring Unseen Environments for Unknown Tasks

    Full text link
    It is common to implicitly assume access to intelligently captured inputs (e.g., photos from a human photographer), yet autonomously capturing good observations is itself a major challenge. We address the problem of learning to look around: if a visual agent has the ability to voluntarily acquire new views to observe its environment, how can it learn efficient exploratory behaviors to acquire informative observations? We propose a reinforcement learning solution, where the agent is rewarded for actions that reduce its uncertainty about the unobserved portions of its environment. Based on this principle, we develop a recurrent neural network-based approach to perform active completion of panoramic natural scenes and 3D object shapes. Crucially, the learned policies are not tied to any recognition task nor to the particular semantic content seen during training. As a result, 1) the learned "look around" behavior is relevant even for new tasks in unseen environments, and 2) training data acquisition involves no manual labeling. Through tests in diverse settings, we demonstrate that our approach learns useful generic policies that transfer to new unseen tasks and environments. Completion episodes are shown at https://goo.gl/BgWX3W

    Slow and steady feature analysis: higher order temporal coherence in video

    Full text link
    How can unlabeled video augment visual learning? Existing methods perform "slow" feature analysis, encouraging the representations of temporally close frames to exhibit only small differences. While this standard approach captures the fact that high-level visual signals change slowly over time, it fails to capture *how* the visual content changes. We propose to generalize slow feature analysis to "steady" feature analysis. The key idea is to impose a prior that higher order derivatives in the learned feature space must be small. To this end, we train a convolutional neural network with a regularizer on tuples of sequential frames from unlabeled video. It encourages feature changes over time to be smooth, i.e., similar to the most recent changes. Using five diverse datasets, including unlabeled YouTube and KITTI videos, we demonstrate our method's impact on object, scene, and action recognition tasks. We further show that our features learned from unlabeled video can even surpass a standard heavily supervised pretraining approach.Comment: in Computer Vision and Pattern Recognition (CVPR) 2016, Las Vegas, NV, June 201

    Zero Shot Recognition with Unreliable Attributes

    Full text link
    In principle, zero-shot learning makes it possible to train a recognition model simply by specifying the category's attributes. For example, with classifiers for generic attributes like \emph{striped} and \emph{four-legged}, one can construct a classifier for the zebra category by enumerating which properties it possesses---even without providing zebra training images. In practice, however, the standard zero-shot paradigm suffers because attribute predictions in novel images are hard to get right. We propose a novel random forest approach to train zero-shot models that explicitly accounts for the unreliability of attribute predictions. By leveraging statistics about each attribute's error tendencies, our method obtains more robust discriminative models for the unseen classes. We further devise extensions to handle the few-shot scenario and unreliable attribute descriptions. On three datasets, we demonstrate the benefit for visual category learning with zero or few training examples, a critical domain for rare categories or categories defined on the fly.Comment: NIPS 201

    Causal Confusion in Imitation Learning

    Get PDF
    Behavioral cloning reduces policy learning to supervised learning by training a discriminative model to predict expert actions given observations. Such discriminative models are non-causal: the training procedure is unaware of the causal structure of the interaction between the expert and the environment. We point out that ignoring causality is particularly damaging because of the distributional shift in imitation learning. In particular, it leads to a counter-intuitive "causal misidentification" phenomenon: access to more information can yield worse performance. We investigate how this problem arises, and propose a solution to combat it through targeted interventions---either environment interaction or expert queries---to determine the correct causal model. We show that causal misidentification occurs in several benchmark control domains as well as realistic driving settings, and validate our solution against DAgger and other baselines and ablations.Comment: Published at NeurIPS 2019 9 pages, plus references and appendice

    Can Transformers Capture Spatial Relations between Objects?

    Full text link
    Spatial relationships between objects represent key scene information for humans to understand and interact with the world. To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches exploiting the long-range attention capabilities of transformers for this task, and evaluating key design principles. We identify a simple "RelatiViT" architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings. The code and datasets are available in \url{https://sites.google.com/view/spatial-relation}.Comment: 21 pages, 8 figures, ICLR 202

    Decorrelating Semantic Visual Attributes by Resisting the Urge to Share

    Full text link
    Existing methods to learn visual attributes are prone to learning the wrong thing—namely, properties that are cor-related with the attribute of interest among training sam-ples. Yet, many proposed applications of attributes rely on being able to learn the correct semantic concept corre-sponding to each attribute. We propose to resolve such con-fusions by jointly learning decorrelated, discriminative at-tribute models. Leveraging side information about seman-tic relatedness, we develop a multi-task learning approach that uses structured sparsity to encourage feature competi-tion among unrelated attributes and feature sharing among related attributes. On three challenging datasets, we show that accounting for structure in the visual attribute space is key to learning attribute models that preserve semantics, yielding improved generalizability that helps in the recog-nition and discovery of unseen object categories. 1
    • …
    corecore