2,170 research outputs found

    Learning Models for Following Natural Language Directions in Unknown Environments

    Natural language offers an intuitive and flexible means for humans to communicate with the robots that we will increasingly work alongside in our homes and workplaces. Recent advancements have given rise to robots that are able to interpret natural language manipulation and navigation commands, but these methods require a prior map of the robot's environment. In this paper, we propose a novel learning framework that enables robots to successfully follow natural language route directions without any previous knowledge of the environment. The algorithm utilizes spatial and semantic information that the human conveys through the command to learn a distribution over the metric and semantic properties of spatially extended environments. Our method uses this distribution in place of the latent world model and interprets the natural language instruction as a distribution over the intended behavior. A novel belief space planner reasons directly over the map and behavior distributions to solve for a policy using imitation learning. We evaluate our framework on a voice-commandable wheelchair. The results demonstrate that by learning and performing inference over a latent environment model, the algorithm is able to successfully follow natural language route directions within novel, extended environments. Comment: ICRA 201

    Learning with Latent Language

    The named concepts and compositional operators present in natural language provide a rich source of information about the kinds of abstractions humans use to navigate the world. Can this linguistic background knowledge improve the generality and efficiency of learned classifiers and control policies? This paper aims to show that using the space of natural language strings as a parameter space is an effective way to capture natural task structure. In a pretraining phase, we learn a language interpretation model that transforms inputs (e.g. images) into outputs (e.g. labels) given natural language descriptions. To learn a new concept (e.g. a classifier), we search directly in the space of descriptions to minimize the interpreter's loss on training examples. Crucially, our models do not require language data to learn these concepts: language is used only in pretraining to impose structure on subsequent learning. Results on image classification, text editing, and reinforcement learning show that, in all settings, models with a linguistic parameterization outperform those without.

    Weakly-supervised learning of visual relations

    This paper introduces a novel approach for modeling visual relations between pairs of objects. We define a relation as a triplet of the form (subject, predicate, object), where the predicate is typically a preposition (e.g. 'under', 'in front of') or a verb ('hold', 'ride') that links a pair of objects (subject, object). Learning such relations is challenging because the objects have different spatial configurations and appearances depending on the relation in which they occur. Another major challenge is the difficulty of obtaining annotations, especially at box level, for all possible triplets, which makes both learning and evaluation difficult. The contributions of this paper are threefold. First, we design strong yet flexible visual features that encode the appearance and spatial configuration for pairs of objects. Second, we propose a weakly-supervised discriminative clustering model to learn relations from image-level labels only. Third, we introduce a new challenging dataset of unusual relations (UnRel), together with an exhaustive annotation, that enables accurate evaluation of visual relation retrieval. We show experimentally that our model achieves state-of-the-art results on the visual relationship dataset, significantly improving performance on previously unseen relations (zero-shot learning), and confirm this observation on our newly introduced UnRel dataset.

    Pushing the limits of Visual Grounding: Pre-training on large synthetic datasets

    Visual Grounding is a crucial computer vision task requiring a deep understanding of data semantics. Leveraging the transformative trend of training controllable generative models, the research aims to demonstrate the substantial improvement of state-of-the-art visual grounding models through the use of massive, synthetically generated data. The study crafts a synthetic dataset using controllable generative models, offering a scalable solution to overcome challenges in traditional data collection processes. An evaluation of a visual grounding model (TransVG) on the synthetic dataset shows promising results, with attributes contributing to a diverse dataset of 250,000 samples. The resulting dataset showcases the impact of synthetic data on visual grounding evolution, contributing to advancements in this dynamic field.