26,132 research outputs found

    Embodied Question Answering

    We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?"). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ("orange"). This challenging task requires a range of AI skills -- active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding of language into actions. In this work, we develop the environments, end-to-end-trained reinforcement learning agents, and evaluation protocols for EmbodiedQA.
    Comment: 20 pages, 13 figures, Webpage: https://embodiedqa.org

    ANGELICA : choice of output modality in an embodied agent

    The ANGELICA project addresses the problem of modality choice in information presentation by embodied, humanlike agents. The output modalities available to such agents include both language and various nonverbal signals such as pointing and gesturing. For each piece of information to be presented by the agent, it must be decided whether it should be expressed using language, a nonverbal signal, or both. In the ANGELICA project, a model of the different factors influencing this choice will be developed and integrated into a natural language generation system. The application domain is the presentation of route descriptions by an embodied agent in a 3D environment. Evaluation and testing form an integral part of the project. In particular, we will investigate the effect of different modality choices on the effectiveness and naturalness of the generated presentations and on the user's perception of the agent's personality.

    Maps, agents and dialogue for exploring a virtual world

    In previous years we have been involved in several projects in which users (or visitors) had to find their way in information-rich virtual environments. 'Information-rich' means that the users do not know beforehand what is available in the environment or where to go in the environment to find the information; moreover, users or visitors do not necessarily know exactly what they are looking for. 'Information-rich' also means that the information may change over time. A second visit to the same environment will require different behavior from the visitor in order for him or her to obtain information similar to what was available during a previous visit. In this paper we report on two projects and discuss our attempts to generalize from the different approaches and application domains to obtain a library of methods and tools for designing and implementing intelligent agents that inhabit virtual environments and support the navigation of the user/visitor.

    Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

    A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator -- a large-scale reinforcement learning environment based on real imagery. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings -- the Room-to-Room (R2R) dataset.
    Comment: CVPR 2018 Spotlight presentation