
    Object Referring in Visual Scene with Spoken Language

    Object referring has important applications, especially for human-machine interaction. While it has received great attention, the task is mainly tackled with written language (text) as input rather than spoken language (speech), which is more natural. This paper investigates Object Referring with Spoken Language (ORSpoken) by presenting two datasets and one novel approach. Objects are annotated with their locations in images, text descriptions, and speech descriptions, making the datasets ideal for multi-modality learning. The approach is developed by carefully breaking down the ORSpoken problem into three sub-problems and introducing task-specific vision-language interactions at the corresponding levels. Experiments show that our method outperforms competing methods consistently and significantly. The approach is also evaluated in the presence of audio noise, showing the efficacy of the proposed vision-language interaction methods in counteracting background noise.
    Comment: 10 pages, Submitted to WACV 201
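
    The abstract describes a decomposition into three sub-problems (roughly: speech transcription, language encoding, and visual grounding). The sketch below is only an assumed, minimal illustration of such a staged pipeline; the function names, the hash-based text embedding, and the cosine-similarity grounding are placeholders, not the authors' method.

```python
# Hypothetical three-stage ORSpoken-style pipeline:
# (1) speech -> text, (2) text -> query embedding, (3) grounding against
# candidate object regions. All names and scoring choices are illustrative.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Region:
    box: tuple           # (x1, y1, x2, y2) in image coordinates
    feature: np.ndarray  # visual embedding for this region

def transcribe(audio: np.ndarray) -> str:
    """Stage 1 (placeholder): speech recognition; a real system would use
    an ASR model robust to background noise."""
    return "the red cup on the table"

def encode_text(text: str, dim: int = 128) -> np.ndarray:
    """Stage 2 (placeholder): toy bag-of-words hash embedding."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def ground(query: np.ndarray, regions: List[Region]) -> int:
    """Stage 3: score each candidate region by cosine similarity with the
    language query and return the best-matching index."""
    scores = [float(query @ (r.feature / (np.linalg.norm(r.feature) + 1e-8)))
              for r in regions]
    return int(np.argmax(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = [Region(box=(i, i, i + 10, i + 10), feature=rng.normal(size=128))
               for i in range(3)]
    query = encode_text(transcribe(np.zeros(16000)))
    print("selected region:", ground(query, regions))
```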

    QUESTION ANSWERING, GROUNDING, AND GENERATION FOR VISION AND LANGUAGE

    One ultimate goal of AI is to develop an artificially intelligent (AI) system that can communicate with people in a natural way. Such communication includes, but is not limited to, asking us humans questions, answering our questions, conducting dialogue with human beings, and performing actions to better serve people. Imagine a future where service robots are everywhere and we could ask our home robot to "grab me the red cup on the table." To perform this command, the AI system needs to understand the spoken English sentence, perceive the visual world, navigate to the right place ("table"), recognize the right object ("the red cup"), then grab it and finally return it to the commander. Even this single command involves many techniques, such as speech recognition, language understanding, scene understanding, embodied navigation, object recognition, pose estimation, and robot manipulation. None of these techniques is fully solved yet, but progress toward success is rapid. This thesis advances our knowledge by exploring various connections between vision, language, and beyond, pushing toward this ultimate goal. We study three popular vision and language tasks: visual question answering, language grounding, and image-to-text language generation. For each, we introduce a proposed novel task, accompanied by a high-quality dataset and well-performing data-driven approaches. Specifically, we first introduce Visual Madlibs for image-based and region-based question answering. Then we introduce referring expressions, where we study both referring expression comprehension and generation, covering both language grounding and generation. Next, we study album summarization, which not only selects the key photos inside an album but also generates a natural language story describing the whole album. Last but not least, we describe multi-target embodied question answering, a task that is even closer to our ultimate goal and requires both language understanding and navigation ability from the AI system.
    Doctor of Philosophy
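
    The "grab me the red cup on the table" scenario above chains several subsystems. As a purely illustrative orchestration sketch (every function is an assumed stub standing in for a full subsystem, not anything from the thesis):

```python
# Toy pipeline for the "grab me the red cup on the table" scenario.
# Each function is a stub for a full subsystem (ASR, language understanding,
# navigation, grounding, manipulation); names and outputs are invented.
from typing import Dict

def recognize_speech(audio_path: str) -> str:
    return "grab me the red cup on the table"   # ASR stub

def parse_command(utterance: str) -> Dict[str, str]:
    # Language-understanding stub: extract action, target object, location.
    return {"action": "grab", "object": "red cup", "location": "table"}

def navigate_to(location: str) -> None:
    print(f"navigating to the {location}")      # embodied-navigation stub

def locate_object(description: str) -> Dict[str, float]:
    # Referring-expression grounding stub: would return a pose for the object.
    return {"x": 0.4, "y": 0.1, "z": 0.7}

def grasp(pose: Dict[str, float]) -> None:
    print(f"grasping object at {pose}")         # manipulation stub

if __name__ == "__main__":
    command = parse_command(recognize_speech("request.wav"))
    navigate_to(command["location"])
    grasp(locate_object(command["object"]))
```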

    Object Referring in Videos with Language and Human Gaze

    We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works for OR mostly focus on static images only, which fall short of providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated with their descriptions and gaze. We further propose a novel network model for OR in videos, integrating appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatio-temporal context, and outperforms previous OR methods. For the dataset and code, please refer to https://people.ee.ethz.ch/~arunv/ORGaze.html.
    Comment: Accepted to CVPR 2018, 10 pages, 6 figures
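
    The paper integrates appearance, motion, gaze, and spatio-temporal context in one learned network. The snippet below only illustrates the general idea of fusing several per-candidate cue scores; the weighted-sum fusion and the weights are assumptions for illustration, not the paper's model.

```python
# Minimal illustration of fusing several cue streams (appearance, motion,
# gaze, spatio-temporal context) into one score per candidate object.
import numpy as np

def fuse_scores(appearance, motion, gaze, context, weights=(0.4, 0.2, 0.2, 0.2)):
    """Each argument is an array of per-candidate scores in [0, 1];
    returns the index of the highest fused score and the fused scores."""
    cues = np.stack([appearance, motion, gaze, context])  # shape (4, N)
    fused = np.asarray(weights) @ cues                     # weighted sum per candidate
    return int(np.argmax(fused)), fused

if __name__ == "__main__":
    appearance = np.array([0.7, 0.6, 0.2])
    motion     = np.array([0.1, 0.8, 0.3])
    gaze       = np.array([0.2, 0.9, 0.1])
    context    = np.array([0.5, 0.7, 0.4])
    best, fused = fuse_scores(appearance, motion, gaze, context)
    print("fused scores:", fused.round(3), "-> candidate", best)
```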

    Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions

    Comprehension of spoken natural language is an essential component for robots to communicate with humans effectively. However, handling unconstrained spoken instructions is challenging due to (1) complex structures, including the wide variety of expressions used in spoken language, and (2) inherent ambiguity in the interpretation of human instructions. In this paper, we propose the first comprehensive system that can handle unconstrained spoken language and effectively resolve ambiguity in spoken instructions. Specifically, we integrate deep-learning-based object detection with natural language processing technologies to handle unconstrained spoken instructions, and propose a method for robots to resolve instruction ambiguity through dialogue. Through experiments in both a simulated environment and on a physical industrial robot arm, we demonstrate that our system understands natural instructions from human operators effectively, and that higher success rates on the object picking task can be achieved through an interactive clarification process.
    Comment: 9 pages. International Conference on Robotics and Automation (ICRA) 2018. Accompanying videos are available at the following links: https://youtu.be/_Uyv1XIUqhk (the system submitted to ICRA 2018) and http://youtu.be/DGJazkyw0Ws (with improvements after the ICRA 2018 submission)
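
    A hedged sketch of the interactive-clarification idea described above: if the top two candidate objects match a spoken instruction almost equally well, ask the operator a clarifying question instead of picking. The margin threshold and question template are invented for this example and do not reflect the paper's dialogue method.

```python
# If the best two candidates score within `margin` of each other, return a
# clarification question for the dialogue loop instead of a pick decision.
from typing import List, Tuple

def resolve(candidates: List[Tuple[str, float]], margin: float = 0.1):
    """candidates: (label, match score) pairs; higher score = better match."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if best[1] - runner_up[1] < margin:
        return None, f"Did you mean the {best[0]} or the {runner_up[0]}?"
    return best[0], None

if __name__ == "__main__":
    picks = [("blue mug near the edge", 0.62),
             ("blue mug by the keyboard", 0.58),
             ("red bottle", 0.21)]
    choice, question = resolve(picks)
    print(question if choice is None else f"picking: {choice}")
```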

    Learning to Generate Unambiguous Spatial Referring Expressions for Real-World Environments

    Referring to objects in a natural and unambiguous manner is crucial for effective human-robot interaction. Previous research on learning-based referring expressions has focused primarily on comprehension tasks, while generating referring expressions is still mostly limited to rule-based methods. In this work, we propose a two-stage approach that relies on deep learning to estimate spatial relations and describe an object naturally and unambiguously with a referring expression. We compare our method to the state-of-the-art algorithm in ambiguous environments (e.g., environments that include very similar objects with similar relationships). We show that our method generates referring expressions that people find to be more accurate (~30% better) and would prefer to use (~32% more often).
    Comment: International Conference on Intelligent Robots and Systems (IROS 2019). Demo 1: Finding the described object (https://youtu.be/BE6-F6chW0w), Demo 2: Referring to the pointed object (https://youtu.be/nmmv6JUpy8M), Supplementary Video (https://youtu.be/sFjBa_MHS98)
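
    The snippet below gives only the flavor of a two-stage generation scheme: estimate a spatial relation between the target and each landmark, then keep a relation/landmark pair that holds for the target but for no distractor. The rule-based relation function and the sentence template are simplifications assumed for illustration; the paper uses a learned spatial-relation estimator.

```python
# Toy unambiguous referring-expression generation from 2D boxes.
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def relation(a: Box, b: Box) -> str:
    """Coarse rule-based relation of box a w.r.t. box b (stand-in for a
    learned spatial-relation estimator)."""
    ax, bx = (a[0] + a[2]) / 2, (b[0] + b[2]) / 2
    return "left of" if ax < bx else "right of"

def describe(target: str, objects: Dict[str, Box], landmarks: Dict[str, Box]) -> str:
    distractors = {n: b for n, b in objects.items() if n != target}
    for lname, lbox in landmarks.items():
        rel = relation(objects[target], lbox)
        # Unambiguous only if no distractor has the same relation to this landmark.
        if all(relation(dbox, lbox) != rel for dbox in distractors.values()):
            return f"the {target.split('#')[0]} {rel} the {lname}"
    return f"the {target.split('#')[0]}"  # fall back to a (possibly ambiguous) name

if __name__ == "__main__":
    objects = {"cup#1": (10, 40, 30, 60), "cup#2": (80, 40, 100, 60)}
    landmarks = {"laptop": (50, 30, 75, 70)}
    print(describe("cup#1", objects, landmarks))  # -> "the cup left of the laptop"
```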

    The iconicity advantage in sign production: The case of bimodal bilinguals

    Recent evidence demonstrates that pictures corresponding to iconic signs are named faster than pictures corresponding to non-iconic signs. The present study investigates the locus of the iconicity advantage in hearing bimodal bilinguals. A naming experiment with iconic and non-iconic pictures in Italian Sign Language (LIS) was conducted. Bimodal bilinguals named the pictures either using a noun construction that involved producing the sign corresponding to the picture or using a marked demonstrative pronoun construction replacing the picture sign. In the latter condition, the pictures were colored and participants were instructed to name the pronoun together with the color. The iconicity advantage was reliable in the noun utterance but not in the marked demonstrative pronoun utterance. In a third condition, the colored pictures were presented as distractor stimuli and participants were required to name the color. In this last condition, distractor pictures with iconic signs elicited faster naming latencies than those with non-iconic signs. The results suggest that the advantage of iconic signs in production arises at the level of semantic-to-phonological links. In addition, we conclude that bimodal bilinguals and native signers do not differ in terms of the activation flow within the sign production system.