
    Learning to Generate Unambiguous Spatial Referring Expressions for Real-World Environments

    Referring to objects in a natural and unambiguous manner is crucial for effective human-robot interaction. Previous research on learning-based referring expressions has focused primarily on comprehension tasks, while generating referring expressions is still mostly limited to rule-based methods. In this work, we propose a two-stage approach that relies on deep learning for estimating spatial relations to describe an object naturally and unambiguously with a referring expression. We compare our method to the state-of-the-art algorithm in ambiguous environments (e.g., environments that include very similar objects with similar relationships). We show that our method generates referring expressions that people find to be more accurate (~30% better) and would prefer to use (~32% more often).
    Comment: International Conference on Intelligent Robots and Systems (IROS 2019). Demo 1: Finding the described object (https://youtu.be/BE6-F6chW0w). Demo 2: Referring to the pointed object (https://youtu.be/nmmv6JUpy8M). Supplementary video (https://youtu.be/sFjBa_MHS98).
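
    A minimal sketch of the two-stage idea described above, assuming a toy scene representation: the geometric relation scorer below stands in for the paper's learned deep relation estimator, and all names, weights, and thresholds are illustrative, not the authors' implementation.

        # Stage 1: score spatial relations between the target and each candidate
        # relatum (geometric stand-in for a learned estimator).
        # Stage 2: pick the least ambiguous relation/relatum pair.
        from dataclasses import dataclass

        @dataclass
        class Obj:
            name: str
            x: float  # image-plane centre, arbitrary units
            y: float

        def relation_scores(target: Obj, relatum: Obj) -> dict:
            """Stage 1 (stand-in): confidence for each spatial relation."""
            dx, dy = target.x - relatum.x, target.y - relatum.y
            return {
                "left of": max(0.0, -dx),
                "right of": max(0.0, dx),
                "behind": max(0.0, -dy),
                "in front of": max(0.0, dy),
            }

        def generate_expression(target: Obj, others: list) -> str:
            """Stage 2: choose the highest-scoring relation/relatum pair,
            preferring relata whose name is unique in the scene (less ambiguous)."""
            best, best_score = None, float("-inf")
            names = [o.name for o in others]
            for relatum in others:
                uniqueness = 1.0 if names.count(relatum.name) == 1 else 0.2
                for rel, score in relation_scores(target, relatum).items():
                    if score * uniqueness > best_score:
                        best, best_score = (rel, relatum), score * uniqueness
            rel, relatum = best
            return f"the {target.name} {rel} the {relatum.name}"

        scene = [Obj("mug", 2, 1), Obj("laptop", 5, 1), Obj("mug", 8, 1)]
        print(generate_expression(scene[0], scene[1:]))  # e.g. "the mug left of the laptop"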

    Searching for Ambiguous Objects in Videos using Relational Referring Expressions

    Humans frequently use referring (identifying) expressions to refer to objects. Especially in ambiguous settings, humans prefer expressions (called relational referring expressions) that describe an object with respect to a distinguishing, unique object. Unlike previous studies on video object search using referring expressions, our focus in this paper is on (i) relational referring expressions in highly ambiguous settings, and (ii) methods that can both generate and comprehend a referring expression. To this end, we first introduce a new dataset for video object search with referring expressions that includes numerous copies of the objects, making it difficult to use non-relational expressions. Moreover, we train two baseline deep networks on this dataset, which show promising results. Finally, we propose a deep attention network that significantly outperforms the baselines on our dataset. The dataset and the code are available at https://github.com/hazananayurt/viref.
    Comment: BMVC 2019 camera-ready.
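
    The following is a small illustrative sketch, not the VIREF architecture itself: it shows how an attention-style scorer could rank candidate (target, relatum) pairs against an embedded relational expression, which is the comprehension direction of the task. Dimensions, module names, and the random placeholder features are assumptions.

        import torch
        import torch.nn as nn

        class PairScorer(nn.Module):
            """Scores each (target, relatum) pair feature against an expression embedding."""
            def __init__(self, obj_dim=16, text_dim=32, hidden=64):
                super().__init__()
                self.mlp = nn.Sequential(
                    nn.Linear(2 * obj_dim + text_dim, hidden),
                    nn.ReLU(),
                    nn.Linear(hidden, 1),
                )

            def forward(self, pair_feats, expr_emb):
                # pair_feats: (num_pairs, 2*obj_dim), expr_emb: (text_dim,)
                expr = expr_emb.unsqueeze(0).expand(pair_feats.size(0), -1)
                return self.mlp(torch.cat([pair_feats, expr], dim=-1)).squeeze(-1)

        scorer = PairScorer()
        pair_feats = torch.randn(6, 32)  # 6 candidate (target, relatum) pairs
        expr_emb = torch.randn(32)       # e.g. pooled encoder state of the expression
        scores = scorer(pair_feats, expr_emb)
        best_pair = scores.argmax().item()  # index of the pair best matching the expression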

    Talking about Relations: Factors Influencing the Production of Relational Descriptions

    In a production experiment (Experiment 1) and an acceptability rating experiment (Experiment 2), we assessed two factors, spatial position and salience, that may influence the production of relational descriptions (such as "the ball between the man and the drawer"). In Experiment 1, speakers were asked to refer unambiguously to a target object (a ball). In Experiment 1a, we addressed the role of spatial position, more specifically whether speakers mention the entity positioned leftmost in the scene as (first) relatum. The results showed a preference to start with the left entity, though only as a trend, which leaves room for other factors that could influence spatial reference. In the following studies, we therefore varied salience systematically: by making one of the relatum candidates animate (Experiment 1b), and by adding attention-capture cues, first subliminally by priming one relatum candidate with a flash (Experiment 1c), then explicitly by using salient colors for objects (Experiment 1d). Results indicate that spatial position played a dominant role: entities on the left were mentioned more often as (first) relatum than those on the right (Experiments 1a, 1b, 1c, 1d). Animacy affected reference production in one out of three studies (Experiment 1d). When salience was manipulated by priming visual attention or by using salient colors, there were no significant effects (Experiments 1c, 1d). In the acceptability rating study (Experiment 2), participants expressed their preference for specific relata by ranking descriptions on the basis of how well they thought the descriptions fitted the scene. Results show that participants most preferred descriptions that had an animate entity as the first-mentioned relatum. The relevance of these results for models of reference production is discussed.
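
    As a toy illustration of the reported pattern (left position dominant in production, animacy mattering mainly in the ratings), one could model relatum choice as a weighted preference; the weights below are made up for illustration and are not fitted to the experiments.

        def relatum_preference(candidate, weights=None):
            """candidate: dict with 'is_left' and 'is_animate' booleans."""
            w = weights or {"left": 1.0, "animate": 0.3}  # illustrative weights only
            return w["left"] * candidate["is_left"] + w["animate"] * candidate["is_animate"]

        candidates = [
            {"name": "man", "is_left": False, "is_animate": True},
            {"name": "drawer", "is_left": True, "is_animate": False},
        ]
        ranked = sorted(candidates, key=relatum_preference, reverse=True)
        # With these weights the left entity ("drawer") is preferred as first relatum,
        # mirroring the dominance of spatial position in the production data.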

    Question Answering, Grounding, and Generation for Vision and Language

    One ultimate goal of AI is to develop an artificial intelligence (AI) system that can communicate with people in a natural way. Such communication includes, but is not limited to, asking us humans questions, answering our questions, conducting dialogue with human beings, and performing actions to better serve people. Imagine a future where service robots are everywhere, and we could ask our home robot to “grab me the red cup on the table.” To perform this command, the AI system needs to understand the spoken English sentence, perceive the visual world, navigate to the right place (the table), recognize the right object (the red cup), then grab it and finally return it to the commander. This single command already involves many techniques, such as speech recognition, language understanding, scene understanding, embodied navigation, object recognition, pose estimation, and robot manipulation. None of these techniques is fully solved yet, but progress toward this goal is rapid. This thesis advances our knowledge by exploring various connections between vision, language, and beyond, to push forward this ultimate goal. We study three popular vision-and-language tasks: visual question answering, language grounding, and image-to-text language generation. For each, we introduce a proposed novel task, accompanied by a high-quality dataset and well-performing data-driven approaches. Specifically, we first introduce Visual Madlibs for image-based and region-based question answering. Then we introduce referring expressions, where we study both referring expression comprehension and generation, covering both language grounding and generation. Next, we study album summarization, which not only selects the key photos inside an album but also generates a natural language story describing the whole album. Last but not least, we describe multi-target embodied question answering, a task even closer to our ultimate goal, requiring both language understanding and navigation ability from the AI system.
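
    A schematic, runnable stub of the command pipeline the abstract sketches (“grab me the red cup on the table”); every function body is a placeholder assumption, and the only point is the ordering of the components listed above.

        def recognize_speech(audio: bytes) -> str:
            return "grab me the red cup on the table"  # stand-in speech recognition

        def understand(text: str) -> dict:
            # stand-in language understanding: extract the object phrase and the place
            return {"object": "red cup", "place": "table"}

        def navigate_to(place: str) -> str:
            return f"at:{place}"  # stand-in embodied navigation

        def ground(object_phrase: str, location: str) -> str:
            return f"{object_phrase}@{location}"  # stand-in referring-expression grounding

        def grasp_and_return(obj: str) -> str:
            return f"delivered {obj} to commander"  # stand-in pose estimation + manipulation

        def handle_command(audio: bytes) -> str:
            goal = understand(recognize_speech(audio))
            return grasp_and_return(ground(goal["object"], navigate_to(goal["place"])))

        print(handle_command(b""))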

    The effectiveness of dynamically processed incremental descriptions in human robot interaction

    We explore the effectiveness of a dynamically processed incremental referring description system that uses under-specified, ambiguous descriptions which are then built upon with linguistic repair statements; we refer to this as a dynamic system. We build a dynamically processed incremental referring description generation system that is able to provide contextual navigational statements to describe an object in a potential real-world situation of nuclear waste sorting and maintenance. In a study of 31 participants, we test the dynamic system in a case where a user is remotely operating a robot to sort nuclear waste, with the robot assisting them in identifying the correct barrels to be removed. We compare this against a static, non-ambiguous description given in the same scenario. As well as looking at efficiency with time and distance measurements, we also look at user preference. Results show that our dynamic system was a much more efficient method for finding the correct barrel, taking only 62% of the time on average. Participants also favoured our dynamic system.
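
    A minimal sketch of the dynamic strategy described above, not the study's implementation: start with a deliberately under-specified description and, while several barrels still match, add repair statements that narrow the candidate set. Barrel attributes and phrasing are illustrative assumptions.

        barrels = [
            {"id": 1, "row": "front", "side": "left", "marked": False},
            {"id": 2, "row": "front", "side": "right", "marked": False},
            {"id": 3, "row": "back", "side": "right", "marked": True},
        ]
        target = barrels[2]

        def describe_incrementally(target, candidates):
            utterances = ["the barrel"]             # deliberately ambiguous opener
            for attr in ("row", "side", "marked"):  # repair statements, one at a time
                if len(candidates) == 1:
                    break
                candidates = [b for b in candidates if b[attr] == target[attr]]
                utterances.append(f"the one with {attr} = {target[attr]}")
            return utterances, candidates

        steps, remaining = describe_incrementally(target, barrels)
        print(steps)      # e.g. ['the barrel', 'the one with row = back']
        print(remaining)  # only the target barrel is left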