6 research outputs found

    Learning to Generate Unambiguous Spatial Referring Expressions for Real-World Environments

    Full text link
    Referring to objects in a natural and unambiguous manner is crucial for effective human-robot interaction. Previous research on learning-based referring expressions has focused primarily on comprehension tasks, while generating referring expressions is still mostly limited to rule-based methods. In this work, we propose a two-stage approach that relies on deep learning for estimating spatial relations to describe an object naturally and unambiguously with a referring expression. We compare our method to the state-of-the-art algorithm in ambiguous environments (e.g., environments that include very similar objects with similar relationships). We show that our method generates referring expressions that people find to be more accurate (~30% better) and would prefer to use (~32% more often). Comment: International Conference on Intelligent Robots and Systems (IROS 2019); Demo 1: Finding the described object (https://youtu.be/BE6-F6chW0w); Demo 2: Referring to the pointed object (https://youtu.be/nmmv6JUpy8M); Supplementary Video (https://youtu.be/sFjBa_MHS98)
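    As a rough illustration of such a two-stage pipeline (the abstract does not specify the model, so the object representation, relation inventory, and scoring function below are assumptions, not the paper's architecture): stage 1 is a learned scorer for spatial relations between detected objects, and stage 2 searches for a relation/landmark pair that singles out the target.

    from dataclasses import dataclass

    @dataclass
    class Obj:
        name: str    # e.g. "mug"
        box: tuple   # (x, y, w, h) in image coordinates

    # Assumed relation inventory; the paper's actual set is not given in the abstract.
    RELATIONS = ["left of", "right of", "behind", "in front of", "on top of"]

    def generate_expression(target, others, relation_scorer):
        """Stage 2: pick the relation/landmark pair that best disambiguates the target.

        relation_scorer(subject, relation, landmark) -> float stands in for the
        stage-1 learned model (e.g. a small network over the two bounding boxes).
        """
        best_pair, best_margin = None, float("-inf")
        for landmark in others:
            for rel in RELATIONS:
                score = relation_scorer(target, rel, landmark)
                # Penalize pairs that fit a same-named distractor equally well.
                ambiguity = max(
                    (relation_scorer(d, rel, landmark)
                     for d in others if d is not landmark and d.name == target.name),
                    default=0.0,
                )
                if score - ambiguity > best_margin:
                    best_pair, best_margin = (rel, landmark), score - ambiguity
        if best_pair is None:
            return f"the {target.name}"
        rel, landmark = best_pair
        return f"the {target.name} {rel} the {landmark.name}"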

    Modeling Human Referring Expression Generation

    Get PDF
    Referring expression generation is a natural language processing task that involves creating noun phrases that identify a referent object to a listener. We evaluate the state-of-the-art Visible Objects Algorithm for referring expression generation presented by Mitchell et al. (2012), and find that it does not perform as well on our natural image set as on the computer-generated image set that was originally used. Further, we analyze over 7,000 referring expressions generated by players of the ReferIt Game, an online game that we developed, and by Amazon Mechanical Turk workers, to identify metrics with which to create an improved stochastic model that can be coupled with computer vision to mimic human referring expression generation from visual input. Bachelor of Science

    From Image to Language and Back Again

    Get PDF
    Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).

    Language and Perceptual Categorization in Computational Visual Recognition

    Get PDF
    Computational visual recognition, or giving computers the ability to understand images as well as humans do, is a core problem in Computer Vision. Traditional recognition systems often describe visual content by producing a set of isolated labels, object locations, or even by trying to annotate every pixel in an image with a category. People instead describe the visual world using language. The rich visually descriptive language produced by people incorporates information from human intuition, world knowledge, visual saliency, and common sense that goes beyond detecting individual visual concepts like objects, attributes, or scenes. Moreover, due to the rising popularity of social media, there exist billions of images with associated text on the web, yet systems that can leverage this type of annotation or try to connect language and vision are scarce. In this dissertation, we propose new approaches that explore the connections between language and vision at several levels of detail by combining techniques from Computer Vision and Natural Language Understanding. We first present a data-driven technique for understanding and generating image descriptions using natural language, including automatically collecting a large-scale dataset of images with visually descriptive captions. Then we introduce a system for retrieving short visually descriptive phrases for describing some part or aspect of an image, and a simple technique to generate full image descriptions by stitching short phrases. Next we introduce an approach for collecting and generating referring expressions for objects in natural scenes at a much larger scale than previous studies. Finally, we describe methods for learning how to name objects by using intuitions from perceptual categorization related to basic-level and entry-level categories. The main contribution of this thesis is in advancing our knowledge on how to leverage language and intuitions from human perception to create visual recognition systems that can better learn from and communicate with people. Doctor of Philosophy
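    For the phrase-stitching idea mentioned above, a minimal sketch might look like the following, assuming a retrieval step has already produced scored candidate phrases per description slot; the slot names, scores, and joining scheme are purely illustrative and are not taken from the dissertation.

    def stitch_description(candidates):
        """Pick the highest-scoring phrase for each slot and join them into one sentence.

        candidates maps a slot name to a list of (phrase, similarity_score) pairs.
        """
        order = ["object", "attribute_or_action", "scene"]   # assumed slot order
        parts = []
        for slot in order:
            phrases = candidates.get(slot, [])
            if phrases:
                best_phrase, _ = max(phrases, key=lambda p: p[1])
                parts.append(best_phrase)
        return ", ".join(parts).capitalize() + "."

    # Example usage with made-up retrieved phrases:
    print(stitch_description({
        "object": [("a brown dog", 0.91), ("a puppy", 0.74)],
        "attribute_or_action": [("running on grass", 0.83)],
        "scene": [("in a sunny park", 0.66)],
    }))
    # -> "A brown dog, running on grass, in a sunny park."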

    Graphs and spatial relations in the generation of referring expressions

    No full text
    When they introduced the Graph-Based Algorithm (GBA) for referring expression generation, Krahmer et al. (2003) flaunted the natural way in which it deals with relations between objects; but this feature has never been tested empirically. We fill this gap in this paper, exploring referring expression generation from the perspective of the GBA and focusing in particular on generating human-like expressions in visual scenes with spatial relations. We compare the original GBA against a variant that we introduce to better reflect human reference, and find that although the original GBA performs reasonably well, our new algorithm offers an even better match to human data (77.91% Dice). Further, it can be extended to capture speaker variation, reaching an 82.83% Dice overlap with human-produced expressions.
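    The Dice figures above measure set overlap between the properties selected for a generated expression and those in a human-produced one, Dice = 2|A ∩ B| / (|A| + |B|), averaged over trials. A minimal sketch of that computation follows; the property sets in the example are made up for illustration.

    def dice(generated, human):
        """Dice coefficient between two sets of properties (1.0 if both empty)."""
        if not generated and not human:
            return 1.0
        return 2 * len(generated & human) / (len(generated) + len(human))

    # Example: the algorithm selects {type: chair, relation: left-of table};
    # the human also mentioned the colour.
    gen = {"type:chair", "rel:left-of:table"}
    hum = {"type:chair", "colour:red", "rel:left-of:table"}
    print(f"{dice(gen, hum):.2f}")  # 0.80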