    Automatic generation of natural language descriptions of visual data: describing images and videos using recurrent and self-attentive models

    Humans are faced with a constant flow of visual stimuli, e.g., from the environment or when looking at social media. In contrast, visually impaired people are often unable to perceive and process this beneficial information, which could help them maneuver through everyday situations and activities. However, audible feedback such as natural language can make them more aware of their surroundings, thus enabling them to master everyday challenges autonomously. One way to create audible feedback is to produce natural language descriptions for visual data such as still images and then read this text to the person. Moreover, textual descriptions of images can be further utilized for text analysis (e.g., sentiment analysis) and information aggregation. In this work, we investigate different approaches and techniques for the automatic generation of natural language descriptions of visual data such as still images and video clips. In particular, we look at language models that generate textual descriptions with recurrent neural networks. First, we present a model that generates image captions for scenes that depict interactions between humans and branded products. Here, we focus on the correct identification of the brand name in a multi-task training setting and present two new metrics that allow us to evaluate this requirement. Second, we explore the automatic answering of questions posed about an image. We propose a model that generates answers from scratch instead of predicting an answer from a limited set of possible answers. In comparison to related works, we are therefore able to generate rare answers, which are not contained in the pool of frequent answers. Third, we review the automatic generation of doctors' reports for chest X-ray images. That is, we introduce a model that can cope with the dataset bias of medical datasets (i.e., abnormal cases are very rare) and generates reports with a hierarchical recurrent model. We also investigate the correlation between the distinctiveness of the report and the score in traditional metrics and find a discrepancy between good scores and accurate reports. Then, we examine self-attentive language models that improve computational efficiency and performance over the recurrent models. Specifically, we utilize the Transformer architecture. First, we expand the automatic description generation to the domain of videos, where we present a video-to-text (VTT) model that can easily synchronize audio-visual features. With an extensive experimental exploration, we verify the effectiveness of our video-to-text translation pipeline. Finally, we revisit our recurrent models with this self-attentive approach.

    Towards Succinct and Relevant Image Descriptions

    What does it mean to produce a good description of an image? Is a description good because it correctly identifies all of the objects in the image, because it describes the interesting attributes of the objects, or because it is short, yet informative? Grice’s Cooperative Principle, stated as “Make your contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged” (Grice, 1975), alongside other ideas of pragmatics in communication, has proven useful in thinking about language generation (Hovy, 1987; McKeown et al., 1995). The Cooperative Principle provides one possible framework for thinking about the generation and evaluation of image descriptions. The immediate question is whether automatic image description is within the scope of the Cooperative Principle. Consider the task of searching for images using natural language, where the purpose of the exchange is for the user to quickly and accurately find images that match their information needs. In this scenario, the user formulates a complete sentence query to express their needs, e.g., "A sheepdog chasing sheep in a field", and initiates an exchange with the system in the form of a sequence of one-shot conversations. In this exchange, both participants can describe images in natural language, and a successful outcome relies on each participant succinctly and correctly expressing their beliefs about the images.

    Confluence of Vision and Natural Language Processing for Cross-media Semantic Relations Extraction

    In this dissertation, we focus on extracting and understanding semantically meaningful relationships between data items of various modalities, especially relations between images and natural language. We explore ideas and techniques to integrate such cross-media semantic relations for machine understanding of large heterogeneous datasets, made available through the expansion of the World Wide Web. The datasets collected from social media websites, news media outlets and blogging platforms usually contain multiple modalities of data. Intelligent systems are needed to automatically make sense of these datasets and present them in such a way that humans can find the relevant pieces of information or get a summary of the available material. Such systems have to process multiple modalities of data such as images, text, linguistic features, and structured data in reference to each other. For example, image and video search and retrieval engines are required to understand the relations between visual and textual data so that they can provide relevant answers in the form of images and videos to the users' queries presented in the form of text. We emphasize the automatic extraction of semantic topics or concepts from the data available in any form such as images, free-flowing text or metadata. These semantic concepts/topics become the basis of semantic relations across heterogeneous data types, e.g., visual and textual data. A classic problem involving image-text relations is the automatic generation of textual descriptions of images. This problem is the main focus of our work. In many cases, a large amount of text is associated with images. Deep exploration of the linguistic features of such text is required to fully utilize the semantic information encoded in it. A news dataset involving images and news articles is an example of this scenario. We devise frameworks for automatic news image description generation based on the semantic relations of images, as well as semantic understanding of linguistic features of the news articles.

    Generating Natural Questions About an Image

    There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language. Comment: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics