
    The Long-Short Story of Movie Description

    Generating descriptions for videos has many applications, including assisting blind people and human-robot interaction. Recent advances in image captioning, as well as the release of large-scale movie description datasets such as MPII Movie Description, make it possible to study this task in more depth. Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and Long Short-Term Memory recurrent networks (LSTMs) for generating descriptions. While image description focuses on objects, we argue that it is important to distinguish verbs, objects, and places in the challenging setting of movie description. In this work we show how to learn robust visual classifiers from the weak annotations provided by the sentence descriptions. Based on these visual classifiers we learn how to generate a description using an LSTM. We explore different design choices for building and training the LSTM and achieve the best performance to date on the challenging MPII-MD dataset. We compare and analyze our approach and prior work along various dimensions to better understand the key challenges of the movie description task.
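
    The pipeline described above first scores verbs, objects, and places with visual classifiers and then feeds those scores to an LSTM that generates the sentence. As a rough illustration only (the paper's exact architecture, dimensions, and training details are not given here, so every name and size below is an assumption), a minimal PyTorch sketch of an LSTM decoder conditioned on a vector of classifier scores could look like this:

        import torch
        import torch.nn as nn

        class ClassifierConditionedLSTM(nn.Module):
            """Sketch: an LSTM decoder that predicts the next word of a description,
            conditioned on a vector of visual classifier scores (verbs, objects, places)."""
            def __init__(self, vocab_size, score_dim, embed_dim=256, hidden_dim=512):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, embed_dim)
                self.visual_proj = nn.Linear(score_dim, embed_dim)  # project classifier scores
                self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, batch_first=True)
                self.out = nn.Linear(hidden_dim, vocab_size)

            def forward(self, scores, word_ids):
                # scores: (B, score_dim) classifier outputs; word_ids: (B, T) previous words
                v = self.visual_proj(scores).unsqueeze(1).expand(-1, word_ids.size(1), -1)
                w = self.embed(word_ids)
                h, _ = self.lstm(torch.cat([w, v], dim=-1))
                return self.out(h)  # (B, T, vocab_size) next-word logits

        # Toy usage with random inputs (2 clips, 300 classifier scores, 12 previous words).
        model = ClassifierConditionedLSTM(vocab_size=10000, score_dim=300)
        logits = model(torch.rand(2, 300), torch.randint(0, 10000, (2, 12)))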

    Movie Description

    Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs that are temporally aligned to full-length movies. In addition, we also collected and aligned the movie scripts used in prior work and compare the two sources of descriptions. In total, the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First, we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in the challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)" at ICCV 2015.

    Automatically Generating Natural Language Descriptions of Images by a Deep Hierarchical Framework.

    Automatically generating an accurate and meaningful description of an image is very challenging. However, the recent scheme of generating an image caption by maximizing the likelihood of target sentences lacks the capacity to recognize human-object interactions (HOIs) and the semantic relationships between HOIs and scenes, which are essential parts of an image caption. This article proposes a novel two-phase framework that addresses these challenges: 1) a hybrid deep-learning phase and 2) an image description generation phase. In the hybrid deep-learning phase, a novel factored three-way interaction machine is proposed to learn the relational features of human-object pairs hierarchically. In this way, the image recognition problem is transformed into a latent structured labeling task. In the image description generation phase, a lexicalized probabilistic context-free tree growing scheme is integrated with a description generator, transforming description generation into a syntactic-tree generation process. In extensive comparisons with state-of-the-art image captioning methods on benchmark datasets, the proposed framework outperforms existing methods: it significantly improves the prediction of HOIs and of the relationships between HOIs and scenes (RHIS), and it generates image captions that are semantically and structurally more coherent.
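
    The "factored three-way interaction machine" above learns relational features from human, object, and scene cues. The paper's hierarchical formulation is not reproduced here; the sketch below (a simplification with assumed names and dimensions) only illustrates the generic idea of a factored three-way interaction, where the three feature vectors are projected into a shared factor space and combined multiplicatively before a relation label is predicted:

        import torch
        import torch.nn as nn

        class FactoredThreeWayInteraction(nn.Module):
            """Sketch: factored interaction of human, object, and scene features.
            Each modality is projected to a shared factor space; the element-wise
            product across factors captures their three-way interaction."""
            def __init__(self, human_dim, object_dim, scene_dim, num_factors=64, num_labels=10):
                super().__init__()
                self.Wh = nn.Linear(human_dim, num_factors, bias=False)
                self.Wo = nn.Linear(object_dim, num_factors, bias=False)
                self.Ws = nn.Linear(scene_dim, num_factors, bias=False)
                self.classify = nn.Linear(num_factors, num_labels)  # HOI / relation labels

            def forward(self, h, o, s):
                # h, o, s: (B, dim) feature vectors for the human, the object, and the scene
                interaction = self.Wh(h) * self.Wo(o) * self.Ws(s)  # (B, num_factors)
                return self.classify(interaction)                   # (B, num_labels) logits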

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; and (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them. (Published in the Journal of AI Research (JAIR), volume 61, pp. 75-170; 118 pages, 8 figures, 1 table.)

    Remote sensing image captioning with pre-trained transformer models

    Remote sensing images, and the unique properties that characterize them, are attracting increased attention from computer vision researchers, largely due to their many possible applications. The area of computer vision for remote sensing has seen many recent advances, e.g. in tasks such as object detection or scene classification. Recent work in the area has also addressed the task of generating a natural language description of a given remote sensing image, effectively combining techniques from both natural language processing and computer vision. Despite some previously published results, there are still many limitations and possibilities for improvement. It remains challenging to generate text that is fluent and linguistically rich while maintaining semantic consistency and good discrimination ability regarding the objects and visual patterns that should be described. The previous proposals that have come closest to achieving the goals of remote sensing image captioning have used neural encoder-decoder architectures, often including specialized attention mechanisms to help the system integrate the most relevant visual features while generating the textual descriptions. Taking previous work into consideration, this work proposes a new approach for remote sensing image captioning, using an encoder-decoder model based on the Transformer architecture, in which both the encoder and the decoder are built from components of a pre-existing model that was already trained with large amounts of data. Experiments were carried out using the three main datasets that exist for assessing remote sensing image captioning methods, namely the Sydney-Captions, UCM-Captions, and RSICD datasets. The results show improvements over some previous proposals, although, particularly on the larger RSICD dataset, they are still far from the current state-of-the-art methods. A careful analysis of the results also points to some limitations in the current evaluation methodology, which is mostly based on automated n-gram overlap metrics such as BLEU or ROUGE.
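
    Since the abstract does not name the pre-trained components used for the encoder and the decoder, the sketch below is only an illustration of the general recipe: pairing a pre-trained vision Transformer encoder with a pre-trained GPT-2 decoder through the HuggingFace transformers library. The model would still need fine-tuning on a remote sensing captioning dataset (e.g. RSICD) before the generated captions become meaningful, and the image path is hypothetical.

        from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
        from PIL import Image

        # Assemble an encoder-decoder captioner from two independently pre-trained models.
        model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
            "google/vit-base-patch16-224-in21k", "gpt2")
        processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
        tokenizer = AutoTokenizer.from_pretrained("gpt2")

        # GPT-2 has no padding token; reuse EOS, and tell the model how to start decoding.
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.pad_token_id
        model.config.decoder_start_token_id = tokenizer.bos_token_id

        image = Image.open("aerial_scene.jpg").convert("RGB")  # hypothetical input image
        pixel_values = processor(images=image, return_tensors="pt").pixel_values
        output_ids = model.generate(pixel_values, max_length=32, num_beams=3)
        print(tokenizer.decode(output_ids[0], skip_special_tokens=True))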