4,635 research outputs found

    What does a Car-ssette tape tell?

    Full text link
    Captioning has attracted much attention in image and video understanding while little work examines audio captioning. This paper contributes a manually-annotated dataset on car scene, in extension to a previously published hospital audio captioning dataset. An encoder-decoder model with pretrained word embeddings and additional sentence loss is proposed. This current model can accelerate the training process and generate semantically correct but unseen unique sentences. We test the model on the current car dataset, previous Hospital Dataset and the Joint Dataset, indicating its generalization capability across different scenes. Further, we make an effort to provide a better objective evaluation metric, namely the BERT similarity score. It compares the semantic-level similarity and compensates for drawbacks of N-gram based metrics like BLEU, namely high scores for word-similar sentences. This new metric demonstrates higher correlation with human evaluation. However, though detailed audio captions can now be automatically generated, human annotations still outperform model captions in many aspects

    Exploring Visual Relationship for Image Captioning

    Full text link
    It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) architecture that novelly integrates both semantic and spatial object relationships into image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representations of each region proposed on objects are then refined by leveraging graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on LSTM-based captioning framework with attention mechanism for sentence generation. Extensive experiments are conducted on COCO image captioning dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on COCO testing set.Comment: ECCV 201

    Pointing Novel Objects in Image Captioning

    Full text link
    Image captioning has received significant attention with remarkable improvements in recent advances. Nevertheless, images in the wild encapsulate rich knowledge and cannot be sufficiently described with models built on image-caption pairs containing only in-domain objects. In this paper, we propose to address the problem by augmenting standard deep captioning architectures with object learners. Specifically, we present Long Short-Term Memory with Pointing (LSTM-P) --- a new architecture that facilitates vocabulary expansion and produces novel objects via pointing mechanism. Technically, object learners are initially pre-trained on available object recognition data. Pointing in LSTM-P then balances the probability between generating a word through LSTM and copying a word from the recognized objects at each time step in decoder stage. Furthermore, our captioning encourages global coverage of objects in the sentence. Extensive experiments are conducted on both held-out COCO image captioning and ImageNet datasets for describing novel objects, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, we obtain an average of 60.9% in F1 score on held-out COCO~dataset.Comment: CVPR 201

    Good News, Everyone! Context driven entity-aware captioning for news images

    Full text link
    Current image captioning systems perform at a merely descriptive level, essentially enumerating the objects in the scene and their relations. Humans, on the contrary, interpret images by integrating several sources of prior knowledge of the world. In this work, we aim to take a step closer to producing captions that offer a plausible interpretation of the scene, by integrating such contextual information into the captioning pipeline. For this we focus on the captioning of images used to illustrate news articles. We propose a novel captioning method that is able to leverage contextual information provided by the text of news articles associated with an image. Our model is able to selectively draw information from the article guided by visual cues, and to dynamically extend the output dictionary to out-of-vocabulary named entities that appear in the context source. Furthermore we introduce `GoodNews', the largest news image captioning dataset in the literature and demonstrate state-of-the-art results.Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019

    Recurrent Topic-Transition GAN for Visual Paragraph Generation

    Full text link
    A natural image usually conveys rich semantic content and can be viewed from different angles. Existing image description methods are largely restricted by small sets of biased visual paragraph annotations, and fail to cover rich underlying semantics. In this paper, we investigate a semi-supervised paragraph generative framework that is able to synthesize diverse and semantically coherent paragraph descriptions by reasoning over local semantic regions and exploiting linguistic knowledge. The proposed Recurrent Topic-Transition Generative Adversarial Network (RTT-GAN) builds an adversarial framework between a structured paragraph generator and multi-level paragraph discriminators. The paragraph generator generates sentences recurrently by incorporating region-based visual and language attention mechanisms at each step. The quality of generated paragraph sentences is assessed by multi-level adversarial discriminators from two aspects, namely, plausibility at sentence level and topic-transition coherence at paragraph level. The joint adversarial training of RTT-GAN drives the model to generate realistic paragraphs with smooth logical transition between sentence topics. Extensive quantitative experiments on image and video paragraph datasets demonstrate the effectiveness of our RTT-GAN in both supervised and semi-supervised settings. Qualitative results on telling diverse stories for an image also verify the interpretability of RTT-GAN.Comment: 10 pages, 6 figure

    Vision-to-Language Tasks Based on Attributes and Attention Mechanism

    Full text link
    Vision-to-language tasks aim to integrate computer vision and natural language processing together, which has attracted the attention of many researchers. For typical approaches, they encode image into feature representations and decode it into natural language sentences. While they neglect high-level semantic concepts and subtle relationships between image regions and natural language elements. To make full use of these information, this paper attempt to exploit the text guided attention and semantic-guided attention (SA) to find the more correlated spatial information and reduce the semantic gap between vision and language. Our method includes two level attention networks. One is the text-guided attention network which is used to select the text-related regions. The other is SA network which is used to highlight the concept-related regions and the region-related concepts. At last, all these information are incorporated to generate captions or answers. Practically, image captioning and visual question answering experiments have been carried out, and the experimental results have shown the excellent performance of the proposed approach.Comment: 15 pages, 6 figures, 50 reference

    CNN+CNN: Convolutional Decoders for Image Captioning

    Full text link
    Image captioning is a challenging task that combines the field of computer vision and natural language processing. A variety of approaches have been proposed to achieve the goal of automatically describing an image, and recurrent neural network (RNN) or long-short term memory (LSTM) based models dominate this field. However, RNNs or LSTMs cannot be calculated in parallel and ignore the underlying hierarchical structure of a sentence. In this paper, we propose a framework that only employs convolutional neural networks (CNNs) to generate captions. Owing to parallel computing, our basic model is around 3 times faster than NIC (an LSTM-based model) during training time, while also providing better results. We conduct extensive experiments on MSCOCO and investigate the influence of the model width and depth. Compared with LSTM-based models that apply similar attention mechanisms, our proposed models achieves comparable scores of BLEU-1,2,3,4 and METEOR, and higher scores of CIDEr. We also test our model on the paragraph annotation dataset, and get higher CIDEr score compared with hierarchical LSTM

    A sequential guiding network with attention for image captioning

    Full text link
    The recent advances of deep learning in both computer vision (CV) and natural language processing (NLP) provide us a new way of understanding semantics, by which we can deal with more challenging tasks such as automatic description generation from natural images. In this challenge, the encoder-decoder framework has achieved promising performance when a convolutional neural network (CNN) is used as image encoder and a recurrent neural network (RNN) as decoder. In this paper, we introduce a sequential guiding network that guides the decoder during word generation. The new model is an extension of the encoder-decoder framework with attention that has an additional guiding long short-term memory (LSTM) and can be trained in an end-to-end manner by using image/descriptions pairs. We validate our approach by conducting extensive experiments on a benchmark dataset, i.e., MS COCO Captions. The proposed model achieves significant improvement comparing to the other state-of-the-art deep learning models.Comment: 5 pages, 2 figures, 1 table, IEEE ICASSP 201

    Object Hallucination in Image Captioning

    Full text link
    Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which score best on standard sentence metrics do not always have lower hallucination and that models which hallucinate more tend to make errors driven by language priors.Comment: Rohrbach and Hendricks contributed equally; accepted to EMNLP 201

    A Comprehensive Survey of Deep Learning for Image Captioning

    Full text link
    Generating a description of an image is called image captioning. Image captioning requires to recognize the important objects, their attributes and their relationships in an image. It also needs to generate syntactically and semantically correct sentences. Deep learning-based techniques are capable of handling the complexities and challenges of image captioning. In this survey paper, we aim to present a comprehensive review of existing deep learning-based image captioning techniques. We discuss the foundation of the techniques to analyze their performances, strengths and limitations. We also discuss the datasets and the evaluation metrics popularly used in deep learning based automatic image captioning.Comment: 36 Pages, Accepted as a Journal Paper in ACM Computing Surveys (October 2018
    corecore