
    AIC-AB NET: A Neural Network for Image Captioning with Spatial Attention and Text Attributes

    Image captioning is a significant field spanning computer vision and natural language processing. We propose AIC-AB NET, a novel Attribute-Information-Combined Attention-Based Network that combines a spatial attention architecture with text attributes in an encoder-decoder framework. During caption generation, adaptive spatial attention determines which image region best represents the image and whether to attend to the visual features or the visual sentinel. Text attribute information is fed into the decoder synchronously to aid image recognition and reduce uncertainty. We have tested and evaluated AIC-AB NET on the MS COCO dataset and a newly proposed Fashion dataset, which serves as a benchmark of single-object images. The results show the superior performance of the proposed model compared to the state-of-the-art baseline and ablated models on both MS COCO images and our single-object images. AIC-AB NET outperforms the baseline adaptive attention network by 0.017 (CIDEr score) on the MS COCO dataset and by 0.095 (CIDEr score) on the Fashion dataset.
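    A minimal sketch of the adaptive spatial attention with a visual sentinel described above, assuming LSTM-style decoder states; the module layout, dimensions, and the way text attributes enter the decoder are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code) of adaptive spatial attention with a
# visual sentinel. V holds K spatial CNN features, h the decoder state, s the
# visual sentinel; the softmax weight assigned to the sentinel acts as a gate
# deciding how much to rely on visual evidence versus the language model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpatialAttention(nn.Module):
    def __init__(self, dim, attn_dim):
        super().__init__()
        self.w_v = nn.Linear(dim, attn_dim)   # project spatial features
        self.w_h = nn.Linear(dim, attn_dim)   # project decoder state
        self.w_s = nn.Linear(dim, attn_dim)   # project visual sentinel
        self.w_a = nn.Linear(attn_dim, 1)     # scalar attention scores

    def forward(self, V, h, s):
        # V: (B, K, dim), h: (B, dim), s: (B, dim)
        z_v = self.w_a(torch.tanh(self.w_v(V) + self.w_h(h).unsqueeze(1))).squeeze(-1)  # (B, K)
        z_s = self.w_a(torch.tanh(self.w_s(s) + self.w_h(h)))                           # (B, 1)
        alpha = F.softmax(torch.cat([z_v, z_s], dim=1), dim=1)                          # (B, K+1)
        beta = alpha[:, -1:]                                  # sentinel gate
        c = (alpha[:, :-1].unsqueeze(-1) * V).sum(dim=1)      # attended visual context
        return beta * s + (1.0 - beta) * c                    # adaptive context vector
```

    In such a sketch, the text attributes mentioned in the abstract would enter as an additional embedding concatenated to the word embedding at each decoding step, and the returned context vector would be combined with the decoder state to predict the next word.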

    Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning

    High-resolution remote sensing images are now widely available thanks to the progress of remote sensing technology. In contrast to popular remote sensing tasks such as scene classification, image captioning provides comprehensible information about such images by summarizing their content in human-readable text. Most existing remote sensing image captioning methods are based on deep learning encoder–decoder frameworks that use a convolutional neural network or a recurrent neural network as the backbone. Such frameworks show a limited capability to analyze sequential data and to cope with the lack of captioned remote sensing training images. The recently introduced Transformer architecture exploits self-attention to obtain superior performance on sequence-analysis tasks. Inspired by this, we employ a Transformer as the encoder–decoder for remote sensing image captioning. Moreover, to deal with the limited training data, an auxiliary decoder is used that further helps the encoder during training. The auxiliary decoder is trained for multilabel scene classification because of its conceptual similarity to image captioning and its capability of highlighting semantic classes. To the best of our knowledge, this is the first work exploiting multilabel classification to improve remote sensing image captioning. Experimental results on the University of California (UC)-Merced caption dataset show the efficacy of the proposed method. The implementation is available at https://gitlab.lrz.de/ai4eo/captioningMultilabel
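    A minimal sketch of the overall idea, assuming PyTorch and grid features from a CNN backbone. It is not the linked implementation; the pooling choice, dimensions, and loss weighting are assumptions. The auxiliary multilabel head shares the Transformer encoder with the captioning decoder.

```python
# Hedged sketch: Transformer encoder-decoder captioner with an auxiliary
# multilabel scene-classification head on the encoder output (positional
# encodings omitted for brevity). Not the repository's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionerWithMultilabelAux(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000, num_classes=21, feat_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, d_model)        # project CNN grid features
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.caption_head = nn.Linear(d_model, vocab_size)      # main captioning output
        self.multilabel_head = nn.Linear(d_model, num_classes)  # auxiliary output

    def forward(self, img_feats, captions):
        # img_feats: (B, K, feat_dim) grid features, captions: (B, T) token ids
        memory = self.transformer.encoder(self.img_proj(img_feats))
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1))
        dec_out = self.transformer.decoder(self.tok_emb(captions), memory, tgt_mask=tgt_mask)
        caption_logits = self.caption_head(dec_out)               # (B, T, vocab_size)
        class_logits = self.multilabel_head(memory.mean(dim=1))   # (B, num_classes)
        return caption_logits, class_logits

# Training would combine the two objectives, e.g.
# loss = F.cross_entropy(caption_logits.transpose(1, 2), target_tokens) \
#        + aux_weight * F.binary_cross_entropy_with_logits(class_logits, label_vector)
```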

    The Role of Syntactic Planning in Compositional Image Captioning

    Image captioning research has focused on generalizing to images drawn from the same distribution as the training set, rather than on the more challenging problem of generalizing to different distributions of images. Recently, Nikolaus et al. (2019) introduced a dataset to assess compositional generalization in image captioning, where models are evaluated on their ability to describe images with unseen adjective-noun and noun-verb compositions. In this work, we investigate different methods to improve compositional generalization by planning the syntactic structure of a caption. Our experiments show that jointly modeling tokens and syntactic tags enhances generalization in both RNN- and Transformer-based models, while also improving performance on standard metrics. (Accepted at EACL 2021.)
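    A minimal sketch of one way to jointly model tokens and syntactic tags, assuming an LSTM decoder that predicts a tag before each token and conditions the token prediction on the tag embedding; the interface and conditioning are assumptions, not the authors' models.

```python
# Hedged sketch of joint token/syntactic-tag modeling: at each step the decoder
# first predicts a syntactic tag (the "plan"), then predicts the token conditioned
# on both the hidden state and the tag embedding. Not the paper's implementation.
import torch
import torch.nn as nn

class SyntaxPlanningDecoder(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.tag_emb = nn.Embedding(num_tags, embed_dim)
        self.rnn = nn.LSTMCell(embed_dim, hidden_dim)
        self.tag_head = nn.Linear(hidden_dim, num_tags)
        self.tok_head = nn.Linear(hidden_dim + embed_dim, vocab_size)

    def step(self, prev_word, state=None):
        # prev_word: (B,) token ids; state: optional (h, c) from the previous step
        h, c = self.rnn(self.word_emb(prev_word), state)
        tag_logits = self.tag_head(h)                 # plan the syntactic tag first
        tag = tag_logits.argmax(dim=-1)               # gold tags would be teacher-forced in training
        tok_logits = self.tok_head(torch.cat([h, self.tag_emb(tag)], dim=-1))
        return tag_logits, tok_logits, (h, c)
```

    Training would sum cross-entropy losses over the tag and token distributions; visual features could condition the decoder through attention as in standard captioning models.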