Guided Open Vocabulary Image Captioning with Constrained Beam Search
Existing image captioning models do not generalize well to out-of-domain
images containing novel scenes or objects. This limitation severely hinders the
use of these models in real-world applications dealing with images in the wild.
We address this problem using a flexible approach that enables existing deep
captioning architectures to take advantage of image taggers at test time,
without re-training. Our method uses constrained beam search to force the
inclusion of selected tag words in the output, and fixed, pretrained word
embeddings to facilitate vocabulary expansion to previously unseen tag words.
Using this approach we achieve state-of-the-art results for out-of-domain
captioning on MSCOCO (and improved results for in-domain captioning). Perhaps
surprisingly, our results significantly outperform approaches that incorporate
the same tag predictions into the learning algorithm. We also show that we can
significantly improve the quality of generated ImageNet captions by leveraging
ground-truth labels.
Comment: EMNLP 2017
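The forced-inclusion idea above can be illustrated with a toy sketch (not the paper's implementation): a beam search in which only hypotheses containing every required tag word count as complete captions. The `score_next` interface and the tiny vocabulary are hypothetical stand-ins for a trained captioning model.

```python
def constrained_beam_search(score_next, vocab, required, max_len=5, beam=3):
    """Toy constrained beam search.

    score_next(seq, word) -> additive score for appending `word` to `seq`
    (a hypothetical stand-in for a captioner's log-probability).
    Only hypotheses containing every word in `required` may be returned.
    """
    beams = [(0.0, [])]          # (cumulative score, token sequence)
    complete = []
    for _ in range(max_len):
        candidates = []
        for s, seq in beams:
            for w in vocab:
                candidates.append((s + score_next(seq, w), seq + [w]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam]                 # prune to beam width
        for s, seq in beams:
            if all(r in seq for r in required):   # constraint satisfied
                complete.append((s, seq))
    complete.sort(key=lambda c: c[0], reverse=True)
    return complete[0][1] if complete else None
```

For example, with a scorer that always prefers "cat", the unconstrained best caption would never mention "dog", but requiring the tag "dog" forces it into the returned sequence.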
Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural
language explicitly grounded in entities that object detectors find in the
image. Our approach reconciles classical slot filling approaches (that are
generally better grounded in images) with modern neural captioning approaches
(that are generally more natural sounding and accurate). Our approach first
generates a sentence 'template' with slot locations explicitly tied to specific
image regions. These slots are then filled in by visual concepts identified in
the regions by object detectors. The entire architecture (sentence template
generation and slot filling with object detectors) is end-to-end
differentiable. We verify the effectiveness of our proposed model on different
image captioning tasks. On standard image captioning and novel object
captioning, our model reaches state-of-the-art on both COCO and Flickr30k
datasets. We also demonstrate that our model has unique advantages when the
train and test distributions of scene compositions -- and hence language priors
of associated captions -- are different. Code has been made available at:
https://github.com/jiasenlu/NeuralBabyTalk
Comment: 12 pages, 7 figures, CVPR 2018
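The template-then-fill step described above can be sketched in a few lines (a toy illustration, not the end-to-end differentiable model): the generator emits a template with slot tokens tied to image regions, and detector labels fill the slots. The `<region-N>` token format and the `detections` mapping are assumptions for this sketch.

```python
import re

def fill_template(template, detections):
    """Fill slot tokens like <region-0> with detected object labels.

    `detections` maps region id -> label, a hypothetical stand-in for
    an object detector's output over the slotted image regions.
    """
    def repl(match):
        return detections[int(match.group(1))]
    return re.sub(r"<region-(\d+)>", repl, template)
```

For instance, `fill_template("A <region-0> sitting on a <region-1>", {0: "cat", 1: "table"})` yields `"A cat sitting on a table"`; in the actual model both the template and the fills are produced by learned, jointly trained components.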
TIGS: An Inference Algorithm for Text Infilling with Gradient Search
Text infilling is defined as a task for filling in the missing part of a
sentence or paragraph, which is suitable for many real-world natural language
generation scenarios. However, given a well-trained sequential generative
model, generating missing symbols conditioned on the context is challenging for
existing greedy approximate inference algorithms. In this paper, we propose an
iterative inference algorithm based on gradient search, which is the first
inference algorithm that can be broadly applied to any neural sequence
generative model for text infilling tasks. We compare the proposed method with
strong baselines on three text infilling tasks with various mask ratios and
different mask strategies. The results show that our proposed method is
effective and efficient for fill-in-the-blank tasks, consistently outperforming
all baselines.
Comment: The 57th Annual Meeting of the Association for Computational
Linguistics (ACL 2019)
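The gradient-search idea can be illustrated with a minimal sketch, assuming a differentiable stand-in score in place of a trained sequence model: optimize a continuous vector for the blank by gradient ascent, then project it to the nearest vocabulary embedding (the discretization step). The quadratic score and the tiny embedding table are assumptions for this sketch only.

```python
import numpy as np

def tigs_infill(context_vecs, vocab_embs, steps=100, lr=0.1):
    """Toy gradient-search infilling.

    Stand-in model score: -||x - c||^2, where c is the mean of the
    context embeddings (a hypothetical differentiable objective).
    The blank's continuous embedding x is optimized by gradient
    ascent, then projected to the nearest vocabulary embedding.
    """
    c = np.mean(context_vecs, axis=0)   # target implied by the "model"
    x = np.zeros_like(c)                # initialize the blank's embedding
    for _ in range(steps):
        grad = -2.0 * (x - c)           # gradient of -||x - c||^2
        x = x + lr * grad               # ascent step
    # discretize: nearest neighbour in the embedding table
    dists = np.linalg.norm(vocab_embs - x, axis=1)
    return int(np.argmin(dists))
```

In TIGS proper, the gradient comes from a trained sequence model's loss and the search alternates optimization with projection; the structure here only mirrors that continuous-optimize-then-project loop.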