Guided Open Vocabulary Image Captioning with Constrained Beam Search
Existing image captioning models do not generalize well to out-of-domain
images containing novel scenes or objects. This limitation severely hinders the
use of these models in real-world applications dealing with images in the wild.
We address this problem using a flexible approach that enables existing deep
captioning architectures to take advantage of image taggers at test time,
without re-training. Our method uses constrained beam search to force the
inclusion of selected tag words in the output, and fixed, pretrained word
embeddings to facilitate vocabulary expansion to previously unseen tag words.
Using this approach we achieve state-of-the-art results for out-of-domain
captioning on MSCOCO (and improved results for in-domain captioning). Perhaps
surprisingly, our results significantly outperform approaches that incorporate
the same tag predictions into the learning algorithm. We also show that we can
significantly improve the quality of generated ImageNet captions by leveraging
ground-truth labels.
Comment: EMNLP 2017
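The forced-inclusion idea above can be illustrated with a toy sketch (not the paper's implementation): a beam search in which only hypotheses containing every required tag word count as complete captions. The `score_next` interface and the tiny vocabulary are hypothetical stand-ins for a trained captioning model.

```python
def constrained_beam_search(score_next, vocab, required, max_len=5, beam=3):
    """Toy constrained beam search.

    score_next(seq, word) -> additive score for appending `word` to `seq`
    (a hypothetical stand-in for a captioner's log-probability).
    Only hypotheses containing every word in `required` may be returned.
    """
    beams = [(0.0, [])]          # (cumulative score, token sequence)
    complete = []
    for _ in range(max_len):
        candidates = []
        for s, seq in beams:
            for w in vocab:
                candidates.append((s + score_next(seq, w), seq + [w]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam]                 # prune to beam width
        for s, seq in beams:
            if all(r in seq for r in required):   # constraint satisfied
                complete.append((s, seq))
    complete.sort(key=lambda c: c[0], reverse=True)
    return complete[0][1] if complete else None
```

For example, with a scorer that always prefers "cat", the unconstrained best caption would never mention "dog", but requiring the tag "dog" forces it into the returned sequence.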
Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural
language explicitly grounded in entities that object detectors find in the
image. Our approach reconciles classical slot filling approaches (that are
generally better grounded in images) with modern neural captioning approaches
(that are generally more natural sounding and accurate). Our approach first
generates a sentence 'template' with slot locations explicitly tied to specific
image regions. These slots are then filled in by visual concepts identified in
the regions by object detectors. The entire architecture (sentence template
generation and slot filling with object detectors) is end-to-end
differentiable. We verify the effectiveness of our proposed model on different
image captioning tasks. On standard image captioning and novel object
captioning, our model reaches state-of-the-art on both COCO and Flickr30k
datasets. We also demonstrate that our model has unique advantages when the
train and test distributions of scene compositions -- and hence language priors
of associated captions -- are different. Code has been made available at:
https://github.com/jiasenlu/NeuralBabyTalk
Comment: 12 pages, 7 figures, CVPR 2018
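The template-then-fill step described above can be sketched in a few lines (a toy illustration, not the end-to-end differentiable model): the generator emits a template with slot tokens tied to image regions, and detector labels fill the slots. The `<region-N>` token format and the `detections` mapping are assumptions for this sketch.

```python
import re

def fill_template(template, detections):
    """Fill slot tokens like <region-0> with detected object labels.

    `detections` maps region id -> label, a hypothetical stand-in for
    an object detector's output over the slotted image regions.
    """
    def repl(match):
        return detections[int(match.group(1))]
    return re.sub(r"<region-(\d+)>", repl, template)
```

For instance, `fill_template("A <region-0> sitting on a <region-1>", {0: "cat", 1: "table"})` yields `"A cat sitting on a table"`; in the actual model both the template and the fills are produced by learned, jointly trained components.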
TIGS: An Inference Algorithm for Text Infilling with Gradient Search
Text infilling is defined as a task for filling in the missing part of a
sentence or paragraph, which is suitable for many real-world natural language
generation scenarios. However, given a well-trained sequential generative
model, generating missing symbols conditioned on the context is challenging for
existing greedy approximate inference algorithms. In this paper, we propose an
iterative inference algorithm based on gradient search, which is the first
inference algorithm that can be broadly applied to any neural sequence
generative model for text infilling tasks. We compare the proposed method with
strong baselines on three text infilling tasks with various mask ratios and
different mask strategies. The results show that our proposed method is
effective and efficient for fill-in-the-blank tasks, consistently outperforming
all baselines.
Comment: The 57th Annual Meeting of the Association for Computational
Linguistics (ACL 2019)
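The gradient-search idea can be illustrated with a minimal sketch, assuming a differentiable stand-in score in place of a trained sequence model: optimize a continuous vector for the blank by gradient ascent, then project it to the nearest vocabulary embedding (the discretization step). The quadratic score and the tiny embedding table are assumptions for this sketch only.

```python
import numpy as np

def tigs_infill(context_vecs, vocab_embs, steps=100, lr=0.1):
    """Toy gradient-search infilling.

    Stand-in model score: -||x - c||^2, where c is the mean of the
    context embeddings (a hypothetical differentiable objective).
    The blank's continuous embedding x is optimized by gradient
    ascent, then projected to the nearest vocabulary embedding.
    """
    c = np.mean(context_vecs, axis=0)   # target implied by the "model"
    x = np.zeros_like(c)                # initialize the blank's embedding
    for _ in range(steps):
        grad = -2.0 * (x - c)           # gradient of -||x - c||^2
        x = x + lr * grad               # ascent step
    # discretize: nearest neighbour in the embedding table
    dists = np.linalg.norm(vocab_embs - x, axis=1)
    return int(np.argmin(dists))
```

In TIGS proper, the gradient comes from a trained sequence model's loss and the search alternates optimization with projection; the structure here only mirrors that continuous-optimize-then-project loop.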