Compositional Generalization in Image Captioning
Image captioning models are usually evaluated on their ability to describe a
held-out set of images, not on their ability to generalize to unseen concepts.
We study the problem of compositional generalization, which measures how well a
model composes unseen combinations of concepts when describing images.
State-of-the-art image captioning models show poor generalization performance
on this task. To address this, we propose a multi-task model that combines
caption generation and image--sentence ranking, and uses a decoding mechanism
that re-ranks the captions according to their similarity to the
image. This model is substantially better at generalizing to unseen
combinations of concepts compared to state-of-the-art captioning models.

Comment: To appear at CoNLL 2019, EMNLP
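
As a rough illustration of the re-ranking step described above, the following Python sketch interpolates a beam candidate's language-model score with an image--sentence similarity score; the caption_model and ranking_model interfaces are hypothetical stand-ins, not the authors' actual API.

def rerank_captions(image, caption_model, ranking_model, beam_size=5, alpha=0.5):
    """Generate beam candidates and re-rank them by similarity to the image."""
    # Each candidate is assumed to be a (caption, log_probability) pair.
    candidates = caption_model.generate_beam(image, beam_size=beam_size)

    def combined_score(candidate):
        caption, log_prob = candidate
        # Interpolate the caption log-probability with the similarity score
        # produced by the image--sentence ranking component (assumed in [0, 1]).
        return alpha * log_prob + (1.0 - alpha) * ranking_model.similarity(image, caption)

    # Best candidate first under the combined score.
    return sorted(candidates, key=combined_score, reverse=True)

The interpolation weight alpha trades off fluency (caption score) against visual grounding (similarity score) and would typically be tuned on a validation set.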
The Role of Syntactic Planning in Compositional Image Captioning
Image captioning research has focused on generalizing to images drawn from the
same distribution as the training set, rather than on the more challenging
problem of generalizing to different distributions of images. Recently, Nikolaus et al.
(2019) introduced a dataset to assess compositional generalization in image
captioning, where models are evaluated on their ability to describe images with
unseen adjective-noun and noun-verb compositions. In this work, we investigate
different methods to improve compositional generalization by planning the
syntactic structure of a caption. Our experiments show that jointly modeling
tokens and syntactic tags enhances generalization in both RNN- and
Transformer-based models, while also improving performance on standard metrics.

Comment: Accepted at EACL 2021
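
One simple way to realize joint modeling of tokens and syntactic tags is to interleave a tag before each word, so that a standard sequence decoder predicts the syntactic plan and the surface token in alternation. The Python sketch below shows this interleaving on a toy caption; the tag inventory and the exact interleaving scheme are illustrative assumptions, not necessarily the paper's formulation.

def interleave_tags(tokens, tags):
    """Build a single target sequence of alternating syntactic tags and tokens."""
    assert len(tokens) == len(tags)
    sequence = []
    for tag, token in zip(tags, tokens):
        sequence.append(f"<{tag}>")  # e.g. "<NOUN>", "<VERB>"
        sequence.append(token)
    return sequence

print(interleave_tags(["a", "dog", "catches", "a", "frisbee"],
                      ["DET", "NOUN", "VERB", "DET", "NOUN"]))
# ['<DET>', 'a', '<NOUN>', 'dog', '<VERB>', 'catches', '<DET>', 'a', '<NOUN>', 'frisbee']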
Video Captioning with Guidance of Multimodal Latent Topics
The topic diversity of open-domain videos leads to various vocabularies and
linguistic expressions in describing video contents, and therefore, makes the
video captioning task even more challenging. In this paper, we propose a
unified caption framework, M&M TGM, which mines multimodal topics in an
unsupervised fashion from data and guides the caption decoder with these
topics. Compared to pre-defined topics, the mined multimodal topics are more
semantically and visually coherent and can reflect the topic distribution of
videos better. We formulate the topic-aware caption generation as a multi-task
learning problem, in which we add a parallel task, topic prediction, in
addition to the caption task. For the topic prediction task, we use the mined
topics as the teacher to train a student topic prediction model, which learns
to predict the latent topics from multimodal contents of videos. The topic
prediction provides intermediate supervision to the learning process. As for
the caption task, we propose a novel topic-aware decoder to generate more
accurate and detailed video descriptions with the guidance from latent topics.
The entire learning procedure is end-to-end and it optimizes both tasks
simultaneously. The results from extensive experiments conducted on the MSR-VTT
and Youtube2Text datasets demonstrate the effectiveness of our proposed model.
M&M TGM not only outperforms prior state-of-the-art methods on multiple
evaluation metrics and on both benchmark datasets, but also achieves better
generalization ability.

Comment: ACM Multimedia 2017
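
The multi-task objective described above can be sketched as a weighted sum of a captioning loss and a topic-prediction loss, where the student's predicted topic distribution is pulled toward the mined (teacher) topics. The PyTorch-style snippet below is a minimal illustration under assumed tensor shapes and an assumed weighting scheme, not the authors' exact formulation.

import torch.nn.functional as F

def multitask_loss(caption_logits, caption_targets,
                   topic_logits, teacher_topic_dist, topic_weight=0.5):
    # Token-level cross-entropy for the caption generation task.
    caption_loss = F.cross_entropy(
        caption_logits.view(-1, caption_logits.size(-1)),
        caption_targets.view(-1),
    )
    # KL divergence pushes the predicted topic distribution toward the
    # topic distribution mined offline for the same video (the teacher).
    topic_loss = F.kl_div(
        F.log_softmax(topic_logits, dim=-1),
        teacher_topic_dist,
        reduction="batchmean",
    )
    # Both tasks are optimized jointly, end to end.
    return caption_loss + topic_weight * topic_loss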