Image captioning models are usually evaluated on their ability to describe a
held-out set of images, not on their ability to generalize to unseen concepts.
We study the problem of compositional generalization, which measures how well a
model composes unseen combinations of concepts when describing images.
State-of-the-art image captioning models show poor generalization performance
on this task. We propose a multi-task model to address the poor performance,
that combines caption generation and image--sentence ranking, and uses a
decoding mechanism that re-ranks the captions according their similarity to the
image. This model is substantially better at generalizing to unseen
combinations of concepts compared to state-of-the-art captioning models.Comment: To appear at CoNLL 2019, EMNL