Constructing an organized dataset composed of a large number of images and
several captions for each image is a laborious task that requires vast human
effort. On the other hand, collecting a large number of images and sentences
separately can be far easier. In this paper, we develop a novel
data-efficient semi-supervised framework for training an image captioning
model. We leverage massive unpaired image and caption data by learning to
associate them. To this end, our proposed semi-supervised learning method
assigns pseudo-labels to unpaired samples via Generative Adversarial Networks
to learn the joint distribution of images and captions. To evaluate, we construct
the scarcely-paired COCO dataset, a modified version of the MS COCO caption dataset.
The empirical results show the effectiveness of our method compared to several
strong baselines, especially when paired samples are scarce.
Comment: EMNLP 2019. Project page:
https://sites.google.com/view/emnlp19scarcecaptio