Image caption generation is a long-standing and challenging problem at the
intersection of computer vision and natural language processing. A number of
recently proposed approaches utilize a fully supervised object recognition
model within the captioning pipeline. Such models, however, tend to generate
sentences that mention only the objects predicted by the recognition model,
omitting instances of classes for which no labelled training examples are available. In this
paper, we propose a new challenging scenario that targets the image captioning
problem in a fully zero-shot learning setting, where the goal is to
generate captions for test images containing object classes that are never seen during
training. The proposed approach jointly uses a novel zero-shot object detection
model and a template-based sentence generator. Our experiments show promising
results on the COCO dataset.
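
As a rough illustration of the second component, a template-based sentence generator can turn detector outputs into a caption by filling slots in a fixed sentence pattern. The sketch below is a minimal assumption of how such a generator might consume (zero-shot) detection results; the detection format, confidence threshold, and templates are illustrative and not taken from the paper.

```python
# Hypothetical sketch of a template-based caption generator driven by
# object detection outputs. Class names, templates, and the detection
# format are illustrative assumptions, not the paper's actual code.

def generate_caption(detections):
    """Fill a simple sentence template with detected object labels.

    `detections` is assumed to be a list of (label, confidence) pairs,
    e.g. the output of a zero-shot detector that can score unseen classes.
    """
    # Keep reasonably confident detections, ordered by confidence.
    labels = [label for label, score in sorted(detections, key=lambda d: -d[1])
              if score > 0.5]
    if not labels:
        return "A photo."
    if len(labels) == 1:
        return f"A photo of a {labels[0]}."
    # Join multiple objects into a single templated sentence.
    return "A photo of a " + ", a ".join(labels[:-1]) + f" and a {labels[-1]}."

# Example usage with made-up detector outputs (including an unseen class).
print(generate_caption([("zebra", 0.91), ("person", 0.74)]))
# -> "A photo of a zebra and a person."
```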