Image captioning models are able to generate grammatically correct, human-understandable sentences. However, most of the generated captions convey limited information, because the models are trained on datasets that do not caption all the objects encountered in everyday life. Due to this lack of prior information, most captions are biased toward only a few of the objects present in a scene, which limits their usefulness in daily life. In this paper, we show the biased nature of currently existing image captioning models and present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions. We further exploit state-of-the-art pre-trained image captioning and object recognition networks to annotate our images and show the limitations of existing works. Furthermore, to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF). Existing image captioning metrics can evaluate a caption only in the presence of its corresponding ground-truth annotations; in contrast, SF allows captions to be evaluated for images that have no annotations, making it highly useful for captions generated on real-life images.
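The exact formulation of SF is given in the paper body rather than in this abstract; the sketch below is only an assumed, simplified illustration of the idea of annotation-free evaluation. The function name semantic_fidelity, the exact-string matching, and the example detector labels are ours, not the paper's: a caption is scored by how many of the objects a pre-trained detector finds in the image are actually mentioned in the caption, so no ground-truth caption is needed.

# Illustrative sketch only -- not the paper's exact SF formula.
# Idea: score a caption by the fraction of detector-found objects it mentions,
# so no ground-truth (annotated) caption is required.

def semantic_fidelity(caption: str, detected_objects: list[str]) -> float:
    """Fraction of detected object labels mentioned in the caption."""
    tokens = set(caption.lower().split())
    labels = {obj.lower() for obj in detected_objects}
    if not labels:
        return 0.0
    mentioned = {obj for obj in labels if obj in tokens}
    return len(mentioned) / len(labels)

# Hypothetical usage with made-up detector output:
caption = "a man riding a bicycle down a street"
objects = ["person", "bicycle", "street"]  # labels from a pre-trained object detector
print(round(semantic_fidelity(caption, objects), 2))  # 0.67 under exact matching

A fuller version would also handle synonyms and multi-word labels (e.g. matching "man" to "person"), which simple exact matching misses.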
Comment: 15 pages, 25 figures. Accepted at the Machine Learning in Real Life (ML-IRL) ICLR 2020 Workshop.