4 research outputs found
Generating Diverse and Meaningful Captions: Unsupervised Specificity Optimization for Image Captioning
Image Captioning is a task that requires models to acquire a multi-modal understanding of the world and to express this understanding in natural language text. While the state-of-the-art for this task has rapidly improved in terms of n-gram metrics, these models tend to output the same generic captions for similar images. In this work, we address this limitation and train a model that generates more diverse and specific captions through an unsupervised training approach that incorporates a learning signal from an Image Retrieval model. We summarize previous results and improve the state-of-the-art on caption diversity and novelty.
We make our source code publicly available online: https://github.com/AnnikaLindh/Diverse_and_Specific_Image_Captionin
Comment: Accepted for presentation at The 27th International Conference on Artificial Neural Networks (ICANN 2018)
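The abstract describes using a learning signal from an Image Retrieval model: a caption is rewarded when it retrieves its own image rather than similar ones. The paper's actual training objective is not reproduced here; the following is a minimal NumPy sketch of one plausible reading, in which a hypothetical `retrieval_reward` scores a caption by how many distractor images its embedding outranks in a shared caption–image embedding space. All embeddings and dimensions are illustrative assumptions.

```python
import numpy as np

def retrieval_reward(caption_emb, image_emb, distractor_embs):
    """Hypothetical reward: fraction of distractor images that the
    caption's own image outranks under cosine similarity. A generic
    caption matches many images and earns a low reward; a specific
    caption singles out its own image and earns a high one."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    target_score = cos(caption_emb, image_emb)
    beaten = sum(target_score > cos(caption_emb, d) for d in distractor_embs)
    return beaten / len(distractor_embs)

# Toy joint-embedding space: the caption vector points toward its image.
rng = np.random.default_rng(0)
image = np.array([1.0, 0.0, 0.0])
caption = np.array([0.9, 0.1, 0.0])           # specific caption, close to its image
distractors = [rng.normal(size=3) for _ in range(20)]
reward = retrieval_reward(caption, image, distractors)
```

Such a reward is non-differentiable with respect to the generated words, which is presumably why an unsupervised (e.g. policy-gradient-style) training signal is needed rather than a standard cross-entropy loss.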
Entity-Grounded Image Captioning
A key limitation of current Image Captioning models is their tendency to produce generic captions that omit the interesting details that make each image unique. To address this limitation, we propose an approach that enforces a stronger alignment between image regions and specific segments of text. The model architecture is composed of a visual region proposer, a region-order planner and a region-guided caption generator. The region-guided caption generator incorporates a novel information gate which allows visual and textual inputs of different frequencies and dimensionalities within a Recurrent Neural Network.
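The abstract mentions an information gate that reconciles visual and textual inputs of different dimensionalities inside an RNN. The paper's exact formulation is not shown here; this is a minimal NumPy sketch of one plausible design, where both inputs are first projected to the hidden size and a learned sigmoid gate mixes them per unit before the recurrent update. The function name, weight shapes, and gating form are all assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def information_gate(h, visual, textual, Wg, Wv, Wt):
    """Hypothetical gate: projects inputs of different dimensionalities
    into the hidden size, then computes per-unit mixing weights g from
    the previous hidden state and both projected inputs."""
    v = Wv @ visual                               # visual features -> hidden size
    t = Wt @ textual                              # word features   -> hidden size
    g = sigmoid(Wg @ np.concatenate([h, v, t]))   # per-unit gate in (0, 1)
    return g * v + (1.0 - g) * t                  # gated fusion fed to the RNN step

hidden, vis_dim, txt_dim = 4, 6, 3
rng = np.random.default_rng(1)
fused = information_gate(
    h=rng.normal(size=hidden),
    visual=rng.normal(size=vis_dim),
    textual=rng.normal(size=txt_dim),
    Wg=rng.normal(size=(hidden, 3 * hidden)),
    Wv=rng.normal(size=(hidden, vis_dim)),
    Wt=rng.normal(size=(hidden, txt_dim)),
)
```

Because the gate is recomputed at every timestep, the model can lean on visual features when it generates a region-grounded word and on textual context otherwise, which matches the stated goal of handling inputs arriving at different frequencies.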