Automatic caption generation for news images
This thesis is concerned with the task of automatically generating captions for images,
which is important for many image-related applications. Automatic description generation
for video frames would help security authorities manage and utilize large volumes
of monitoring data more efficiently. Image search engines could potentially benefit
from image description in supporting more accurate and targeted queries for end
users. Importantly, generating image descriptions would aid blind or partially sighted
people who cannot access visual information in the same way as sighted people can.
However, previous work has relied on fine-grained resources, manually created for specific
domains and applications. In this thesis, we explore the feasibility of automatic
caption generation for news images in a knowledge-lean way. We depart from previous
work, as we learn a model of caption generation from publicly available data that
has not been explicitly labelled for our task. The model consists of two components,
namely extracting image content and rendering it in natural language.
Specifically, we exploit data resources where images and their textual descriptions
co-occur naturally. We present a new dataset consisting of news articles, images, and
their captions that we acquired from the BBC News website. Rather than laboriously
annotating images with keywords, we simply treat the captions as the labels. We show
that it is possible to learn the visual and textual correspondence under such noisy conditions
by extending an existing generative annotation model (Lavrenko et al., 2003).
We also find that the accompanying news documents substantially complement the
extraction of the image content. In order to better model and represent
image content, we propose a probabilistic image annotation model that exploits
the synergy between visual and textual modalities under the assumption that images
and their textual descriptions are generated by a shared set of latent variables (topics).
Using Latent Dirichlet Allocation (Blei and Jordan, 2003), we represent visual and
textual modalities jointly as a probability distribution over a set of topics. Our model
takes these topic distributions into account while finding the most likely keywords for
an image and its associated document.
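
As a rough illustration of how topic distributions can drive keyword selection, the sketch below ranks words by marginalising over topics, i.e. scoring each word w by the sum over topics k of p(w | k) * p(k | image, document). This is a simplified stand-in and not the thesis's actual model; the function name rank_keywords, the two toy topics, and all probabilities are made up for the example.

```python
from typing import Dict, List


def rank_keywords(topic_mix: List[float],
                  word_given_topic: List[Dict[str, float]],
                  top_n: int = 3) -> List[str]:
    """Rank words by sum_k p(word | topic k) * p(topic k | image, document)."""
    scores: Dict[str, float] = {}
    for p_topic, word_dist in zip(topic_mix, word_given_topic):
        for word, p_word in word_dist.items():
            scores[word] = scores.get(word, 0.0) + p_topic * p_word
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Hypothetical topic mixture inferred for one image-document pair.
topic_mix = [0.8, 0.2]
# Hypothetical per-topic word distributions (truncated to a few words).
word_given_topic = [
    {"election": 0.5, "minister": 0.4, "match": 0.1},
    {"match": 0.6, "goal": 0.3, "minister": 0.1},
]

keywords = rank_keywords(topic_mix, word_given_topic, top_n=2)
print(keywords)
```

In a real system the topic mixture and the per-topic word distributions would be estimated from the image features and the accompanying document rather than written by hand.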
The availability of news documents in our dataset allows us to perform the caption
generation task in a fashion akin to text summarization, with one important difference:
our model is not solely based on text but uses the image in order to select content
from the document that should be present in the caption. We propose both extractive
and abstractive caption generation models to render the extracted image content
in natural language without relying on rich knowledge resources, sentence templates,
or grammars. The backbone for both approaches is our topic-based image annotation
model. Our extractive models examine how to best select sentences that overlap in
content with our image annotation model. We adapt an existing abstractive headline
generation model to our scenario by incorporating visual information. Our own
model operates over image description keywords and document phrases by taking dependency
and word order constraints into account. Experimental results show that both
approaches can generate human-readable captions for news images. Our phrase-based
abstractive model yields captions as informative as those written by
BBC journalists.
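
The extractive idea described above, selecting the document sentence that best overlaps with the image annotation keywords, can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring used in the thesis: the function select_caption_sentence, the simple weighted word-overlap score, and the toy sentences and weights are all hypothetical.

```python
import re
from typing import Dict, List


def select_caption_sentence(sentences: List[str],
                            keyword_weights: Dict[str, float]) -> str:
    """Return the sentence with the highest summed weight of matched keywords."""
    def score(sentence: str) -> float:
        # Lowercase word tokens; a set so repeated words count once.
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        return sum(w for k, w in keyword_weights.items() if k in tokens)

    # max() keeps the earliest sentence when scores tie.
    return max(sentences, key=score)


# Hypothetical document sentences and image keyword weights.
sentences = [
    "The minister spoke after the election result was announced.",
    "Fans celebrated the late goal.",
    "Weather was mild on Tuesday.",
]
keyword_weights = {"election": 0.4, "minister": 0.34, "goal": 0.06}

best = select_caption_sentence(sentences, keyword_weights)
print(best)
```

Weighted word overlap is only one possible similarity measure; comparing sentences and images in a shared topic space would be another option in the same spirit.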