Generating Natural Questions About an Image
There has been an explosion of work in the vision & language community during
the past few years from image captioning to video transcription, and answering
questions about images. These tasks have focused on literal descriptions of the
image. To move beyond the literal, we choose to explore how questions about an
image are often directed at commonsense inference and the abstract events
evoked by objects in the image. In this paper, we introduce the novel task of
Visual Question Generation (VQG), where the system is tasked with asking a
natural and engaging question when shown an image. We provide three datasets
which cover a variety of images from object-centric to event-centric, with
considerably more abstract training data than provided to state-of-the-art
captioning systems thus far. We train and test several generative and retrieval
models to tackle the task of VQG. Evaluation results show that while such
models ask reasonable questions for a variety of images, there is still a wide
gap with human performance which motivates further work on connecting images
with commonsense knowledge and pragmatics. Our proposed task offers a new
challenge to the community which we hope furthers interest in exploring deeper
connections between vision & language.
Comment: Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics
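The retrieval models mentioned in the abstract can be pictured as nearest-neighbor lookup: given a new image, return the question paired with the most similar training image in some feature space. This is a minimal toy sketch of that idea, not the authors' actual model; the feature vectors and helper names are illustrative.

```python
# Toy retrieval baseline for Visual Question Generation: return the question
# attached to the nearest training image under cosine similarity.
def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve_question(query_feat, training_set):
    """training_set: list of (image_feature, question) pairs."""
    return max(training_set, key=lambda item: cosine(query_feat, item[0]))[1]

# Two hypothetical training images with associated questions.
train = [
    ([1.0, 0.0], "What breed is the dog?"),
    ([0.0, 1.0], "Who won the game?"),
]
print(retrieve_question([0.9, 0.2], train))
```

In practice the features would come from a pretrained image encoder rather than hand-set vectors, but the lookup logic is the same.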
SentiCap: Generating Image Descriptions with Sentiments
The recent progress on image recognition and language modeling is making
automatic description of image content a reality. However, stylized,
non-factual aspects of the written description are missing from the current
systems. One such style is descriptions with emotions, which is commonplace in
everyday communication, and influences decision-making and interpersonal
relationships. We design a system to describe an image with emotions, and
present a model that automatically generates captions with positive or negative
sentiments. We propose a novel switching recurrent neural network with
word-level regularization, which is able to produce emotional image captions
using only 2000+ training sentences containing sentiments. We evaluate the
captions with different automatic and crowd-sourcing metrics. Our model
compares favourably in common quality metrics for image captioning. In 84.6% of
cases the generated positive captions were judged as being at least as
descriptive as the factual captions. Of these positive captions, 88% were
confirmed by the crowd-sourced workers as having the appropriate sentiment.
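The word-level switching idea can be illustrated as a per-word mixture of two word distributions, one from a factual stream and one from a sentiment stream, weighted by a switch probability. This is a hedged toy sketch only; the actual SentiCap model learns both recurrent streams and the switch jointly, and the vocabularies and probabilities below are invented for illustration.

```python
# Toy word-level switch between a factual and a sentiment word model.
# gamma weights the sentiment stream; (1 - gamma) weights the factual one.
def mix_distributions(p_factual, p_sentiment, gamma):
    """Return the per-word mixture of two word distributions."""
    vocab = set(p_factual) | set(p_sentiment)
    return {w: (1 - gamma) * p_factual.get(w, 0.0)
               + gamma * p_sentiment.get(w, 0.0)
            for w in vocab}

# Hypothetical next-word distributions at one decoding step.
p_fact = {"dog": 0.6, "cat": 0.4}
p_sent = {"adorable": 0.7, "dog": 0.3}
mixed = mix_distributions(p_fact, p_sent, gamma=0.5)
print(max(mixed, key=mixed.get))
```

With a higher gamma the sentiment-bearing word "adorable" would dominate instead, which is the mechanism letting sentiment enter the caption only at selected words.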
Knowledge-rich Image Gist Understanding Beyond Literal Meaning
We investigate the problem of understanding the message (gist) conveyed by
images and their captions as found, for instance, on websites or news articles.
To this end, we propose a methodology to capture the meaning of image-caption
pairs on the basis of large amounts of machine-readable knowledge that has
previously been shown to be highly effective for text understanding. Our method
identifies the connotation of objects beyond their denotation: where most
approaches to image understanding focus on the denotation of objects, i.e.,
their literal meaning, our work addresses the identification of connotations,
i.e., iconic meanings of objects, to understand the message of images. We view
image understanding as the task of representing an image-caption pair on the
basis of a wide-coverage vocabulary of concepts such as the one provided by
Wikipedia, and cast gist detection as a concept-ranking problem with
image-caption pairs as queries. To enable a thorough investigation of the
problem of gist understanding, we produce a gold standard of over 300
image-caption pairs and over 8,000 gist annotations covering a wide variety of
topics at different levels of abstraction. We use this dataset to
experimentally benchmark the contribution of signals from heterogeneous
sources, namely image and text. The best result, with a Mean Average Precision
(MAP) of 0.69, indicates that by combining both dimensions we are able to better
understand the meaning of our image-caption pairs than when using language or
vision information alone. We test the robustness of our gist detection approach
when receiving automatically generated input, i.e., using automatically
generated image tags or generated captions, and prove the feasibility of an
end-to-end automated process.
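Since gist detection is cast as concept ranking with image-caption pairs as queries, the MAP figure above is the standard ranking metric: for each query, precision is recorded at every rank where a relevant concept appears, averaged per query, then averaged over queries. A minimal sketch, with toy queries and concept labels that are purely illustrative:

```python
def average_precision(ranked, relevant):
    """AP for one query: mean precision at each rank holding a relevant concept."""
    hits, precisions = 0, []
    for rank, concept in enumerate(ranked, start=1):
        if concept in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP over (ranked_concepts, relevant_concept_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Two toy image-caption queries with ranked gist concepts.
queries = [
    (["wedding", "love", "cake"], {"wedding", "love"}),
    (["sports", "victory", "team"], {"victory"}),
]
print(mean_average_precision(queries))  # → 0.75
```

Here the first query ranks both relevant concepts at the top (AP = 1.0) and the second places its single relevant concept at rank 2 (AP = 0.5), giving MAP = 0.75.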
Automatic Generation of Grounded Visual Questions
In this paper, we propose the first model to be able to generate visually
grounded questions with diverse types for a single image. Visual question
generation is an emerging topic which aims to ask questions in natural language
based on visual input. To the best of our knowledge, it lacks automatic methods
to generate meaningful questions with various types for the same visual input.
To circumvent the problem, we propose a model that automatically generates
visually grounded questions with varying types. Our model takes as input both
images and the captions generated by a dense caption model, samples the most
probable question types, and generates the questions in sequence. The
experimental results on two real world datasets show that our model outperforms
the strongest baseline in terms of both correctness and diversity by a wide
margin.
Comment: VQ
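The two-stage scheme described above, sample a question type first, then generate a question conditioned on that type, can be sketched with a toy sampler and templates. This is only an assumed illustration: the real model conditions on dense captions and uses a neural decoder, while the type priors and templates here are invented.

```python
import random

# Assumed prior over question types (illustrative values).
QUESTION_TYPES = {"what": 0.4, "where": 0.3, "why": 0.3}

def sample_type(rng):
    """Draw a question type from the categorical prior."""
    r, acc = rng.random(), 0.0
    for qtype, p in QUESTION_TYPES.items():
        acc += p
        if r < acc:
            return qtype
    return qtype  # fallback for floating-point edge cases

def generate_question(caption_object, rng):
    """Stage 2: realize a question of the sampled type about an object."""
    qtype = sample_type(rng)
    templates = {
        "what": f"What is the {caption_object} doing?",
        "where": f"Where is the {caption_object}?",
        "why": f"Why is the {caption_object} there?",
    }
    return templates[qtype]

rng = random.Random(0)
print(generate_question("dog", rng))
```

Sampling the type before decoding is what gives the diversity the abstract emphasizes: repeated calls with different random draws yield questions of different types for the same image.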