16,172 research outputs found
Concadia: Towards Image-Based Text Generation with a Purpose
Current deep learning models often achieve excellent results on benchmark
image-to-text datasets but fail to generate texts that are useful in practice.
We argue that to close this gap, it is vital to distinguish descriptions from
captions based on their distinct communicative roles. Descriptions focus on
visual features and are meant to replace an image (often to increase
accessibility), whereas captions appear alongside an image to supply additional
information. To motivate this distinction and help people put it into practice,
we introduce the publicly available Wikipedia-based dataset Concadia consisting
of 96,918 images with corresponding English-language descriptions, captions,
and surrounding context. Using insights from Concadia, models trained on it,
and a preregistered human-subjects experiment with human- and model-generated
texts, we characterize the commonalities and differences between descriptions
and captions. In addition, we show that, for generating both descriptions and
captions, it is useful to augment image-to-text models with representations of
the textual context in which the image appeared.Comment: Proceedings of EMNLP 202
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often
use prior word-level knowledge. The current study aims to leverage visual
information in order to capture sentence level semantics without the need for
word embeddings. We use a multimodal sentence encoder trained on a corpus of
images with matching text captions to produce visually grounded sentence
embeddings. Deep Neural Networks are trained to map the two modalities to a
common embedding space such that for an image the corresponding caption can be
retrieved and vice versa. We show that our model achieves results comparable to
the current state-of-the-art on two popular image-caption retrieval benchmark
data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the
resulting sentence embeddings using the data from the Semantic Textual
Similarity benchmark task and show that the multimodal embeddings correlate
well with human semantic similarity judgements. The system achieves
state-of-the-art results on several of these benchmarks, which shows that a
system trained solely on multimodal data, without assuming any word
representations, is able to capture sentence level semantics. Importantly, this
result shows that we do not need prior knowledge of lexical level semantics in
order to model sentence level semantics. These findings demonstrate the
importance of visual information in semantics
Semantic bottleneck for computer vision tasks
This paper introduces a novel method for the representation of images that is
semantic by nature, addressing the question of computation intelligibility in
computer vision tasks. More specifically, our proposition is to introduce what
we call a semantic bottleneck in the processing pipeline, which is a crossing
point in which the representation of the image is entirely expressed with
natural language , while retaining the efficiency of numerical representations.
We show that our approach is able to generate semantic representations that
give state-of-the-art results on semantic content-based image retrieval and
also perform very well on image classification tasks. Intelligibility is
evaluated through user centered experiments for failure detection
Image Representations and New Domains in Neural Image Captioning
We examine the possibility that recent promising results in automatic caption
generation are due primarily to language models. By varying image
representation quality produced by a convolutional neural network, we find that
a state-of-the-art neural captioning algorithm is able to produce quality
captions even when provided with surprisingly poor image representations. We
replicate this result in a new, fine-grained, transfer learned captioning
domain, consisting of 66K recipe image/title pairs. We also provide some
experiments regarding the appropriateness of datasets for automatic captioning,
and find that having multiple captions per image is beneficial, but not an
absolute requirement.Comment: 11 Pages, 5 Images, To appear at EMNLP 2015's Vision + Learning
worksho
Lessons learned in multilingual grounded language learning
Recent work has shown how to learn better visual-semantic embeddings by
leveraging image descriptions in more than one language. Here, we investigate
in detail which conditions affect the performance of this type of grounded
language learning model. We show that multilingual training improves over
bilingual training, and that low-resource languages benefit from training with
higher-resource languages. We demonstrate that a multilingual model can be
trained equally well on either translations or comparable sentence pairs, and
that annotating the same set of images in multiple language enables further
improvements via an additional caption-caption ranking objective.Comment: CoNLL 201
Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text
Real world multimedia data is often composed of multiple modalities such as
an image or a video with associated text (e.g. captions, user comments, etc.)
and metadata. Such multimodal data packages are prone to manipulations, where a
subset of these modalities can be altered to misrepresent or repurpose data
packages, with possible malicious intent. It is, therefore, important to
develop methods to assess or verify the integrity of these multimedia packages.
Using computer vision and natural language processing methods to directly
compare the image (or video) and the associated caption to verify the integrity
of a media package is only possible for a limited set of objects and scenes. In
this paper, we present a novel deep learning-based approach for assessing the
semantic integrity of multimedia packages containing images and captions, using
a reference set of multimedia packages. We construct a joint embedding of
images and captions with deep multimodal representation learning on the
reference dataset in a framework that also provides image-caption consistency
scores (ICCSs). The integrity of query media packages is assessed as the
inlierness of the query ICCSs with respect to the reference dataset. We present
the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media
packages from Flickr, which we make available to the research community. We use
both the newly created dataset as well as Flickr30K and MS COCO datasets to
quantitatively evaluate our proposed approach. The reference dataset does not
contain unmanipulated versions of tampered query packages. Our method is able
to achieve F1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K and MS COCO,
respectively, for detecting semantically incoherent media packages.Comment: *Ayush Jaiswal and Ekraam Sabir contributed equally to the work in
this pape
- …