Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets
A wide range of image captioning models has been developed, achieving
significant improvements on popular metrics such as BLEU, CIDEr, and
SPICE. However, although the generated captions can accurately describe the
image, they are often generic across similar images and lack distinctiveness,
i.e., they cannot properly describe the uniqueness of each image. In this paper, we aim to
improve the distinctiveness of image captions through training with sets of
similar images. First, we propose a distinctiveness metric, between-set CIDEr
(CIDErBtw), to evaluate the distinctiveness of a caption with respect to the
captions of similar images. Our metric reveals that the human annotations of
each image are not equally distinctive. Thus we propose several new
training strategies to encourage the distinctiveness of the generated caption
for each image, which are based on using CIDErBtw in a weighted loss function
or as a reinforcement learning reward. Finally, extensive experiments are
conducted, showing that our proposed approach significantly improves both
distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy
(e.g., as measured by CIDEr) for a wide variety of image captioning baselines.
These results are further confirmed through a user study.
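To make the metric concrete, here is a minimal sketch of the CIDErBtw idea, assuming a generic CIDEr scorer is available. The `cider` callable, the inverse weighting, and all names are illustrative assumptions, not the paper's exact formulation; a lower CIDErBtw means the caption overlaps little with captions of similar images, i.e., it is more distinctive.

```python
from typing import Callable, Dict, List


def cider_btw(
    candidate: str,
    similar_refs: Dict[str, List[str]],          # similar image id -> its reference captions
    cider: Callable[[str, List[str]], float],    # assumed: CIDEr of one caption vs. references
) -> float:
    """Average CIDEr between the candidate and each similar image's references."""
    scores = [cider(candidate, refs) for refs in similar_refs.values()]
    return sum(scores) / len(scores)


def distinctiveness_weights(
    gt_captions: List[str],
    similar_refs: Dict[str, List[str]],
    cider: Callable[[str, List[str]], float],
) -> List[float]:
    """Weight ground-truth captions inversely to their CIDErBtw so that more
    distinctive annotations dominate a weighted training loss (one plausible
    reading of the weighted-loss strategy, not the paper's exact formula)."""
    btw = [cider_btw(c, similar_refs, cider) for c in gt_captions]
    inv = [1.0 / (b + 1e-8) for b in btw]
    total = sum(inv)
    return [w / total for w in inv]
```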
How to Describe Images in a More Funny Way? Towards a Modular Approach to Cross-Modal Sarcasm Generation
Sarcasm generation has been investigated in previous studies by considering
it as a text-to-text generation problem, i.e., generating a sarcastic sentence
for an input sentence. In this paper, we study a new problem of cross-modal
sarcasm generation (CMSG), i.e., generating a sarcastic description for a given
image. CMSG is challenging because the generated text must exhibit the
characteristics of sarcasm while remaining correlated with the image. In
addition, sarcasm calls for some inconsistency between the two modalities,
which requires imagination. Moreover, high-quality training data is scarce. To address
these problems, we take a step toward generating sarcastic descriptions from
images without paired training data and propose an
Extraction-Generation-Ranking based Modular method (EGRM) for cross-modal
sarcasm generation. Specifically, EGRM first extracts diverse information from
an image at different levels and uses the obtained image tags, sentimental
descriptive caption, and commonsense-based consequence to generate candidate
sarcastic texts. Then, a comprehensive ranking algorithm, which considers
image-text relation, sarcasticness, and grammaticality, is proposed to select a
final text from the candidate texts. Human evaluation on five criteria over a
total of 1,200 generated image-text pairs from eight systems, together with
auxiliary automatic evaluation, shows the superiority of our method.
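As a rough illustration of the ranking step, the sketch below combines the three criteria as a weighted sum and keeps the best candidate. The three scorer callables and the equal default weights are assumptions for illustration, not the paper's actual ranking algorithm.

```python
from typing import Callable, List


def rank_candidates(
    image: object,
    candidates: List[str],
    relation: Callable[[object, str], float],   # assumed image-text relation scorer
    sarcasm: Callable[[str], float],            # assumed sarcasticness scorer
    grammar: Callable[[str], float],            # assumed grammaticality scorer
    weights: tuple = (1.0, 1.0, 1.0),
) -> str:
    """Return the candidate text with the highest combined score."""
    w_rel, w_sar, w_gra = weights

    def score(text: str) -> float:
        return (w_rel * relation(image, text)
                + w_sar * sarcasm(text)
                + w_gra * grammar(text))

    return max(candidates, key=score)
```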
A-CAP: Anticipation Captioning with Commonsense Knowledge
Humans possess the capacity to reason about the future based on a sparse
collection of visual cues acquired over time. In order to emulate this ability,
we introduce a novel task called Anticipation Captioning, which generates a
caption for an unseen oracle image using a sparsely temporally-ordered set of
images. To tackle this new task, we propose a model called A-CAP, which
incorporates commonsense knowledge into a pre-trained vision-language model,
allowing it to anticipate the caption. Through both qualitative and
quantitative evaluations on a customized visual storytelling dataset, A-CAP
outperforms other image captioning methods and establishes a strong baseline
for anticipation captioning. We also address the challenges inherent in this task.
Comment: Accepted to CVPR 202
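The abstract does not detail the architecture; purely to illustrate the shape of the pipeline it describes, the following hypothetical sketch conditions a pre-trained vision-language captioner on commonsense inferences drawn from the observed image sequence. Every name and interface here is an assumption.

```python
from typing import List


def anticipate_caption(
    images: List[object],       # sparse, temporally ordered observations
    commonsense_model,          # assumed: infers likely next events from an image
    vl_captioner,               # assumed: pre-trained vision-language captioner
) -> str:
    """Caption an unseen future image from a sparse, ordered set of observations."""
    # Draw commonsense inferences about what plausibly happens next.
    inferences = [commonsense_model.infer(img) for img in images]
    # Condition the captioner on both the observed images and the inferences
    # so it can describe the unseen next image.
    return vl_captioner.generate(images=images, knowledge=inferences)
```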
FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions
Image captioning is a central task in computer vision that has experienced
substantial progress following the advent of vision-language pre-training
techniques. In this paper, we highlight a frequently overlooked limitation of
captioning models: they often fail to capture semantically significant elements.
This drawback can be traced back to the text-image datasets; while their
captions typically offer a general depiction of image content, they frequently
omit salient details. To mitigate this limitation, we propose FuseCap - a novel
method for enriching captions with additional visual information, obtained from
vision experts, such as object detectors, attribute recognizers, and Optical
Character Recognizers (OCR). Our approach fuses the outputs of such vision
experts with the original caption using a large language model (LLM), yielding
enriched captions that present a comprehensive image description. We validate
the effectiveness of the proposed caption enrichment method through both
quantitative and qualitative analysis. Our method is then used to curate the
training set of a BLIP-based captioning model, which surpasses current
state-of-the-art approaches in generating accurate and detailed captions while
using significantly fewer parameters and training data. As additional
contributions, we provide a dataset comprising 12M image-enriched caption
pairs and show that the proposed method substantially improves image-text retrieval.
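As a minimal sketch of the fusion step, assuming a generic text-in, text-out LLM interface: the prompt wording below is invented for illustration and is not FuseCap's actual prompt.

```python
from typing import Callable, List


def fuse_caption(
    original_caption: str,
    detected_objects: List[str],
    attributes: List[str],
    ocr_text: List[str],
    llm: Callable[[str], str],   # assumed: prompt in, completion out
) -> str:
    """Merge vision-expert outputs into the original caption via an LLM."""
    prompt = (
        "Rewrite the caption as a single fluent sentence that also covers "
        "the extra visual details.\n"
        f"Caption: {original_caption}\n"
        f"Objects: {', '.join(detected_objects)}\n"
        f"Attributes: {', '.join(attributes)}\n"
        f"Text in image: {', '.join(ocr_text)}\n"
        "Enriched caption:"
    )
    return llm(prompt)
```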
Automatic Image Captioning with Style
This thesis connects two core topics in machine learning: vision
and language. The problem of choice is image caption generation:
automatically constructing natural language descriptions of image
content. Previous research into image caption generation has
focused on generating purely descriptive captions; I focus on
generating visually relevant captions with a distinct linguistic
style. Captions with style have the potential to ease
communication and add a new layer of personalisation.
First, I consider naming variations in image captions, and
propose a method for predicting context-dependent names that
takes into account visual and linguistic information. This method
makes use of a large-scale image caption dataset, which I also
use to explore naming conventions, reporting them
for hundreds of animal classes. Next I propose the SentiCap
model, which relies on recent advances in artificial neural
networks to generate visually relevant image captions with
positive or negative sentiment. To balance descriptiveness and
sentiment, the SentiCap model dynamically switches between two
recurrent neural networks, one tuned for descriptive words and
one for sentiment words. As the first published model for
generating captions with sentiment, SentiCap has influenced a
number of subsequent works.
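A toy sketch of the switching idea follows: a learned soft gate mixes the states of a descriptive GRU stream and a sentiment GRU stream at each decoding step. The dimensions and gating form are assumptions chosen for brevity, and image conditioning is omitted; the published SentiCap model differs in its details.

```python
import torch
import torch.nn as nn


class SwitchingCaptioner(nn.Module):
    """Word-level soft switch between a descriptive and a sentiment RNN."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.desc_rnn = nn.GRUCell(embed_dim, hidden_dim)   # tuned for descriptive words
        self.sent_rnn = nn.GRUCell(embed_dim, hidden_dim)   # tuned for sentiment words
        self.gate = nn.Linear(2 * hidden_dim, 1)            # decides which stream to trust
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, h_desc, h_sent):
        x = self.embed(word_ids)
        h_desc = self.desc_rnn(x, h_desc)
        h_sent = self.sent_rnn(x, h_sent)
        g = torch.sigmoid(self.gate(torch.cat([h_desc, h_sent], dim=-1)))
        logits = self.out(g * h_sent + (1 - g) * h_desc)    # soft switch between streams
        return logits, h_desc, h_sent
```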
I then investigate the sub-task of modelling styled sentences without images. The specific task
chosen is sentence simplification: rewriting news article
sentences to make them easier to understand.
For this task I design a neural sequence-to-sequence model that can work with
limited training data, using novel adaptations for word copying and sharing
word embeddings. Finally, I present SemStyle, a system for generating visually
relevant image captions in the style of an arbitrary text corpus. A shared term
space allows a neural network for vision and content planning to communicate
with a network for styled language generation. SemStyle achieves competitive
results in human and automatic evaluations of descriptiveness and style.
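To illustrate the word-copying adaptation, here is a generic pointer-style decoder step that mixes generating from the vocabulary with copying words from the source sentence, which helps when training data is limited. This is a standard pointer-mechanism sketch, not the thesis's exact model; all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CopyDecoderStep(nn.Module):
    """One decoding step mixing vocabulary generation with source copying."""

    def __init__(self, vocab_size: int, hidden_dim: int = 512):
        super().__init__()
        self.out = nn.Linear(hidden_dim, vocab_size)   # generate from the vocabulary
        self.copy_gate = nn.Linear(hidden_dim, 1)      # decide generate vs. copy

    def forward(self, dec_state, enc_states, src_ids):
        # dec_state: (batch, hidden); enc_states: (batch, src_len, hidden)
        # src_ids:   (batch, src_len) vocabulary ids of the source words
        gen_probs = F.softmax(self.out(dec_state), dim=-1)
        attn = F.softmax(
            torch.bmm(enc_states, dec_state.unsqueeze(-1)).squeeze(-1), dim=-1
        )
        # Scatter attention mass onto the vocabulary ids of the source words.
        copy_probs = torch.zeros_like(gen_probs).scatter_add(-1, src_ids, attn)
        p_copy = torch.sigmoid(self.copy_gate(dec_state))
        return p_copy * copy_probs + (1 - p_copy) * gen_probs
```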
As a whole, this thesis presents two complete systems for styled
caption generation that are the first of their kind and demonstrate,
for the first time, that automatic style transfer for image
captions is achievable. Contributions also include novel ideas
for object naming and sentence simplification. This thesis opens
up inquiries into highly personalised image captions; large-scale
visually grounded concept naming; and, more generally, styled text
generation with content control.