An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild
Zero-shot learning (ZSL) methods have been studied in the unrealistic setting
where test data are assumed to come from unseen classes only. In this paper, we
advocate studying the problem of generalized zero-shot learning (GZSL) where
the test data's class memberships are unconstrained. We show empirically that
naively using the classifiers constructed by ZSL approaches does not perform
well in the generalized setting. Motivated by this, we propose a simple but
effective calibration method that can be used to balance two conflicting
forces: recognizing data from seen classes versus those from unseen ones. We
develop a performance metric to characterize such a trade-off and examine the
utility of this metric in evaluating various ZSL approaches. Our analysis
further shows that there is a large gap between the performance of existing
approaches and an upper bound established via idealized semantic embeddings,
suggesting that improving class semantic embeddings is vital to GZSL.
Comment: ECCV 2016 camera-ready
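The calibration idea described above can be pictured in a few lines: subtract a constant from the scores of seen classes before taking the argmax, then sweep that constant to trade seen-class accuracy against unseen-class accuracy. The NumPy sketch below is one plausible reading of such a calibration rule, with an illustrative score matrix, seen-class mask, and `gamma` range, not necessarily the paper's exact formulation:

```python
import numpy as np

def calibrated_predict(scores, seen_mask, gamma):
    """Pick a class after penalizing seen-class scores by a constant.

    scores:    (num_samples, num_classes) compatibility scores from any ZSL model
    seen_mask: (num_classes,) boolean array, True for seen classes
    gamma:     calibration constant; larger values favor unseen classes
    """
    adjusted = scores - gamma * seen_mask.astype(scores.dtype)
    return adjusted.argmax(axis=1)

# Sweeping gamma traces out the seen-vs-unseen accuracy trade-off; the area
# under that curve is one way to summarize GZSL performance in a single number.
scores = np.random.randn(4, 10)          # illustrative scores for 4 samples
seen_mask = np.arange(10) < 6            # classes 0-5 assumed seen
for gamma in np.linspace(-1.0, 1.0, 5):
    preds = calibrated_predict(scores, seen_mask, gamma)
```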
Weakly Supervised Content Selection for Improved Image Captioning
Image captioning involves identifying semantic concepts in the scene and
describing them in fluent natural language. Recent approaches do not explicitly
model the semantic concepts and train the model only for the end goal of
caption generation. Such models lack interpretability and controllability,
primarily due to sub-optimal content selection. We address this problem by
breaking down the captioning task into two simpler, manageable and more
controllable tasks -- skeleton prediction and skeleton-based caption
generation. We approach the former as a weakly supervised task, using a simple
off-the-shelf language syntax parser and avoiding the need for additional human
annotations; the latter uses a supervised-learning approach. We investigate
three methods of conditioning the caption on the skeleton: in the encoder, in
the decoder, and in both. Our compositional model generates significantly
higher-quality captions on out-of-domain test images, as judged by human
annotators.
Additionally, we demonstrate that English skeletons transfer effectively to
caption generation in other languages, including French, Italian, German,
Spanish, and Hindi. This compositional formulation also points toward unpaired
image captioning, reducing the dependence on expensive image-caption pairs.
Furthermore, we investigate the use of skeletons as a knob to control certain
properties of the generated caption, such as length, content, and gender
expression.
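As a rough illustration of the weakly supervised skeleton step, an off-the-shelf parser can reduce a caption to its content words without any extra annotation. The sketch below assumes the skeleton is the sequence of noun-chunk head lemmas, which may differ from the paper's definition; spaCy and its `en_core_web_sm` model stand in for the unspecified syntax parser:

```python
import spacy  # off-the-shelf syntax parser; assumes `en_core_web_sm` is installed

nlp = spacy.load("en_core_web_sm")

def caption_skeleton(caption):
    """Reduce a caption to a content-word skeleton (here: noun-chunk heads).

    The paper's actual skeleton definition may differ; this only illustrates
    how weak supervision can come from a parser instead of human annotators.
    """
    doc = nlp(caption)
    return [chunk.root.lemma_ for chunk in doc.noun_chunks]

print(caption_skeleton("A young girl is playing with a brown dog in the park"))
# e.g. -> ['girl', 'dog', 'park']
```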
What You See is What You Read? Improving Text-Image Alignment Evaluation
Automatically determining whether a text and a corresponding image are
semantically aligned is a significant challenge for vision-language models,
with applications in generative text-to-image and image-to-text tasks. In this
work, we study methods for automatic text-image alignment evaluation. We first
introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets
from both text-to-image and image-to-text generation tasks, with human
judgements of whether a given text-image pair is semantically aligned. We then
describe two automatic methods to determine alignment: the first involving a
pipeline based on question generation and visual question answering models, and
the second employing an end-to-end classification approach by finetuning
multimodal pretrained models. Both methods surpass prior approaches in various
text-image alignment tasks, with significant improvements in challenging cases
that involve complex composition or unnatural images. Finally, we demonstrate
how our approaches can localize specific misalignments between an image and a
given text, and how they can be used to automatically re-rank candidates in
text-to-image generation.
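The first method can be pictured as a small pipeline: generate questions from the text, answer each question once from the text and once from the image, and score alignment by agreement. A hedged sketch, where `gen_questions`, `answer_from_text`, and `answer_from_image` are placeholder callables standing in for pretrained QG, QA, and VQA models, and mean agreement is one plausible aggregation rather than the paper's exact scoring:

```python
def qg_vqa_alignment(text, image, gen_questions, answer_from_text, answer_from_image):
    """Score text-image alignment via question generation + VQA.

    gen_questions(text)         -> list of questions about the text (QG model)
    answer_from_text(q, text)   -> answer grounded in the text (QA model)
    answer_from_image(q, image) -> answer grounded in the image (VQA model)
    All three callables are placeholders for pretrained models.
    """
    questions = gen_questions(text)
    if not questions:
        return 0.0
    agreements = sum(
        answer_from_text(q, text).strip().lower()
        == answer_from_image(q, image).strip().lower()
        for q in questions
    )
    return agreements / len(questions)
```

Because each question is answered independently, disagreements also localize which part of the text the image fails to depict, matching the misalignment-localization use described above.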
PreSTU: Pre-Training for Scene-Text Understanding
The ability to recognize and reason about text embedded in visual inputs is
often lacking in vision-and-language (V&L) models, perhaps because V&L
pre-training objectives rarely target this ability explicitly. In this paper,
we propose PreSTU, a novel pre-training
recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware
pre-training objectives that encourage the model to recognize text from an
image and connect it to the rest of the image content. We implement PreSTU
using a simple transformer-based encoder-decoder architecture, combined with
large-scale image-text datasets with scene text obtained from an off-the-shelf
OCR system. We empirically demonstrate the effectiveness of this pre-training
approach on eight visual question answering and four image captioning
benchmarks.
Comment: Accepted to ICCV 2023
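One way to picture an OCR-aware objective is a split-and-predict task: the model is shown the image together with a prefix of its OCR transcript and must generate the remaining scene text. The sketch below is an assumption-laden illustration of such an example constructor, with hypothetical field names and prompt formatting, not the paper's exact recipe:

```python
import random

def make_ocr_pretraining_example(image_id, ocr_tokens, rng=random):
    """Build one OCR-aware pre-training pair: the model sees the image plus a
    prefix of its OCR transcript and must generate the remaining scene text.

    The split-and-predict formulation is one plausible reading of an
    'OCR-aware objective'; the exact recipe in the paper may differ.
    """
    split = rng.randrange(len(ocr_tokens) + 1)
    prompt = "read remaining text: " + " ".join(ocr_tokens[:split])
    target = " ".join(ocr_tokens[split:])
    return {"image": image_id, "input_text": prompt, "target_text": target}

# OCR tokens would come from an off-the-shelf OCR system run over the image.
example = make_ocr_pretraining_example("img_0042", ["OPEN", "24", "HOURS"])
```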