IRFL: Image Recognition of Figurative Language
Figures of speech such as metaphors, similes, and idioms allow language to be
expressive, invoke emotion, and communicate abstract ideas that might otherwise
be difficult to visualize. These figurative forms are often conveyed through
multiple modes, such as text and images, and frequently appear in advertising,
news, social media, etc. Understanding multimodal figurative language is an
essential component of human communication, and it plays a significant role in
our daily interactions. While humans can intuitively understand multimodal
figurative language, this remains a challenging task for machines, requiring
cognitive abilities such as cross-domain mapping, abstraction, commonsense, and
deep language and cultural knowledge. In this work, we propose the Image
Recognition of Figurative Language dataset to examine vision and language
models' understanding of figurative language. We leverage human annotation and
an automatic pipeline we created to generate a multimodal dataset and introduce
two novel tasks as a benchmark for multimodal figurative understanding. We
experiment with several baseline models and find that all perform substantially
worse than humans. We hope our dataset and benchmark will drive the development
of models that can better understand figurative language.
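The two benchmark tasks are defined in the paper itself; the sketch below is only an
illustrative example of the kind of zero-shot baseline such a dataset can probe,
matching a figurative phrase to candidate images with CLIP. The model name, file
paths, and matching setup are assumptions, not the IRFL protocol.

    # Illustrative only: a zero-shot CLIP probe for matching a figurative phrase
    # to candidate images. Not the IRFL benchmark protocol.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    phrase = "spill the beans"  # idiom whose literal and figurative readings differ
    candidates = [Image.open(p) for p in ["img_a.jpg", "img_b.jpg", "img_c.jpg"]]  # placeholder files

    inputs = processor(text=[phrase], images=candidates, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text: similarity of the phrase to each candidate image
    probs = out.logits_per_text.softmax(dim=-1)
    print("best match:", int(probs.argmax()))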
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
Weird, unusual, and uncanny images pique the curiosity of observers because
they challenge common sense. For example, an image released during the 2022
World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo
playing chess, which playfully violates our expectation that their competition
should occur on the football field. Humans can easily recognize and interpret
these unconventional images, but can AI models do the same? We introduce
WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset
comprises purposefully commonsense-defying images created by designers using
publicly-available image generation tools like Midjourney. We consider several
tasks posed over the dataset. In addition to image captioning, cross-modal
matching, and visual question answering, we introduce a difficult explanation
generation task, where models must identify and explain why a given image is
unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2
still lag behind human performance on WHOOPS!. We hope our dataset will inspire
the development of AI models with stronger visual commonsense reasoning
abilities. Data, models, and code are available at the project website:
whoops-benchmark.github.io.
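As an illustration of the explanation-generation task, the sketch below prompts
BLIP-2 (via Hugging Face Transformers) to describe what is unusual about an image.
The prompt wording and file path are assumptions; this is not the benchmark's
evaluation code.

    # Rough sketch: ask BLIP-2 to explain why an image is unusual.
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    name = "Salesforce/blip2-opt-2.7b"
    processor = Blip2Processor.from_pretrained(name)
    model = Blip2ForConditionalGeneration.from_pretrained(name)

    image = Image.open("weird_image.jpg")  # placeholder path
    prompt = "Question: What is unusual about this image? Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=60)
    print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())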
VideoCon: Robust Video-Language Alignment via Contrast Captions
Despite being (pre)trained on a massive amount of data, state-of-the-art
video-language alignment models are not robust to semantically plausible
contrastive changes in the video captions. Our work addresses this by
identifying a broad spectrum of contrast misalignments, such as replacing
entities or actions and flipping event order, which alignment models should be
robust against. To this end, we introduce VideoCon, a video-language
alignment dataset constructed by a large language model that generates
plausible contrast video captions and explanations for differences between
original and contrast video captions. Then, a generative video-language model
is finetuned with VideoCon to assess video-language entailment and generate
explanations. Our VideoCon-based alignment model significantly outperforms
current models. It exhibits a 12-point increase in AUC for the video-language
alignment task on human-generated contrast captions. Finally, our model sets
new state-of-the-art zero-shot performance in temporally extensive
video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video
question answering (ATP-Hard). Moreover, our model shows superior performance
on novel videos and human-crafted captions and explanations. Our code and data
are available at https://github.com/Hritikbansal/videocon.
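A toy sketch of the idea (not the authors' pipeline): a contrast caption is
produced by flipping event order, and a robust alignment model should prefer the
original caption for the same video. `alignment_score` below is a hypothetical
placeholder for the finetuned video-language entailment model; VideoCon itself
uses an LLM to write contrast captions and explanations.

    # Illustrative sketch of the VideoCon idea; see hedges above.
    def flip_event_order(caption: str) -> str:
        """Toy contrast generator: swap the two clauses around ', then '."""
        first, _, second = caption.partition(", then ")
        return f"{second}, then {first}" if second else caption

    def alignment_score(video_path: str, caption: str) -> float:  # hypothetical
        raise NotImplementedError("plug in a video-language entailment model here")

    caption = "a man opens the fridge, then pours a glass of milk"
    contrast = flip_event_order(caption)
    print(contrast)  # "pours a glass of milk, then a man opens the fridge"

    # A robust alignment model should score the original caption well above the
    # contrast caption for the same video:
    # alignment_score("video.mp4", caption) > alignment_score("video.mp4", contrast)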
VASR: Visual Analogies of Situation Recognition
A core process in human cognition is analogical mapping: the ability to
identify a similar relational structure between different situations. We
introduce a novel task, Visual Analogies of Situation Recognition, adapting the
classical word-analogy task into the visual domain. Given a triplet of images,
the task is to select an image candidate B' that completes the analogy (A to A'
is like B to what?). Unlike previous work on visual analogy that focused on
simple image transformations, we tackle complex analogies requiring
understanding of scenes.
We leverage situation recognition annotations and the CLIP model to generate
a large set of 500k candidate analogies. Crowdsourced annotations for a sample
of the data indicate that humans agree with the dataset label ~80% of the time
(chance level 25%). Furthermore, we use human annotations to create a
gold-standard dataset of 3,820 validated analogies. Our experiments demonstrate
that state-of-the-art models do well when distractors are chosen randomly
(~86%), but struggle with carefully chosen distractors (~53%, compared to 90%
human accuracy). We hope our dataset will encourage the development of new
analogy-making models. Accepted to AAAI 2023. Website: https://vasr-dataset.github.io/
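One simple way to approach the task, shown below as an illustrative baseline
rather than one of the models evaluated in the paper, is CLIP embedding
arithmetic: pick the candidate closest to emb(A') - emb(A) + emb(B). Model name
and file paths are placeholders.

    # Illustrative CLIP-arithmetic baseline for A : A' :: B : ?
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(paths):
        images = [Image.open(p) for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return torch.nn.functional.normalize(feats, dim=-1)

    a, a_prime, b = embed(["A.jpg", "A_prime.jpg", "B.jpg"])  # placeholder files
    candidates = embed(["cand1.jpg", "cand2.jpg", "cand3.jpg", "cand4.jpg"])

    target = torch.nn.functional.normalize(a_prime - a + b, dim=-1)
    scores = candidates @ target
    print("predicted B':", int(scores.argmax()))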
Transferring Visual Attributes from Natural Language to Verified Image Generation
Text-to-image (T2I) generation methods are widely used to generate art
and other creative artifacts. While visual hallucinations can be a positive
factor in scenarios where creativity is appreciated, such artifacts are poorly
suited for cases where the generated image needs to be grounded in complex
natural language without explicit visual elements. In this paper, we propose to
strengthen the consistency property of T2I methods in the presence of natural
complex language, which often breaks the limits of T2I methods by including
non-visual information, and textual elements that require knowledge for
accurate generation. To address these phenomena, we propose a Natural Language
to Verified Image generation approach (NL2VI) that converts a natural prompt
into a visual prompt, which is more suitable for image generation. A T2I model
then generates an image from the visual prompt, and the result is verified with
VQA algorithms. Experimentally, aligning natural prompts with image generation can
improve the consistency of the generated images by up to 11% over the state of
the art. Moreover, improvements can generalize to challenging domains like
cooking and DIY tasks, where the correctness of the generated image is crucial
for illustrating actions.
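A schematic sketch of the pipeline as described above; every helper function is a
hypothetical placeholder, not the authors' implementation.

    # Schematic NL2VI-style pipeline: natural prompt -> visual prompt -> image -> VQA check.
    def to_visual_prompt(natural_prompt: str) -> str:
        """Rewrite a knowledge-heavy natural prompt into a concrete visual prompt
        (e.g. with an instruction-tuned LLM); placeholder."""
        raise NotImplementedError

    def generate_image(visual_prompt: str):
        """Call any text-to-image model (e.g. Stable Diffusion); placeholder."""
        raise NotImplementedError

    def vqa_answer(image, question: str) -> str:
        """Answer a question about the image with a VQA model; placeholder."""
        raise NotImplementedError

    def nl2vi(natural_prompt: str, checks: list[str]):
        visual_prompt = to_visual_prompt(natural_prompt)
        image = generate_image(visual_prompt)
        # Verification step: every check question must be answered "yes",
        # otherwise the image is rejected (or regenerated).
        verified = all(vqa_answer(image, q).lower().startswith("yes") for q in checks)
        return image, verified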
What You See is What You Read? Improving Text-Image Alignment Evaluation
Automatically determining whether a text and a corresponding image are
semantically aligned is a significant challenge for vision-language models,
with applications in generative text-to-image and image-to-text tasks. In this
work, we study methods for automatic text-image alignment evaluation. We first
introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets
from both text-to-image and image-to-text generation tasks, with human
judgements for whether a given text-image pair is semantically aligned. We then
describe two automatic methods to determine alignment: the first involving a
pipeline based on question generation and visual question answering models, and
the second employing an end-to-end classification approach by finetuning
multimodal pretrained models. Both methods surpass prior approaches in various
text-image alignment tasks, with significant improvements in challenging cases
that involve complex composition or unnatural images. Finally, we demonstrate
how our approaches can localize specific misalignments between an image and a
given text, and how they can be used to automatically re-rank candidates in
text-to-image generation.
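The first method can be sketched as follows; `generate_questions` and
`vqa_yes_probability` are hypothetical placeholders for a question-generation
model and a VQA model, and the aggregation shown is illustrative only.

    # Sketch of question-generation + VQA alignment scoring.
    def generate_questions(text: str) -> list[str]:
        """Turn a caption into yes/no verification questions; placeholder."""
        raise NotImplementedError

    def vqa_yes_probability(image, question: str) -> float:
        """Probability that a VQA model answers 'yes'; placeholder."""
        raise NotImplementedError

    def alignment_score(image, text: str) -> float:
        questions = generate_questions(text)
        if not questions:
            return 0.0
        # Aggregate per-question scores; a low-scoring question both lowers the
        # overall score and localizes the likely misalignment.
        return sum(vqa_yes_probability(image, q) for q in questions) / len(questions)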
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for
evaluation of instruction-following vision-language models for real-world use.
Our starting point is curating 70 'instruction families' that we envision
instruction-tuned vision-language models should be able to address. Extending
beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to
game playing and creative generation. Following curation, our dataset comprises
592 test queries, each with a human-authored instruction-conditioned caption.
These descriptions surface instruction-specific factors, e.g., for an
instruction asking about the accessibility of a storefront for wheelchair
users, the instruction-conditioned caption describes ramps/potential obstacles.
These descriptions enable 1) collecting human-verified reference outputs for
each instance; and 2) automatic evaluation of candidate multimodal generations
using a text-only LLM, aligning with human judgment. We quantify quality gaps
between models and references using both human and automatic evaluations; e.g.,
the top-performing instruction-following model wins against the GPT-4 reference
in just 27% of comparisons. VisIT-Bench is dynamic: to participate,
practitioners simply submit their model's responses on the project website.
Data, code, and leaderboard are available at visit-bench.github.io.
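A rough sketch of reference-based evaluation with a text-only LLM judge; the
`llm` function and prompt wording are assumptions, not the benchmark's exact
template.

    # Sketch: a text-only LLM compares a candidate response against the
    # human-verified reference, given the instruction-conditioned caption.
    def llm(prompt: str) -> str:  # placeholder for any text-only LLM API
        raise NotImplementedError

    def judge(instruction: str, conditioned_caption: str,
              reference: str, candidate: str) -> bool:
        prompt = (
            "You are judging responses to a visual instruction.\n"
            f"Instruction: {instruction}\n"
            f"Image description: {conditioned_caption}\n"
            f"Response A (reference): {reference}\n"
            f"Response B (candidate): {candidate}\n"
            "Which response better follows the instruction? Answer 'A' or 'B'."
        )
        return llm(prompt).strip().upper().startswith("B")  # True if candidate wins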
DataComp: In search of the next generation of multimodal datasets
Multimodal datasets are a critical component in recent breakthroughs such as
Stable Diffusion and GPT-4, yet their design does not receive the same research
attention as model architectures or training algorithms. To address this
shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset
experiments centered around a new candidate pool of 12.8 billion image-text
pairs from Common Crawl. Participants in our benchmark design new filtering
techniques or curate new data sources and then evaluate their new dataset by
running our standardized CLIP training code and testing the resulting model on
38 downstream test sets. Our benchmark consists of multiple compute scales
spanning four orders of magnitude, which enables the study of scaling trends
and makes the benchmark accessible to researchers with varying resources. Our
baseline experiments show that the DataComp workflow leads to better training
sets. In particular, our best baseline, DataComp-1B, enables training a CLIP
ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming
OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training
procedure and compute. We release DataComp and all accompanying code at
www.datacomp.ai.
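As an illustration of the kind of filtering baseline a participant might try,
the sketch below keeps an image-text pair only if its CLIP similarity exceeds a
threshold; the model choice and threshold value are illustrative, not the
DataComp-1B recipe.

    # Sketch of a simple CLIP-score filter over candidate image-text pairs.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def keep_pair(image_path: str, caption: str, threshold: float = 0.28) -> bool:
        """Keep a pair only if CLIP cosine similarity exceeds the threshold."""
        inputs = processor(text=[caption], images=Image.open(image_path),
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
            img = torch.nn.functional.normalize(out.image_embeds, dim=-1)
            txt = torch.nn.functional.normalize(out.text_embeds, dim=-1)
        return float((img * txt).sum()) >= threshold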