Visual Entailment Task for Visually-Grounded Language Learning
We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) in that the premise is defined by an image rather than a natural language sentence. A novel dataset, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), is proposed for VE tasks, built from the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture, the Explainable Visual Entailment model (EVE), to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights into how modern VQA-based models perform.
Comment: 4 pages, accepted by the Visually Grounded Interaction and Language (ViGIL) workshop at NeurIPS 201
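The VE setup described above can be summarised as a three-way classification over an (image premise, text hypothesis) pair. The following is a minimal illustrative sketch, not EVE itself: the feature sizes, the element-wise fusion, and the random linear classifier are all placeholder assumptions.

```python
import numpy as np

# Toy sketch of the Visual Entailment decision: the premise is an image
# feature vector, the hypothesis a sentence embedding, and a classifier
# maps their fused representation to entailment/neutral/contradiction.
# Fusion scheme and weights are illustrative assumptions, not EVE's.

LABELS = ["entailment", "neutral", "contradiction"]

def classify_ve(image_feat, text_feat, W, b):
    """Score an (image premise, text hypothesis) pair into 3 classes."""
    fused = np.concatenate([image_feat, text_feat,
                            image_feat * text_feat])  # simple fusion
    logits = W @ fused + b
    return LABELS[int(np.argmax(logits))]

rng = np.random.default_rng(0)
d = 8                                  # placeholder feature size
W = rng.normal(size=(3, 3 * d))        # untrained classifier weights
b = np.zeros(3)
label = classify_ve(rng.normal(size=d), rng.normal(size=d), W, b)
print(label)
```

In a real system the features would come from an image encoder and a sentence encoder, and `W`, `b` would be trained on SNLI-VE labels.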
Multimodal and Multilingual Understanding of Smells using VilBERT and mUNITER
We evaluate state-of-the-art multimodal models on detecting common olfactory references in multilingual text and images, within the scope of the Multimodal Understanding of Smells in Texts and Images (MUSTI) task at MediaEval'22. The goal of MUSTI Subtask 1 is to classify paired text and images according to whether they refer to the same smell source. We approach this task as a Visual Entailment problem and evaluate the performance of the English model ViLBERT and the multilingual model mUNITER on MUSTI Subtask 1. Although the base ViLBERT and mUNITER models perform worse than a dummy baseline, fine-tuning these models improves performance significantly in almost all scenarios. We find that fine-tuning mUNITER with SNLI-VE and the MUSTI training data performs better than the other configurations we implemented. Our experiments demonstrate that the task presents some challenges, but it is by no means impossible. Our code is available at https://github.com/Odeuropa/musti-eval-baselines
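Casting the binary MUSTI Subtask 1 onto the three-way Visual Entailment label space could look like the sketch below. The exact mapping used when fine-tuning ViLBERT/mUNITER is an assumption here, not taken from the paper.

```python
# Hypothetical label mapping for treating MUSTI Subtask 1 as Visual
# Entailment: a pair whose text and image refer to the same smell source
# maps to "entailment", otherwise to "contradiction". This mapping is an
# illustrative assumption, not the authors' documented recipe.

def musti_to_ve_label(same_smell_source: bool) -> str:
    """Convert a binary MUSTI label to a VE fine-tuning label."""
    return "entailment" if same_smell_source else "contradiction"

def ve_to_musti(ve_label: str) -> bool:
    """Collapse a 3-way VE prediction back to the binary MUSTI decision."""
    return ve_label == "entailment"

print(musti_to_ve_label(True), ve_to_musti("neutral"))
```

Under this mapping a "neutral" VE prediction at inference time is treated as a negative (different smell source), which is one design choice among several.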
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
Large-scale vision language (VL) models use Transformers to perform
cross-modal interactions between the input text and image. These cross-modal
interactions are computationally expensive and memory-intensive due to the
quadratic complexity of processing the input image and text. We present PuMer:
a token reduction framework that uses text-informed Pruning and modality-aware
Merging strategies to progressively reduce the tokens of input image and text,
improving model inference speed and reducing memory footprint. PuMer learns to
keep salient image tokens related to the input text and merges similar textual
and visual tokens by adding lightweight token reducer modules at several
cross-modal layers in the VL model. Training PuMer is mostly the same as
finetuning the original VL model but faster. Our evaluation for two vision
language models on four downstream VL tasks shows PuMer increases inference
throughput by up to 2x and reduces memory footprint by over 50% while incurring
less than a 1% accuracy drop.
Comment: Accepted to ACL 2023 Main Conference
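The two reduction strategies in the abstract can be sketched as plain array operations: text-informed pruning keeps the image tokens most relevant to the text, and modality-aware merging collapses the most similar pair of tokens. This is a simplified, assumption-laden illustration; PuMer learns these decisions with lightweight modules inside the VL model rather than using fixed rules.

```python
import numpy as np

# Illustrative sketch of text-informed pruning and similarity-based
# merging of image tokens, in the spirit of PuMer. The relevance score
# (dot product with a pooled text vector) and the pairwise-average merge
# are simplified assumptions, not PuMer's learned reducers.

def prune_tokens(image_tokens, text_query, keep_ratio=0.5):
    """Keep the image tokens most relevant to the pooled text vector."""
    scores = image_tokens @ text_query          # (n,) relevance scores
    k = max(1, int(len(image_tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]              # indices of top-k tokens
    return image_tokens[np.sort(keep)]          # preserve token order

def merge_most_similar(tokens):
    """Average the most cosine-similar pair, reducing token count by one."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)              # ignore self-similarity
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    merged = (tokens[i] + tokens[j]) / 2
    rest = np.delete(tokens, [i, j], axis=0)
    return np.vstack([rest, merged[None]])

rng = np.random.default_rng(1)
tokens = rng.normal(size=(16, 4))               # 16 toy image tokens
pruned = prune_tokens(tokens, rng.normal(size=4), keep_ratio=0.5)
reduced = merge_most_similar(pruned)
print(pruned.shape, reduced.shape)              # (8, 4) (7, 4)
```

Applying such reducers progressively at several cross-modal layers is what shrinks the quadratic attention cost, since later layers see fewer tokens.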
Preconditioned Visual Language Inference with Weak Supervision
Humans can infer the affordance of objects by extracting related contextual
preconditions for each scenario. For example, upon seeing an image of a broken
cup, we can infer that this precondition prevents the cup from being used for
drinking. Reasoning with commonsense preconditions has been studied in NLP,
where the model is explicitly given the contextual precondition. However, it is unclear
if SOTA visual language models (VLMs) can extract such preconditions and infer
the affordance of objects with them. In this work, we introduce the task of
preconditioned visual language inference and rationalization (PVLIR). We
propose a learning resource based on three strategies to retrieve weak
supervision signals for the task and develop a human-verified test set for
evaluation. Our results reveal the shortcomings of SOTA VLMs on this task
and draw a road map to address the challenges ahead in improving them.
Visually Grounded Language Learning: a review of language games, datasets, tasks, and models
In recent years, several machine learning models trained with a language
modelling objective on large-scale text-only data have been proposed. With
such pretraining, they can achieve impressive results on many Natural Language
Understanding and Generation tasks. However, many facets of meaning cannot be
learned by "listening to the radio" only. In the literature, many
Vision+Language (V+L) tasks have been defined with the aim of creating models
that can ground symbols in the visual modality. In this work, we provide a
systematic literature review of several tasks and models proposed in the V+L
field. We rely on Wittgenstein's idea of "language games" to categorise such
tasks into three different families: 1) discriminative games, 2) generative games,
and 3) interactive games. Our analysis of the literature provides evidence that
future work should focus on interactive games where communication in
Natural Language is important to resolve ambiguities about object referents and
action plans and that physical embodiment is essential to understand the
semantics of situations and events. Overall, these represent key requirements
for developing grounded meanings in neural models.
Comment: Preprint for JAIR before copyediting