Visual Entailment Task for Visually-Grounded Language Learning
We introduce a new inference task - Visual Entailment (VE) - which differs
from traditional Textual Entailment (TE) tasks in that the premise is defined by
an image, rather than a natural language sentence as in TE tasks. A novel
dataset SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE) is
proposed for VE tasks based on the Stanford Natural Language Inference corpus
and Flickr30k. We introduce a differentiable architecture called the
Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and
several other state-of-the-art visual question answering (VQA) based models are
evaluated on the SNLI-VE dataset, facilitating grounded language understanding
and providing insights on how modern VQA based models perform.
Comment: 4 pages, accepted by Visually Grounded Interaction and Language
(ViGIL) workshop in NeurIPS 201
Weakly-supervised Visual Grounding of Phrases with Linguistic Structures
We propose a weakly-supervised approach that takes image-sentence pairs as
input and learns to visually ground (i.e., localize) arbitrary linguistic
phrases, in the form of spatial attention masks. Specifically, the model is
trained with images and their associated image-level captions, without any
explicit region-to-phrase correspondence annotations. To this end, we introduce
an end-to-end model which learns visual groundings of phrases with two types of
carefully designed loss functions. In addition to the standard discriminative
loss, which enforces that attended image regions and phrases are consistently
encoded, we propose a novel structural loss which makes use of the parse tree
structures induced by the sentences. In particular, we ensure complementarity
among the attention masks that correspond to sibling noun phrases, and
compositionality of attention masks among the children and parent phrases, as
defined by the sentence parse tree. We validate the effectiveness of our
approach on the Microsoft COCO and Visual Genome datasets.
Comment: CVPR 201
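The two structural constraints described in this abstract can be illustrated with a minimal sketch. This is not the authors' formulation: the abstract does not give the loss equations, so the penalty forms below (pairwise overlap for complementarity, squared distance to a clipped sum for compositionality) and the names `complementarity_penalty` and `compositionality_penalty` are assumptions chosen only to make the idea concrete. Attention masks are assumed to be flattened lists of values in [0, 1].

```python
from typing import List

# Assumption: a spatial attention mask flattened to a list of values in [0, 1].
Mask = List[float]


def complementarity_penalty(siblings: List[Mask]) -> float:
    """Sum of pairwise overlaps between sibling masks (0 when they are disjoint)."""
    total = 0.0
    for i in range(len(siblings)):
        for j in range(i + 1, len(siblings)):
            total += sum(a * b for a, b in zip(siblings[i], siblings[j]))
    return total


def compositionality_penalty(parent: Mask, children: List[Mask]) -> float:
    """Squared distance between the parent mask and the clipped sum of its children."""
    union = [min(1.0, sum(values)) for values in zip(*children)]
    return sum((p - u) ** 2 for p, u in zip(parent, union))


# Toy parse tree: a parent phrase with two sibling noun phrases.
left = [1.0, 0.0, 0.0, 0.0]
right = [0.0, 1.0, 0.0, 0.0]
parent = [1.0, 1.0, 0.0, 0.0]

disjoint = complementarity_penalty([left, right])
overlapping = complementarity_penalty([left, left])
composed = compositionality_penalty(parent, [left, right])
```

In this toy case the disjoint siblings incur no complementarity penalty, and the parent mask that covers both children incurs no compositionality penalty, while two identical sibling masks are penalized for overlapping.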
Visual Entailment: A Novel Task for Fine-Grained Image Understanding
Existing visual reasoning datasets, such as Visual Question Answering (VQA),
often suffer from biases conditioned on the question, image or answer
distributions. The recently proposed CLEVR dataset addresses these limitations
and requires fine-grained reasoning, but it is synthetic and consists
of similar objects and sentence structures throughout.
In this paper, we introduce a new inference task, Visual Entailment (VE),
consisting of image-sentence pairs in which the premise is defined by an image,
rather than a natural language sentence as in traditional Textual Entailment
tasks. The goal of a trained VE model is to predict whether the image
semantically entails the text. To realize this task, we build a dataset SNLI-VE
based on the Stanford Natural Language Inference corpus and Flickr30k dataset.
We evaluate various existing VQA baselines and build a model, the Explainable
Visual Entailment (EVE) system, to address the VE task. EVE achieves up to 71%
accuracy and outperforms several other state-of-the-art VQA based models.
Finally, we demonstrate the explainability of EVE through cross-modal attention
visualizations. The SNLI-VE dataset is publicly available at
https://github.com/necla-ml/SNLI-VE
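As a rough illustration of the task setup, the sketch below represents a VE example as an image premise paired with a text hypothesis and scores a predictor by accuracy. The three labels follow the SNLI convention that SNLI-VE builds on; the `VEExample` class, its field names, the image identifiers, and the `always_entailment` baseline are hypothetical and do not reflect the actual SNLI-VE file format or the EVE model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Label set inherited from SNLI.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass
class VEExample:
    image_id: str    # hypothetical field: identifier of the premise image
    hypothesis: str  # natural language hypothesis
    label: str       # one of LABELS


def accuracy(examples: List[VEExample],
             predict: Callable[[str, str], str]) -> float:
    """Fraction of examples for which the predictor returns the gold label."""
    if not examples:
        return 0.0
    correct = sum(
        1 for ex in examples if predict(ex.image_id, ex.hypothesis) == ex.label
    )
    return correct / len(examples)


def always_entailment(image_id: str, hypothesis: str) -> str:
    """Trivial baseline that ignores both the image and the text."""
    return "entailment"


# Made-up examples for illustration only.
examples = [
    VEExample("img_0001", "Two people are outdoors.", "entailment"),
    VEExample("img_0001", "The people are siblings.", "neutral"),
    VEExample("img_0002", "A dog is sleeping indoors.", "contradiction"),
    VEExample("img_0002", "An animal is running.", "entailment"),
]

baseline_accuracy = accuracy(examples, always_entailment)
```

A real model would replace `always_entailment` with a function that actually looks at the image and the hypothesis; the accuracy computation stays the same.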