Visual Entailment: A Novel Task for Fine-Grained Image Understanding
Existing visual reasoning datasets, such as Visual Question Answering (VQA),
often suffer from biases conditioned on the question, image or answer
distributions. The recently proposed CLEVR dataset addresses these limitations
and requires fine-grained reasoning but the dataset is synthetic and consists
of similar objects and sentence structures across the dataset.
In this paper, we introduce a new inference task, Visual Entailment (VE),
consisting of image-sentence pairs in which the premise is defined by an image,
rather than a natural language sentence as in traditional Textual Entailment
tasks. The goal of a trained VE model is to predict whether the image
semantically entails the text. To realize this task, we build a dataset SNLI-VE
based on the Stanford Natural Language Inference corpus and Flickr30k dataset.
We evaluate various existing VQA baselines and build a model, the Explainable
Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71%
accuracy and outperforms several other state-of-the-art VQA based models.
Finally, we demonstrate the explainability of EVE through cross-modal attention
visualizations. The SNLI-VE dataset is publicly available at
https://github.com/necla-ml/SNLI-VE
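The task setup described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the three-way label set of SNLI (entailment, neutral, contradiction), and the names `VEExample`, `image_id`, and `accuracy` are hypothetical. The predictor is any function that maps an image premise and a text hypothesis to a label.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumed label set, following the SNLI convention.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class VEExample:
    """One image-sentence pair: the image is the premise, the text the hypothesis."""
    image_id: str    # hypothetical identifier of the premise image
    hypothesis: str  # natural language hypothesis
    label: str       # gold label, one of LABELS


def accuracy(predict: Callable[[str, str], str],
             examples: Sequence[VEExample]) -> float:
    """Fraction of examples for which the predicted label equals the gold label."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(predict(ex.image_id, ex.hypothesis) == ex.label
                  for ex in examples)
    return correct / len(examples)
```

A trained VE model would take the place of `predict`; the reported accuracy figures correspond to this kind of per-example label agreement.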
Visual Entailment Task for Visually-Grounded Language Learning
We introduce a new inference task, Visual Entailment (VE), which differs
from traditional Textual Entailment (TE) tasks in that the premise is defined by
an image rather than a natural language sentence. A novel
dataset SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE) is
proposed for VE tasks based on the Stanford Natural Language Inference corpus
and Flickr30k. We introduce a differentiable architecture called the
Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and
several other state-of-the-art visual question answering (VQA) based models are
evaluated on the SNLI-VE dataset, facilitating grounded language understanding
and providing insights on how modern VQA based models perform.
Comment: 4 pages, accepted by Visually Grounded Interaction and Language
(ViGIL) workshop in NeurIPS 201
A Context-aware Attention Network for Interactive Question Answering
Neural network based sequence-to-sequence models in an encoder-decoder
framework have been successfully applied to solve Question Answering (QA)
problems, predicting answers from statements and questions. However, almost all
previous models have failed to consider detailed context information and
unknown states under which systems do not have enough information to answer
given questions. These scenarios with incomplete or ambiguous information are
very common in the setting of Interactive Question Answering (IQA). To address
this challenge, we develop a novel model, employing context-dependent
word-level attention for more accurate statement representations and
question-guided sentence-level attention for better context modeling. We also
generate unique IQA datasets to test our model, which will be made publicly
available. Employing these attention mechanisms, our model accurately
determines when it can output an answer and when it must generate a
supplementary question to request additional input, depending on the context.
When available, user feedback is encoded and directly applied to update
sentence-level attention to infer an answer. Extensive experiments on QA and
IQA datasets quantitatively demonstrate the effectiveness of our model with
significant improvement over state-of-the-art conventional QA models.
Comment: 9 page
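The idea of question-guided sentence-level attention can be illustrated with a generic dot-product attention sketch. This is an assumption-laden simplification, not the model from the paper: sentence and question representations are taken as plain vectors, the scoring function is a dot product, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_attention(
    sentences: Sequence[Sequence[float]],
    question: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Weight each sentence vector by its dot-product similarity to the question.

    Returns the attention weights and the weighted sum of sentence vectors
    (the context vector). Dot-product scoring is an assumption made here.
    """
    scores = [sum(s_d * q_d for s_d, q_d in zip(sent, question))
              for sent in sentences]
    weights = softmax(scores)
    dim = len(question)
    context = [sum(w * sent[d] for w, sent in zip(weights, sentences))
               for d in range(dim)]
    return weights, context
```

Sentences that are more similar to the question receive larger weights, so the context vector is dominated by the statements most relevant to answering it.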
Attend and Interact: Higher-Order Object Interactions for Video Understanding
Human actions often involve complex interactions across several inter-related
objects in the scene. However, existing approaches to fine-grained video
understanding or visual relationship detection often rely on single object
representation or pairwise object relationships. Furthermore, learning
interactions across multiple objects in hundreds of frames for video is
computationally infeasible and performance may suffer since a large
combinatorial space has to be modeled. In this paper, we propose to efficiently
learn higher-order interactions between arbitrary subgroups of objects for
fine-grained video understanding. We demonstrate that modeling object
interactions significantly improves accuracy for both action recognition and
video captioning, while reducing computation by more than a factor of three
relative to traditional pairwise relationships. The proposed method is validated on two
large-scale datasets: Kinetics and ActivityNet Captions. Our SINet and
SINet-Caption achieve state-of-the-art performances on both datasets even
though the videos are sampled at a maximum of 1 FPS. To the best of our
knowledge, this is the first work modeling object interactions on open-domain
large-scale video datasets, and we additionally model higher-order object
interactions, which improves performance at low computational cost.
Comment: CVPR 201